When NVIDIA announces a new GPU, they brag about the number of TFLOPS, but what does that actually mean?
FLOPS, or floating-point operations per second, measures how many calculations a processor can perform each second on floating point numbers. It's particularly important for tasks like scientific simulations, AI training, and graphics rendering.
Varieties Of FLOPS
It’s confusing to see FLOPS and TFLOPS thrown around without a clear distinction.
The different prefixes denote different scales:
FLOPS: Base unit (Floating Point Operations Per Second)
KFLOPS: Kilo (thousand) FLOPS - 10³ FLOPS
MFLOPS: Mega (million) FLOPS - 10⁶ FLOPS
GFLOPS: Giga (billion) FLOPS - 10⁹ FLOPS
TFLOPS: Tera (trillion) FLOPS - 10¹² FLOPS
When you see FLOPS prefixed with a letter, that letter is just a metric prefix denoting a power-of-ten multiple of FLOPS.
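To make the scales concrete, here is a tiny Python sketch (the helper name format_flops is made up purely for illustration) that converts a raw FLOPS figure into the most readable unit:

```python
# Metric prefixes as powers of ten, largest first.
PREFIXES = [
    ("TFLOPS", 1e12),  # tera: trillion
    ("GFLOPS", 1e9),   # giga: billion
    ("MFLOPS", 1e6),   # mega: million
    ("KFLOPS", 1e3),   # kilo: thousand
]

def format_flops(flops):
    """Format a raw FLOPS count using the largest prefix that fits."""
    for name, scale in PREFIXES:
        if flops >= scale:
            return f"{flops / scale:.1f} {name}"
    return f"{flops:.1f} FLOPS"

print(format_flops(9.7e12))  # -> "9.7 TFLOPS"
```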
What Is A Floating Point Operation?
A floating point operation refers to a mathematical calculation performed on floating point numbers - numbers that have a decimal point that can "float" to different positions (like 123.45 or 1.2345×10²).
The basic floating point operations include:
Addition and subtraction
Multiplication
Division
Square root
Each of these counts as one floating point operation. So when we see something like "9.7 TFLOPS" for the NVIDIA A100 GPU, it means it can theoretically perform 9.7 trillion of these basic operations per second.
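As a quick illustration of how operations are counted, a dot product of two length-n vectors costs roughly 2n floating point operations (one multiply and one add per element). A minimal Python sketch:

```python
def dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y  # one multiply + one add = 2 FLOPs per element
    return total

# A length-1000 dot product performs roughly 2,000 floating point operations.
print(dot([1.0] * 1000, [2.0] * 1000))  # -> 2000.0
```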
Note: Modern GPUs can perform a fused multiply-add (FMA) as a single instruction; it counts as two floating point operations, which affects how peak FLOPS figures are calculated.
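This is also where the headline numbers come from: peak FLOPS is roughly the number of floating point units, times two operations per FMA, times the clock rate. As a rough sanity check, plugging the A100's published double-precision figures (108 SMs, 32 FP64 units per SM, about a 1.41 GHz boost clock) into that formula reproduces the quoted 9.7 TFLOPS:

```python
def peak_flops(num_units, ops_per_cycle, clock_hz):
    """Theoretical peak: units * ops per cycle per unit * cycles per second."""
    return num_units * ops_per_cycle * clock_hz

# A100 double-precision (non-Tensor-Core): 108 SMs x 32 FP64 units each,
# 2 ops per fused multiply-add, ~1.41 GHz boost clock.
fp64_peak = peak_flops(108 * 32, 2, 1.41e9)
print(f"{fp64_peak / 1e12:.1f} TFLOPS")  # -> ~9.7 TFLOPS
```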
Considering Floating Point “Precision”
You may see floating point precision discussed with respect to GPU performance:
Example: The A100 GPU is rated at 9.7 TFLOPS for 64-bit double-precision but 19.5 TFLOPS for 32-bit single-precision (and higher still for lower-precision Tensor Core math).
To understand this, consider that computers can store floating point numbers at different precisions. For example, if you’re storing a rounded value of π, 3.141 takes up less memory than 3.1415927.
This is why different precisions exist:
Half-precision (16-bit): Less storage, less precise, faster calculations
Example: 3.141 (about 3-4 digits)
Single-precision (32-bit): Medium storage, medium precision
Example: 3.1415927 (about 7-8 digits)
Double-precision (64-bit): More storage, more precise, slower calculations
Example: 3.141592653589793 (about 15-17 digits)
For example, video games often use single-precision, while scientific simulations might require double-precision.
The need for different levels of precision explains why the A100 GPU shows different FLOPS - it can do many more operations per second when working with smaller, less precise numbers (half-precision) compared to larger, more precise ones (double-precision).
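You can see the digit loss directly with NumPy: storing π at each precision and printing it back shows roughly how many digits survive at each width.

```python
import numpy as np

# The same value of pi stored at three different precisions.
print(np.float16(np.pi))  # ~3.14              (half-precision, ~3-4 significant digits)
print(np.float32(np.pi))  # ~3.1415927         (single-precision, ~7-8 significant digits)
print(np.float64(np.pi))  # ~3.141592653589793 (double-precision, ~15-17 significant digits)
```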
FLOPS Measurements Are Theoretical Upper Limits
These are not necessarily application speeds; they are merely the raw speeds that the execution resources can potentially support in these chips.
It's like having a car with a top speed of 200 mph - while that's the peak capability, you'll rarely if ever actually drive at that speed in practical conditions.
Why GPUs Can't Always Hit Peak FLOPS
Why can’t a processor simply run at its theoretical limit all the time?
The biggest limiting factor is often memory bandwidth - how fast data can be fed to the processor. Even if a GPU can do calculations extremely quickly, it might be sitting idle waiting for data to arrive from memory.
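The "roofline" model captures this trade-off in one line: achievable throughput is the smaller of the compute peak and the memory bandwidth multiplied by the kernel's arithmetic intensity (FLOPs performed per byte moved). Here is a small sketch with illustrative numbers, not a measurement of any particular chip:

```python
def achievable_flops(peak_flops, mem_bandwidth_bytes_per_s, flops_per_byte):
    """Roofline estimate: limited by compute or by memory, whichever is lower."""
    return min(peak_flops, mem_bandwidth_bytes_per_s * flops_per_byte)

# Illustrative numbers: ~10 TFLOPS compute peak, ~2 TB/s memory bandwidth.
# A kernel doing only 0.25 FLOPs per byte moved is memory-bound at ~0.5 TFLOPS,
# far below the compute peak.
print(achievable_flops(10e12, 2e12, 0.25) / 1e12, "TFLOPS")  # -> 0.5 TFLOPS
```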
Another reason is that real applications need to do more than just multiply matrices.
Real applications include overhead for:
Thread management
Memory management
Program control flow
Operating system operations
These operations don't contribute to the floating-point calculation throughput but take up time.
It's similar to how a car's top speed (peak performance) is rarely achievable in real driving conditions due to factors like traffic, road conditions, weather, and the need to brake and turn - the real-world application speed is typically much lower than the theoretical peak capability.
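One way to see the gap for yourself is to time a large matrix multiplication and divide the operation count (about 2n³ for an n×n matmul) by the elapsed time; the achieved figure is typically well below whatever peak the hardware running NumPy's BLAS advertises. A rough sketch:

```python
import time
import numpy as np

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b  # single-precision matrix multiply
elapsed = time.perf_counter() - start

achieved = 2 * n**3 / elapsed  # ~2*n^3 FLOPs for an n x n matmul
print(f"Achieved roughly {achieved / 1e9:.1f} GFLOPS")
```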