Provide benchmark with throughput units (GFlops/s TFlops/s) #26

Open
mratsim opened this issue Apr 20, 2024 · 1 comment

mratsim commented Apr 20, 2024

Hello fellow gemm optimizer enthusiast,

It would be extremely useful to provide benchmark utilities that report throughput, ideally in GFlop/s or TFlop/s, so results can be compared with other frameworks, with the CPU's theoretical peak throughput, and with LINPACK.

The formula for MxK multiplied by KxN matrices is:

  • total required operations: M*K*N*2 (the factor of 2 counts 1 mul and 1 add)
  • divided by time taken
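
The formula above can be sketched in a few lines of Python. This is only an illustration, not part of either project: the function names (`gemm_flops`, `gflops_per_second`, `time_best_of`) are hypothetical, and taking the best of several runs is one common timing convention, assumed here rather than taken from the issue.

```python
import time


def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOP count for an (M x K) times (K x N) product: 1 mul + 1 add per term."""
    return 2 * m * k * n


def gflops_per_second(m: int, k: int, n: int, seconds: float) -> float:
    """Throughput in GFlop/s: total operations divided by time taken."""
    return gemm_flops(m, k, n) / seconds / 1e9


def time_best_of(fn, repeats: int = 5) -> float:
    """Shortest wall-clock time (in seconds) over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best
```

For instance, a 1000x1000 by 1000x1000 product that takes 0.01 s corresponds to `gflops_per_second(1000, 1000, 1000, 0.01)`, i.e. about 200 GFlop/s.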

Additionally you might want to check the required data to derive arithmetic intensity for the roofline model:

  • required data: M*K+K*N
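
A minimal sketch of how arithmetic intensity and a roofline bound could be derived from these two quantities. The function names, the 4-byte (float32) element size, and the bandwidth parameter are assumptions for illustration. The data count follows the formula above and covers only the two input matrices; counting the M*N output as well would be a variation.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 4) -> float:
    """FLOP per byte, using 2*M*K*N operations and M*K + K*N input elements."""
    flops = 2 * m * k * n
    bytes_moved = (m * k + k * n) * bytes_per_element  # float32 assumed by default
    return flops / bytes_moved


def roofline_bound_gflops(intensity: float, peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Roofline model: attainable GFlop/s is capped by compute peak or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gb_s)
```

With M = K = N = 1000 in float32 this gives 250 FLOP/byte, so on a machine with a hypothetical 50 GB/s of memory bandwidth the kernel is compute-bound for any peak below 12500 GFlop/s.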

Finally, you might also want to compute your theoretical peak, for example: https:/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/gemm_bench_config.nim#L5-L18

const
  CpuGhz = 3.5      # i9-9980XE OC All turbo 4.1GHz (AVX2 4.0GHz, AVX512 3.5GHz)
  NumCpuCores = 18
  VectorWidth = 16  # 8 float32 for AVX2, 16 for AVX512
  InstrCycle = 2    # How many instructions per cycle, (2xFMAs or 1xFMA for example)
  FlopInstr = 2     # How many FLOP per instr (FMAs = 1 add + 1 mul)

  TheoSerialPeak* = CpuGhz * VectorWidth * InstrCycle * FlopInstr
  TheoThreadedPeak* = TheoSerialPeak * NumCpuCores
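
As a worked example, the same constants can be evaluated in Python to see the numbers they produce. The constant values are copied from the config above. The `efficiency` helper is a hypothetical addition showing how a measured result might be compared against the peak.

```python
CPU_GHZ = 3.5          # AVX512 all-core clock from the config above
NUM_CPU_CORES = 18
VECTOR_WIDTH = 16      # float32 lanes per AVX512 register
INSTR_PER_CYCLE = 2    # 2 FMA issued per cycle
FLOP_PER_INSTR = 2     # an FMA counts as 1 add + 1 mul

# GHz * lanes * instr/cycle * FLOP/instr gives GFlop/s per core.
theo_serial_peak = CPU_GHZ * VECTOR_WIDTH * INSTR_PER_CYCLE * FLOP_PER_INSTR
theo_threaded_peak = theo_serial_peak * NUM_CPU_CORES


def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak actually reached by a benchmark."""
    return measured_gflops / peak_gflops


print(f"serial peak:   {theo_serial_peak} GFlop/s")    # 224.0
print(f"threaded peak: {theo_threaded_peak} GFlop/s")  # 4032.0
```

With these values the single-core peak is 224 GFlop/s and the 18-core peak is 4032 GFlop/s, so a multithreaded run measured at 2016 GFlop/s would sit at 50% of peak.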

FYI, you might be interested in my own research in cache utilization tuning, though skimming a bit I see that you tuned at the cache associativity-level while I used some heuristics:

Benchmarks in my own implementation+OpenMP and OpenBLAS/MKL and MKL-DNN (Latest oneDNN was too entangled to extract the relevant GEMM primitives):

Benchmarks with my own multithreading runtime (instead of OpenMP)

@sarah-quinones
Owner

Thanks for the suggestion. I'll set up something for that soon.
