
Benchmark LLM on Apple Silicon

This post benchmarks the token generation speed (inference performance) of large language models (LLMs) on Apple Silicon. The question I am trying to answer: what is the best LLM that runs on a MacBook Pro with an M1 Pro chip and 16 GB of memory?

Prompt: write a long story about little red riding hood

| Software | Model | Tokens / second |
| --- | --- | --- |
| Ollama | gemma3:4b (a2af6cc3eb7f) | 34.64 |
| Ollama | gemma3:12b (f4031aab637d) | 13.92 |
| LM Studio | mlx-community/gemma-3-12b-it-4bit (*) | 17.02 |
| Ollama | deepseek-r1:14b (ea35dfe18182) | 12.87 |
| Ollama | phi4-mini:3.8b-q4_K_M (78fad5d182a7) | 35.09 |
| Ollama | phi4:14b-q4_K_M (ac896e5b8b34) | 12.75 |
| LM Studio | mlx-community/phi-4-4bit (*) | 19.65 |

(*) an average of three runs.
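
The post does not say how the tokens-per-second figures were collected. As a minimal sketch of one way to compute such a figure, assume the numbers come from Ollama's generation metrics: its generate API reports `eval_count` (the number of generated tokens) and `eval_duration` (the generation time in nanoseconds). The helper names `tokens_per_second` and `average_rate` are hypothetical and only illustrate the arithmetic, including the three-run average used for the starred rows.

```python
from statistics import mean
from typing import Iterable, Tuple


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from a token count and a duration in nanoseconds.

    Assumption: the inputs correspond to the `eval_count` and
    `eval_duration` fields that Ollama reports for a finished generation.
    """
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def average_rate(runs: Iterable[Tuple[int, int]]) -> float:
    """Mean tokens/second over several (eval_count, eval_duration_ns) runs."""
    return mean(tokens_per_second(count, duration) for count, duration in runs)
```

For example, a run that generates 1,000 tokens in 20 seconds gives `tokens_per_second(1000, 20_000_000_000)`, which is 50.0 tokens per second.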

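The Ollama models are not in MLX format. The MLX builds from mlx-community, run in LM Studio, are faster for the same base model: 17.02 vs. 13.92 tokens per second for Gemma 3 12B, and 19.65 vs. 12.75 for Phi-4 14B.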
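
To put a number on "faster", the sketch below computes the speedup of each MLX build over its Ollama counterpart, using only the figures from the table above. The dictionary keys and the `speedup` helper are hypothetical names chosen for illustration. Note the comparison is not strictly like-for-like, since the quantization formats of the two builds may differ.

```python
# Tokens/second copied from the results table above.
results = {
    "Gemma 3 12B": {"ollama": 13.92, "mlx": 17.02},
    "Phi-4 14B": {"ollama": 12.75, "mlx": 19.65},
}


def speedup(ollama_rate: float, mlx_rate: float) -> float:
    """Ratio of MLX throughput to Ollama throughput (above 1.0 means MLX is faster)."""
    return mlx_rate / ollama_rate


for model, rates in results.items():
    ratio = speedup(rates["ollama"], rates["mlx"])
    print(f"{model}: MLX is {ratio:.2f}x the Ollama speed")
```

On these numbers the MLX builds come out at roughly 1.22x (Gemma 3 12B) and 1.54x (Phi-4 14B) the Ollama speed.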

Environment

  • MacBook Pro M1 Pro
    • CPU: 8 cores (6 performance and 2 efficiency)
    • GPU: 14 cores
    • Memory: 16 GB
    • macOS Sequoia 15.3.2
  • Ollama version 0.6.3
  • LM Studio v0.3.14
    • Metal llama.cpp v1.23.1
    • LM Studio MLX v0.11.1

Official Benchmark Results

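| Benchmark | Metric | DeepSeek R1 Distill Qwen 14B | DeepSeek R1 Distill Llama 8B | Phi-4 (14B) | Gemma 3 PT 12B | Gemma 3 PT 4B |
| --- | --- | --- | --- | --- | --- | --- |
| AAII | | 49 | 34 | 40 | 34 | 24 |
| AGIEval | 3-5-shot | | | | 57.4 | 42.1 |
| AI2D | | | | | 75.2 | 63.2 |
| AIME 2024 | cons@64 | 80 | | | | |
| AIME 2024 | pass@1 | 69.7 | | | | |
| AlignBench v1.1 | | | | | | |
| ARC-c | 25-shot | | | | 68.9 | 56.2 |
| ARC-e | 0-shot | | | | 88.3 | 82.4 |
| Arena-Hard | | | | | | |
| BIG-Bench Hard | few-shot | | | | 72.6 | 50.9 |
| BLINK | | | | | 35.9 | 38 |
| BoolQ | 0-shot | | | | 78.8 | 72.3 |
| ChartQA | | | | | 74.7 | 63.6 |
| COCOcap | | | | | 111 | 102 |
| CodeForces | rating | 1481 | | | | |
| CountBenchQA | | | | | 17.8 | 26.1 |
| DocVQA (val) | | | | | 82.3 | 72.8 |
| DROP | 1-shot | | | 75.5 | 72.2 | 60.1 |
| ECLeKTic | | | | | 17.2 | 11 |
| FloRes | | | | | 46 | 39.2 |
| Global-MMLU-Lite | | | | | 69.4 | 57 |
| GPQA | | | | | | |
| GPQA | 5-shot | | | | 25.4 | 15 |
| GPQA | Diamond pass@1 | 59.1 | | 56.1 | | |
| GSM8K | | | | | | |
| GSM8K | 8-shot | | | | 71 | 38.4 |
| HellaSwag | 10-shot | | | | 84.2 | 77.2 |
| HumanEval | 0-shot | | | | 45.7 | 36 |
| HumanEval | | | | 82.6 | | |
| IFEval | strict-prompt | | | | | |
| IndicGenBench | | | | | 61.7 | 57.2 |
| InfoVQA (val) | | | | | 54.8 | 44.1 |
| LiveBench 0831 | | | | | | |
| LiveCodeBench | pass@1 | 53.1 | | | | |
| LiveCodeBench 2305-2409 | | | | | | |
| MATH | 4-shot | | | | 43.3 | 24.2 |
| MATH | | | | 80.4 | | |
| MATH-500 | pass@1 | 93.9 | | | | |
| MBPP | 3-shot | | | | 60.4 | 46 |
| MGSM | | | | 80.6 | 64.3 | 34.7 |
| MMLU-redux | | | | | | |
| MMLU | | | | 84.8 | 74.5 | 59.6 |
| MMLU (Pro COT) | 5-shot | | | | 45.3 | 29.2 |
| MMMU (pt) | | | | | 50.3 | 39.2 |
| MultiPL-E | | | | | | |
| MT-bench | | | | | | |
| Natural Questions | 5-shot | | | | 31.4 | 20 |
| OKVQA | | | | | 58.7 | 51 |
| PIQA | 0-shot | | | | 81.8 | 79.6 |
| RealWorldQA | | | | | 52.2 | 45.5 |
| ReMI | | | | | 38.5 | 27.3 |
| SimpleQA | | | | 3 | | |
| SocialIQA | 0-shot | | | | 53.4 | 51.9 |
| SpatialSense VQA | | | | | 60 | 50.9 |
| TallyQA | | | | | 51.8 | 42.5 |
| TextVQA (val) | | | | | 66.5 | 58.9 |
| TriviaQA | 5-shot | | | | 78.2 | 65.8 |
| VQAv2 | | | | | 71.2 | 63.9 |
| WinoGrande | 5-shot | | | | 74.3 | 64.7 |
| WMT24++ (ChrF) | | | | | 53.9 | 48.4 |
| XQuAD (all) | | | | | 74.5 | 68 |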

Sources: