Benchmarking LLMs on Apple Silicon

We benchmark the token generation speed, i.e. inference performance, of large language models (LLMs) on Apple Silicon. The question I am trying to answer: what is the best LLM that runs on an Apple MacBook Pro with an M1 Pro chip and 16 GB of memory?

Prompt: write a long story about little red riding hood

| Software | Model | Tokens / second |
| --- | --- | --- |
| Ollama | gemma3:4b (a2af6cc3eb7f) | 34.64 |
| Ollama | gemma3:12b (f4031aab637d) | 13.92 |
| LM Studio | mlx-community/gemma-3-12b-it-4bit (*) | 17.02 |
| Ollama | deepseek-r1:14b (ea35dfe18182) | 12.87 |
| Ollama | phi4-mini:3.8b-q4_K_M (78fad5d182a7) | 35.09 |
| Ollama | phi4:14b-q4_K_M (ac896e5b8b34) | 12.75 |
| LM Studio | mlx-community/phi-4-4bit (*) | 19.65 |

(*) Average of three runs.
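For reproducibility, here is a minimal sketch of how such tokens-per-second figures can be measured against Ollama's local REST API. It assumes Ollama is serving on its default port 11434 and uses the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that `/api/generate` returns when streaming is disabled:

```python
import json
import urllib.request

PROMPT = "write a long story about little red riding hood"

def tokens_per_second(model: str, runs: int = 3) -> float:
    """Average token generation speed over several runs of the same prompt."""
    speeds = []
    for _ in range(runs):
        req = urllib.request.Request(
            "http://localhost:11434/api/generate",
            data=json.dumps(
                {"model": model, "prompt": PROMPT, "stream": False}
            ).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        # eval_count = tokens generated; eval_duration = generation time in ns
        speeds.append(body["eval_count"] / body["eval_duration"] * 1e9)
    return sum(speeds) / len(speeds)

if __name__ == "__main__":
    print(f"gemma3:4b: {tokens_per_second('gemma3:4b'):.2f} tokens/s")
```

The same statistic is also printed by `ollama run <model> --verbose` as the "eval rate" line.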

The Ollama models are not in MLX format. The MLX conversions published by mlx-community run noticeably faster.
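To try the MLX builds outside LM Studio, the same mlx-community checkpoints can be loaded with the `mlx-lm` Python package (`pip install mlx-lm`). A minimal sketch; `verbose=True` makes `generate` print its own tokens-per-second figure:

```python
from mlx_lm import load, generate

# Fetches the 4-bit MLX checkpoint from the mlx-community org on Hugging Face
model, tokenizer = load("mlx-community/phi-4-4bit")

generate(
    model,
    tokenizer,
    prompt="write a long story about little red riding hood",
    max_tokens=512,
    verbose=True,  # prints the completion plus prompt/generation token speeds
)
```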

Environment #

- MacBook Pro with M1 Pro chip
  - CPU: 8 cores (6 performance + 2 efficiency)
  - GPU: 14 cores
  - 16 GB unified memory
  - macOS Sequoia 15.3.2
- Ollama v0.6.3
- LM Studio v0.3.14
  - Metal llama.cpp v1.23.1
  - LM Studio MLX v0.11.1

Official Benchmark Results #

| Benchmark | Metric | DeepSeek R1 Distill Qwen 14B | DeepSeek R1 Distill Llama 8B | Phi-4 (14B) | Gemma 3 PT 12B | Gemma 3 PT 4B |
| --- | --- | --- | --- | --- | --- | --- |
| AAII | | 49 | 34 | 40 | 34 | 24 |
| AGIEval | 3-5-shot | | | | 57.4 | 42.1 |
| AI2D | | | | | 75.2 | 63.2 |
| AIME 2024 | cons@64 | 80 | | | | |
| AIME 2024 | pass@1 | 69.7 | | | | |
| AlignBench v1.1 | | | | | | |
| ARC-c | 25-shot | | | | 68.9 | 56.2 |
| ARC-e | 0-shot | | | | 88.3 | 82.4 |
| Arena-Hard | | | | | | |
| BIG-Bench Hard | few-shot | | | | 72.6 | 50.9 |
| BLINK | | | | | 35.9 | 38 |
| BoolQ | 0-shot | | | | 78.8 | 72.3 |
| ChartQA | | | | | 74.7 | 63.6 |
| COCOcap | | | | | 111 | 102 |
| CodeForces | rating | 1481 | | | | |
| CountBenchQA | | | | | 17.8 | 26.1 |
| DocVQA (val) | | | | | 82.3 | 72.8 |
| DROP | 1-shot | | | 75.5 | 72.2 | 60.1 |
| ECLeKTic | | | | | 17.2 | 11 |
| FloRes | | | | | 46 | 39.2 |
| Global-MMLU-Lite | | | | | 69.4 | 57 |
| GPQA | | | | | | |
| GPQA | 5-shot | | | | 25.4 | 15 |
| GPQA Diamond | pass@1 | 59.1 | | 56.1 | | |
| GSM8K | | | | | | |
| GSM8K | 8-shot | | | | 71 | 38.4 |
| HellaSwag | 10-shot | | | | 84.2 | 77.2 |
| HumanEval | 0-shot | | | | 45.7 | 36 |
| HumanEval | | | | 82.6 | | |
| IFEval | strict-prompt | | | | | |
| IndicGenBench | | | | | 61.7 | 57.2 |
| InfoVQA (val) | | | | | 54.8 | 44.1 |
| LiveBench | 0831 | | | | | |
| LiveCodeBench | pass@1 | 53.1 | | | | |
| LiveCodeBench | 2305-2409 | | | | | |
| MATH | 4-shot | | | | 43.3 | 24.2 |
| MATH | | | | 80.4 | | |
| MATH-500 | pass@1 | 93.9 | | | | |
| MBPP | 3-shot | | | | 60.4 | 46 |
| MGSM | | | | 80.6 | 64.3 | 34.7 |
| MMLU-redux | | | | | | |
| MMLU | | | | 84.8 | 74.5 | 59.6 |
| MMLU (Pro COT) | 5-shot | | | | 45.3 | 29.2 |
| MMMU (pt) | | | | | 50.3 | 39.2 |
| MultiPL-E | | | | | | |
| MT-bench | | | | | | |
| Natural Questions | 5-shot | | | | 31.4 | 20 |
| OKVQA | | | | | 58.7 | 51 |
| PIQA | 0-shot | | | | 81.8 | 79.6 |
| RealWorldQA | | | | | 52.2 | 45.5 |
| ReMI | | | | | 38.5 | 27.3 |
| SimpleQA | | | | 3 | | |
| SocialIQA | 0-shot | | | | 53.4 | 51.9 |
| SpatialSense VQA | | | | | 60 | 50.9 |
| TallyQA | | | | | 51.8 | 42.5 |
| TextVQA (val) | | | | | 66.5 | 58.9 |
| TriviaQA | 5-shot | | | | 78.2 | 65.8 |
| VQAv2 | | | | | 71.2 | 63.9 |
| WinoGrande | 5-shot | | | | 74.3 | 64.7 |
| WMT24++ (ChrF) | | | | | 53.9 | 48.4 |
| XQuAD (all) | | | | | 74.5 | 68 |

Empty cells: no official figure reported for that model.

Sources: