Benchmarking LLMs on Apple Silicon
This post benchmarks the token generation (inference) speed of large language models (LLMs) on Apple Silicon, trying to answer one question: which LLM runs best on an Apple MacBook Pro with an M1 Pro chip and 16 GB of memory?
Prompt: "write a long story about little red riding hood"
Software | Model | Tokens / second |
---|---|---|
Ollama | gemma3:4b (a2af6cc3eb7f) | 34.64 |
Ollama | gemma3:12b (f4031aab637d) | 13.92 |
LM Studio | mlx-community/gemma-3-12b-it-4bit (*) | 17.02 |
Ollama | deepseek-r1:14b (ea35dfe18182) | 12.87 |
Ollama | phi4-mini:3.8b-q4_K_M (78fad5d182a7) | 35.09 |
Ollama | phi4:14b-q4_K_M (ac896e5b8b34) | 12.75 |
LM Studio | mlx-community/phi-4-4bit (*) | 19.65 |
(*) Average of three runs.
The Ollama models are GGUF builds, not MLX format; the MLX conversions published by mlx-community run noticeably faster on Apple Silicon.
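How can these tokens-per-second figures be measured? Ollama prints them directly when invoked as `ollama run <model> --verbose` (the `eval rate` line), and LM Studio shows them in its chat UI after each response. For a scripted comparison, the sketch below (an illustration, not the exact script behind the table above) reads the same counters (`eval_count`, `eval_duration`) from Ollama's local REST API, using a model and the prompt from the runs above.

```python
# Minimal sketch: measure Ollama generation speed via its local REST API.
# Assumes Ollama is running on the default port (11434) and the model is already pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gemma3:4b"  # any model from the table above
PROMPT = "write a long story about little red riding hood"

def tokens_per_second(model: str, prompt: str) -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return stats["eval_count"] / stats["eval_duration"] * 1e9

if __name__ == "__main__":
    # Average over three runs, as in the (*) results above.
    runs = [tokens_per_second(MODEL, PROMPT) for _ in range(3)]
    print(f"{MODEL}: {sum(runs) / len(runs):.2f} tokens/s")
```

LM Studio exposes a similar OpenAI-compatible local server (default `http://localhost:1234/v1`), so the same idea works there by timing the response and dividing by the reported `usage.completion_tokens`.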
Environment #
- MacBook Pro with Apple M1 Pro chip
- CPU: 8 cores (6 performance, 2 efficiency)
- GPU: 14 cores
- Memory: 16 GB
- macOS Sequoia 15.3.2
- Ollama 0.6.3
- LM Studio 0.3.14
- LM Studio llama.cpp (Metal) runtime v1.23.1
- LM Studio MLX runtime v0.11.1
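The hardware and software versions above can be read from the command line. The sketch below is one way to capture them on Apple Silicon macOS (the `hw.perflevel0`/`hw.perflevel1` sysctl keys report performance and efficiency core counts; the GPU core count is shown separately by `system_profiler SPDisplaysDataType`).

```python
# One way to collect the environment details listed above on Apple Silicon macOS.
import subprocess

def sh(cmd: list[str]) -> str:
    """Run a command and return its trimmed stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print("Chip:", sh(["sysctl", "-n", "machdep.cpu.brand_string"]))  # e.g. Apple M1 Pro
print("Performance cores:", sh(["sysctl", "-n", "hw.perflevel0.physicalcpu"]))
print("Efficiency cores:", sh(["sysctl", "-n", "hw.perflevel1.physicalcpu"]))
print("Memory (GB):", int(sh(["sysctl", "-n", "hw.memsize"])) // 2**30)
print("macOS:", sh(["sw_vers", "-productVersion"]))
print("Ollama:", sh(["ollama", "--version"]))
# GPU core count: see `system_profiler SPDisplaysDataType` ("Total Number of Cores").
```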
Official Benchmark Results #
Benchmark | Metric | DeepSeek R1 Distill Qwen 14B | DeepSeek R1 Distill Llama 8B | Phi-4 (14B) | Gemma 3 PT 12B | Gemma 3 PT 4B |
---|---|---|---|---|---|---|
AAII | | 49 | 34 | 40 | 34 | 24 |
AGIEval | 3-5-shot | | | | 57.4 | 42.1 |
AI2D | | | | | 75.2 | 63.2 |
AIME 2024 | cons@64 | 80 | | | | |
AIME 2024 | pass@1 | 69.7 | | | | |
AlignBench v1.1 | | | | | | |
ARC-c | 25-shot | | | | 68.9 | 56.2 |
ARC-e | 0-shot | | | | 88.3 | 82.4 |
Arena-Hard | | | | | | |
BIG-Bench Hard | few-shot | | | | 72.6 | 50.9 |
BLINK | | | | | 35.9 | 38 |
BoolQ | 0-shot | | | | 78.8 | 72.3 |
ChartQA | | | | | 74.7 | 63.6 |
COCOcap | | | | | 111 | 102 |
CodeForces | rating | 1481 | | | | |
CountBenchQA | | | | | 17.8 | 26.1 |
DocVQA (val) | | | | | 82.3 | 72.8 |
DROP | 1-shot | | | 75.5 | 72.2 | 60.1 |
ECLeKTic | | | | | 17.2 | 11 |
FloRes | | | | | 46 | 39.2 |
Global-MMLU-Lite | | | | | 69.4 | 57 |
GPQA | | | | | | |
GPQA | 5-shot | | | | 25.4 | 15 |
GPQA | Diamond pass@1 | 59.1 | | 56.1 | | |
GSM8K | | | | | | |
GSM8K | 8-shot | | | | 71 | 38.4 |
HellaSwag | 10-shot | | | | 84.2 | 77.2 |
HumanEval | 0-shot | | | | 45.7 | 36 |
HumanEval | | | | 82.6 | | |
IFEval | strict-prompt | | | | | |
IndicGenBench | | | | | 61.7 | 57.2 |
InfoVQA (val) | | | | | 54.8 | 44.1 |
LiveBench 0831 | | | | | | |
LiveCodeBench | pass@1 | 53.1 | | | | |
LiveCodeBench 2305-2409 | | | | | | |
MATH | 4-shot | | | | 43.3 | 24.2 |
MATH | | | | 80.4 | | |
MATH-500 | pass@1 | 93.9 | | | | |
MBPP | 3-shot | | | | 60.4 | 46 |
MGSM | | | | 80.6 | 64.3 | 34.7 |
MMLU-redux | | | | | | |
MMLU | | | | 84.8 | 74.5 | 59.6 |
MMLU (Pro COT) | 5-shot | | | | 45.3 | 29.2 |
MMMU (pt) | | | | | 50.3 | 39.2 |
MultiPL-E | | | | | | |
MT-bench | | | | | | |
Natural Questions | 5-shot | | | | 31.4 | 20 |
OKVQA | | | | | 58.7 | 51 |
PIQA | 0-shot | | | | 81.8 | 79.6 |
RealWorldQA | | | | | 52.2 | 45.5 |
ReMI | | | | | 38.5 | 27.3 |
SimpleQA | | | | 3 | | |
SocialIQA | 0-shot | | | | 53.4 | 51.9 |
SpatialSense VQA | | | | | 60 | 50.9 |
TallyQA | | | | | 51.8 | 42.5 |
TextVQA (val) | | | | | 66.5 | 58.9 |
TriviaQA | 5-shot | | | | 78.2 | 65.8 |
VQAv2 | | | | | 71.2 | 63.9 |
WinoGrande | 5-shot | | | | 74.3 | 64.7 |
WMT24++ (ChrF) | | | | | 53.9 | 48.4 |
XQuAD (all) | | | | | 74.5 | 68 |
Sources:
- Artificial Analysis Intelligence Index (AAII): https://artificialanalysis.ai/
- DeepSeek R1 Distill Qwen 14B: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-14B
- DeepSeek R1 Distill Llama 8B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Gemma 3 PT 12B: https://huggingface.co/google/gemma-3-12b-pt
- Gemma 3 PT 4B: https://huggingface.co/google/gemma-3-4b-pt
- Phi-4 (14B): https://huggingface.co/microsoft/phi-4