Benchmarking LLMs on Apple Silicon
This post benchmarks the token generation (inference) speed of large language models (LLMs) on Apple Silicon, trying to answer one question: which LLM runs best on an Apple MacBook Pro with an M1 Pro chip and 16 GB of memory?
Prompt: "write a long story about little red riding hood"
Software | Model | Tokens / second |
---|---|---|
Ollama | gemma3:4b (a2af6cc3eb7f) | 34.64 |
Ollama | gemma3:12b (f4031aab637d) | 13.92 |
LM Studio | mlx-community/gemma-3-12b-it-4bit (*) | 17.02 |
Ollama | deepseek-r1:14b (ea35dfe18182) | 12.87 |
Ollama | phi4-mini:3.8b-q4_K_M (78fad5d182a7) | 35.09 |
Ollama | phi4:14b-q4_K_M (ac896e5b8b34) | 12.75 |
LM Studio | mlx-community/phi-4-4bit (*) | 19.65 |
(*) Average of three runs.
The Ollama models are GGUF builds, not MLX format; the MLX conversions published by mlx-community run noticeably faster on Apple Silicon.
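How can these tokens-per-second figures be measured? Ollama prints them directly when invoked as `ollama run <model> --verbose` (the `eval rate` line), and LM Studio shows them in its chat UI after each response. For a scripted comparison, the sketch below (an illustration, not the exact script behind the table above) reads the same counters (`eval_count`, `eval_duration`) from Ollama's local REST API, using a model and the prompt from the runs above.

```python
# Minimal sketch: measure Ollama generation speed via its local REST API.
# Assumes Ollama is running on the default port (11434) and the model is already pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gemma3:4b"  # any model from the table above
PROMPT = "write a long story about little red riding hood"

def tokens_per_second(model: str, prompt: str) -> float:
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        stats = json.load(resp)
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    return stats["eval_count"] / stats["eval_duration"] * 1e9

if __name__ == "__main__":
    # Average over three runs, as in the (*) results above.
    runs = [tokens_per_second(MODEL, PROMPT) for _ in range(3)]
    print(f"{MODEL}: {sum(runs) / len(runs):.2f} tokens/s")
```

LM Studio exposes a similar OpenAI-compatible local server (default `http://localhost:1234/v1`), so the same idea works there by timing the response and dividing by the reported `usage.completion_tokens`.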
Environment #
- MacBook Pro with Apple M1 Pro chip
- CPU: 8 cores (6 performance, 2 efficiency)
- GPU: 14 cores
- Memory: 16 GB
- macOS Sequoia 15.3.2
- Ollama 0.6.3
- LM Studio 0.3.14
- LM Studio llama.cpp (Metal) runtime v1.23.1
- LM Studio MLX runtime v0.11.1
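The hardware and software versions above can be read from the command line. The sketch below is one way to capture them on Apple Silicon macOS (the `hw.perflevel0`/`hw.perflevel1` sysctl keys report performance and efficiency core counts; the GPU core count is shown separately by `system_profiler SPDisplaysDataType`).

```python
# One way to collect the environment details listed above on Apple Silicon macOS.
import subprocess

def sh(cmd: list[str]) -> str:
    """Run a command and return its trimmed stdout."""
    return subprocess.run(cmd, capture_output=True, text=True).stdout.strip()

print("Chip:", sh(["sysctl", "-n", "machdep.cpu.brand_string"]))  # e.g. Apple M1 Pro
print("Performance cores:", sh(["sysctl", "-n", "hw.perflevel0.physicalcpu"]))
print("Efficiency cores:", sh(["sysctl", "-n", "hw.perflevel1.physicalcpu"]))
print("Memory (GB):", int(sh(["sysctl", "-n", "hw.memsize"])) // 2**30)
print("macOS:", sh(["sw_vers", "-productVersion"]))
print("Ollama:", sh(["ollama", "--version"]))
# GPU core count: see `system_profiler SPDisplaysDataType` ("Total Number of Cores").
```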
Official Benchmark Results #
Benchmark | Metric | DeepSeek R1 Distill Qwen 14B | DeepSeek R1 Distill Llama 8B | Phi-4 (14B) | Gemma 3 PT 12B | Gemma 3 PT 4B |
---|---|---|---|---|---|---|
AAII | | 49 | 34 | 40 | 34 | 24 |
AGIEval | 3-5-shot | | | | 57.4 | 42.1 |
AI2D | | | | | 75.2 | 63.2 |
AIME 2024 | cons@64 | 80 | | | | |
AIME 2024 | pass@1 | 69.7 | | | | |
AlignBench v1.1 | | | | | | |
ARC-c | 25-shot | | | | 68.9 | 56.2 |
ARC-e | 0-shot | | | | 88.3 | 82.4 |
Arena-Hard | | | | | | |
BIG-Bench Hard | few-shot | | | | 72.6 | 50.9 |
BLINK | | | | | 35.9 | 38 |
BoolQ | 0-shot | | | | 78.8 | 72.3 |
ChartQA | | | | | 74.7 | 63.6 |
COCOcap | | | | | 111 | 102 |
CodeForces | rating | 1481 | | | | |
CountBenchQA | | | | | 17.8 | 26.1 |
DocVQA (val) | | | | | 82.3 | 72.8 |
DROP | 1-shot | | | 75.5 | 72.2 | 60.1 |
ECLeKTic | | | | | 17.2 | 11 |
FloRes | | | | | 46 | 39.2 |
Global-MMLU-Lite | | | | | 69.4 | 57 |
GPQA | | | | | | |
GPQA | 5-shot | | | | 25.4 | 15 |
GPQA | Diamond pass@1 | 59.1 | | 56.1 | | |
GSM8K | | | | | | |
GSM8K | 8-shot | | | | 71 | 38.4 |
HellaSwag | 10-shot | | | | 84.2 | 77.2 |
HumanEval | 0-shot | | | | 45.7 | 36 |
HumanEval | | | | 82.6 | | |
IFEval | strict-prompt | | | | | |
IndicGenBench | | | | | 61.7 | 57.2 |
InfoVQA (val) | | | | | 54.8 | 44.1 |
LiveBench 0831 | | | | | | |
LiveCodeBench | pass@1 | 53.1 | | | | |
LiveCodeBench 2305-2409 | | | | | | |
MATH | 4-shot | | | | 43.3 | 24.2 |
MATH | | | | 80.4 | | |
MATH-500 | pass@1 | 93.9 | | | | |
MBPP | 3-shot | | | | 60.4 | 46 |
MGSM | | | | 80.6 | 64.3 | 34.7 |
MMLU-redux | | | | | | |
MMLU | | | | 84.8 | 74.5 | 59.6 |
MMLU (Pro COT) | 5-shot | | | | 45.3 | 29.2 |
MMMU (pt) | | | | | 50.3 | 39.2 |
MultiPL-E | | | | | | |
MT-bench | | | | | | |
Natural Questions | 5-shot | | | | 31.4 | 20 |
OKVQA | | | | | 58.7 | 51 |
PIQA | 0-shot | | | | 81.8 | 79.6 |
RealWorldQA | | | | | 52.2 | 45.5 |
ReMI | | | | | 38.5 | 27.3 |
SimpleQA | | | | 3 | | |
SocialIQA | 0-shot | | | | 53.4 | 51.9 |
SpatialSense VQA | | | | | 60 | 50.9 |
TallyQA | | | | | 51.8 | 42.5 |
TextVQA (val) | | | | | 66.5 | 58.9 |
TriviaQA | 5-shot | | | | 78.2 | 65.8 |
VQAv2 | | | | | 71.2 | 63.9 |
WinoGrande | 5-shot | | | | 74.3 | 64.7 |
WMT24++ (ChrF) | | | | | 53.9 | 48.4 |
XQuAD (all) | | | | | 74.5 | 68 |
Sources:
- Artificial Analysis Intelligence Index (AAII): https://artificialanalysis.ai/
- DeepSeek R1 Distill Qwen 14B: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-14B
- DeepSeek R1 Distill Llama 8B: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- Gemma 3 PT 12B: https://huggingface.co/google/gemma-3-12b-pt
- Gemma 3 PT 4B: https://huggingface.co/google/gemma-3-4b-pt
- Phi-4 (14B): https://huggingface.co/microsoft/phi-4