MLX on Apple Silicon: Run Gemma 4 Locally at Full Speed

MLX is Apple's open-source machine learning framework, purpose-built for the unified memory architecture of M-series chips. For Apple Silicon users, it can deliver up to 3–4x faster prompt processing and 1.5–2x faster generation than Ollama or llama.cpp on the same hardware, because its Metal GPU kernels and zero-copy use of unified memory are tuned for Apple's chips in a way that Ollama's llama.cpp backend is not. If you have an M1, M2, M3, M4, or M5 Mac and you're running local LLMs through Ollama, you're leaving significant performance on the table.

Why MLX Is Faster on Apple Silicon

Ollama uses llama.cpp as its inference backend on Apple Silicon. llama.cpp is excellent cross-platform software with its own Metal backend, but it was not designed specifically for Apple's hardware. MLX, by contrast, was built from the ground up for the M-series architecture. It runs its compute on the GPU through Metal, keeps arrays in unified memory so the CPU and GPU share them without copies, and can draw on the full memory bandwidth (up to 546 GB/s on M4 Max). It does not run on the Apple Neural Engine (ANE). The reported result is 3–4x faster prefill and 1.5–2x faster decode compared to llama.cpp on the same Mac, though the gap varies with model, quantization, and context length.

NOTE: The Neural Accelerators built into the M5 Max's GPU cores provide an additional 3–4x prefill speedup on top of the base MLX advantage. On an M5 Max Mac Studio, MLX inference is expected to be 10–15x faster than llama.cpp for long-context workloads; treat this as a projection, not a measured figure.
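To see why memory bandwidth matters so much for decode speed, here is a rough back-of-the-envelope calculation. It assumes decode is purely bandwidth-bound and that each generated token reads every active weight exactly once; it ignores the KV cache, quantization scales, and expert-routing overhead, so real numbers will be noticeably lower. The function name is illustrative, and the inputs (546 GB/s, 4B active parameters, 4-bit weights) are the figures quoted in this article.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float,
                             active_params_billions: float,
                             bits_per_weight: float = 4.0) -> float:
    """Idealized upper bound on decode speed for a bandwidth-bound model.

    Assumption: generating one token streams every active weight from
    memory once, and nothing else limits throughput.
    """
    # Billions of parameters * bytes per parameter = GB read per token.
    gb_read_per_token = active_params_billions * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


# M4 Max peak bandwidth, Gemma 4 26B MoE with 4B active params at 4-bit.
ceiling = decode_ceiling_tok_per_s(bandwidth_gb_s=546.0,
                                   active_params_billions=4.0)
print(f"{ceiling:.0f} tok/s")  # prints "273 tok/s"
```

No framework reaches this ceiling in practice, but the closer the kernels get to saturating memory bandwidth, the faster decode becomes. This is also why a sparse model with 4B active parameters decodes far faster than a dense 26B model.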

Install mlx-vlm

Gemma 4 is a multimodal model, so running it requires mlx-vlm rather than mlx-lm. The mlx-vlm package handles model downloading from Hugging Face, quantization, and serving for vision-language models. Install with -U so you get the latest version, since Gemma 4 support was added in mlx-vlm 0.4.3.

bash

pip install -U mlx-vlm

# Run Gemma 4 26B MoE (recommended for 16 GB+ RAM)
# This is a Mixture of Experts model — only 4B params active per token, very fast
mlx_vlm.generate \
  --model mlx-community/gemma-4-26b-a4b-it-4bit \
  --prompt "Explain the attention mechanism"

# For 8–12 GB RAM — Gemma 4 E4B (edge model)
mlx_vlm.generate \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --prompt "Write a Python function to parse JSON safely"

NOTE: Gemma 4 was released April 2, 2026 under Apache 2.0. The 26B MoE model (mlx-community/gemma-4-26b-a4b-it-4bit) is the most popular MLX variant with 67,000+ downloads. It uses sparse expert routing so only 4B parameters are active per token, making it fast and memory-efficient. The E4B edge model is the best option for 8 GB Macs.
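The 16 GB recommendation follows from simple arithmetic on the weights. The sketch below is an estimate only: it counts 4 bits per parameter and leaves out quantization scales, the KV cache, and whatever macOS itself needs. The helper name is made up for illustration.

```python
def weight_memory_gb(total_params_billions: float,
                     bits_per_weight: float = 4.0) -> float:
    """Approximate size of the quantized weights in GB.

    For a Mixture of Experts model, all experts must be resident in
    memory even though only a few are active for any given token, so
    the total parameter count is what matters here.
    """
    return total_params_billions * bits_per_weight / 8.0


print(f"{weight_memory_gb(26.0):.1f} GB")  # prints "13.0 GB"
```

So the 4B active parameters set the speed, while the full 26B set the memory footprint. That is why the 26B MoE wants a 16 GB+ machine even though it decodes like a much smaller model.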

Serve as an OpenAI-Compatible API

mlx-vlm includes a built-in server that exposes an OpenAI-compatible API. This means Open WebUI, Continue.dev, LangChain, and any other tool that supports a custom OpenAI base URL can talk to MLX without extra glue code:

bash

# Start the MLX server on port 8080
mlx_vlm.server \
  --model mlx-community/gemma-4-26b-a4b-it-4bit \
  --port 8080

# Test it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/gemma-4-26b-a4b-it-4bit", "messages": [{"role": "user", "content": "Hello"}]}'

TIP: To use MLX with Open WebUI, add the server as an OpenAI API connection (not an Ollama connection) with the base URL http://localhost:8080/v1. The MLX server speaks the OpenAI format, while Ollama's native API uses different endpoints, so pointing Open WebUI's Ollama URL at the MLX server will not work.
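The same request as the curl test can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from this section is running on port 8080 and returns the standard OpenAI chat-completions response shape (choices[0].message.content). The helper names build_chat_request and chat are illustrative, not part of mlx-vlm.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 used above
MODEL = "mlx-community/gemma-4-26b-a4b-it-4bit"


def build_chat_request(prompt: str, model: str = MODEL,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the reply text.

    Assumption: the response follows the OpenAI chat-completions schema.
    """
    with urllib.request.urlopen(build_chat_request(prompt),
                                timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, print(chat("Hello")) should print the model's reply. Building the request separately from sending it makes it easy to inspect or log the payload without touching the network.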

Performance by Mac Model

These are approximate real-world decode speeds for Gemma 4 26B MoE at 4-bit quantization using mlx-vlm. Because it is a Mixture of Experts model with only 4B active parameters per token, it is significantly faster than a dense 26B model. Prefill speeds are 3–10x faster than these figures depending on context length.

NOTE: Approximate decode speeds (Gemma 4 26B MoE, 4-bit), all from mlx-vlm community benchmarks:

M2 Pro 16GB: ~42 tok/s
M3 Pro 18GB: ~55 tok/s
M4 Pro 24GB: ~72 tok/s
M4 Max 48GB: ~95 tok/s
M4 Max 128GB: ~130 tok/s

For 8 GB Macs, use Gemma 4 E4B, which runs at ~85 tok/s on M3.

Fine-Tuning with MLX

MLX also supports LoRA fine-tuning directly on your Mac, with no cloud GPU required. Fine-tuning on a custom dataset of a few thousand examples takes roughly 1–2 hours on an M3 Pro. This is one of the most compelling use cases for Apple Silicon: you can fine-tune a model on sensitive data that never leaves your machine. Note that LoRA fine-tuning for Gemma 4 uses mlx-lm (text-only) on the E4B edge model, because mlx-vlm fine-tuning support for the larger MoE models is still experimental as of April 2026.

bash

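# Fine-tuning uses mlx-lm (text-only) on the E4B edge model
pip install -U mlx-lm

# --data takes a directory containing train.jsonl and valid.jsonl,
# not a single .jsonl file. Adapters are saved to adapters/ by default.
mlx_lm.lora \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --train \
  --data path/to/your/dataset_dir \
  --iters 1000 \
  --batch-size 4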

# Merge the adapter into the base model
mlx_lm.fuse \
  --model mlx-community/gemma-4-e4b-it-4bit \
  --adapter-path adapters/

TIP: LoRA fine-tuning on sensitive data (medical records, legal documents, proprietary code) is one of the strongest privacy use cases for Apple Silicon. The data never leaves your machine, and the resulting model is entirely yours.
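mlx_lm.lora reads its training data from a directory holding train.jsonl and valid.jsonl, with one JSON object per line. The sketch below shows one way to produce those files from question/answer pairs using the chat-style "messages" record format that mlx-lm accepts. The function name, the 10% validation split, and the example pairs are all illustrative choices, not requirements.

```python
import json
import random
from pathlib import Path


def write_chat_dataset(pairs, out_dir, valid_fraction=0.1, seed=0):
    """Write (question, answer) pairs as train.jsonl / valid.jsonl.

    Each line is a chat-format record:
    {"messages": [{"role": "user", ...}, {"role": "assistant", ...}]}
    Returns a dict mapping file name to number of records written.
    """
    if len(pairs) < 2:
        raise ValueError("need at least 2 examples to make both splits")

    records = [
        {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for question, answer in pairs
    ]
    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(records)

    n_valid = max(1, int(len(records) * valid_fraction))
    splits = {
        "valid.jsonl": records[:n_valid],
        "train.jsonl": records[n_valid:],
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / name, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return {name: len(rows) for name, rows in splits.items()}
```

For example, write_chat_dataset([("What is MLX?", "Apple's ML framework."), ("What is LoRA?", "A low-rank adapter method.")], "my_dataset") creates my_dataset/train.jsonl and my_dataset/valid.jsonl, and that directory is what you pass to --data.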
