Audio Processing Lands in llama-server with Gemma 4 Support | Tool Update
llama.cpp's llama server now supports audio input processing via the new multimodal audio conformer encoder, initially targeting Gemma 4's audio capabilities.
Published on MyPrivateClaw
Apr 13, 2026, 8:37 AM UTC
Coverage date
Apr 13, 2026
Last updated
Apr 13, 2026, 8:37 AM UTC
News summary
The llama.cpp project has merged audio processing support into llama server, enabling multimodal audio input for locally served models. The initial implementation targets Gemma 4 's audio conformer encoder, with Qwen3 audio support (Qwen3 Omni and Qwen3 ASR) following shortly after. What Happened Two pull requests merged into the main llama.cpp branch within 48 hours: mtmd: add Gemma 4 audio conformer encoder support — adds the audio encoder architecture required to process Gemma 4's audio tokens through llama server mtmd: qwen3 audio support (qwen3 omni and qwen3 asr) — extends the multimodal audio pipeline to Qwen3's audio models Both PRs are part of the mtmd (multimodal) subsystem in llama.cpp. The changes allow llama server to accept audio files as input alongside text, process them through the model's audio encoder, and return text responses — enabling local speech to text and audi…