Developer runs Gemma 4 and Qwen 3.6 locally on macOS with llama.cpp

Kyle Howells successfully set up a local coding agent on macOS running Gemma 4 26B-A4B and Qwen 3.6 35B-A3B models using llama.cpp with Metal acceleration. The setup supports OpenAI-compatible API calls and multimodal inputs, including screenshots, enabling a responsive coding assistant without internet connectivity, according to ikyle.me.

After experiencing internet outages that disrupted access to cloud-based coding agents, Howells aimed to create a fast, local solution on his Mac. He built llama.cpp with Metal support for macOS and deployed Gemma 4 in GGUF format with Multi-Token Prediction (MTP) decoding, which doubled the model's speed. The final configuration also integrated Qwen 3.6, allowing the agent to handle images and screenshots in real time.

This local deployment addresses common developer challenges of latency and privacy by eliminating reliance on external servers. The use of llama.cpp with Metal acceleration leverages macOS hardware efficiently, making large language models like Gemma 4 and Qwen 3.6 practical for on-device coding assistance. This approach contrasts with typical cloud-based AI services, offering greater control and offline functionality.

Howells shared a demonstration video showcasing the agent's real-time responsiveness on macOS, highlighting the practical usability of the setup. The blog post detailing the process was published 10 hours ago on ikyle.me, providing step-by-step guidance for replicating the environment.

Editorial standards. Reported and edited at Startupniti's news desk from the source listed in the right rail. Every fact traces to a citation. If something looks wrong, write to corrections.