Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

Kog AI has launched a tech preview of its Kog Inference Engine (KIE), reporting real-time large language model (LLM) inference speeds of 3,000 output tokens per second per request on eight AMD MI300X GPUs and 2,100 tokens per second on eight NVIDIA H200 GPUs, according to blog.kog.ai. The preview currently runs a 2-billion-parameter model, with plans to support larger third-party mixture-of-experts (MoE) models at similar speeds.

The KIE achieves these speeds by optimizing the entire software stack through architecture, engine, and kernel co-design, with the goal of making the fullest possible use of GPU memory bandwidth. According to Kog AI, this approach reaches performance levels comparable to dedicated inference hardware. The company has also made a live coding playground available so users can test the inference speed themselves, and it highlights the importance of single-request LLM decoding speed for AI agents.
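To see why memory bandwidth is the natural thing to optimize, a rough back-of-the-envelope estimate helps. In single-request decoding, each generated token requires reading the model weights from GPU memory, so aggregate bandwidth divided by weight size gives a loose upper bound on tokens per second. The sketch below is illustrative only and rests on assumptions that are not in Kog AI's announcement: 2 bytes per parameter (16-bit weights), vendor peak HBM bandwidth of about 5.3 TB/s per MI300X and 4.8 TB/s per H200, and perfect scaling across eight GPUs with no KV-cache traffic or communication cost. The function names are hypothetical.

```python
def token_latency_ms(tokens_per_second: float) -> float:
    """Time budget per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def bandwidth_bound_tokens_per_second(
    num_params: float,
    bytes_per_param: float,
    num_gpus: int,
    bandwidth_bytes_per_s_per_gpu: float,
) -> float:
    """Loose upper bound on single-request decode speed.

    Assumes every token requires one full read of the weights and that
    the weights are split evenly across GPUs (an idealization).
    """
    bytes_read_per_token = num_params * bytes_per_param
    aggregate_bandwidth = num_gpus * bandwidth_bytes_per_s_per_gpu
    return aggregate_bandwidth / bytes_read_per_token


if __name__ == "__main__":
    # Reported figures from the article.
    reported = {"8x MI300X": 3000.0, "8x H200": 2100.0}
    # Assumed vendor peak HBM bandwidth per GPU, in bytes per second.
    peak_bandwidth = {"8x MI300X": 5.3e12, "8x H200": 4.8e12}

    for name, tps in reported.items():
        ceiling = bandwidth_bound_tokens_per_second(
            num_params=2e9,       # 2-billion-parameter model (from the article)
            bytes_per_param=2.0,  # assumption: 16-bit weights
            num_gpus=8,
            bandwidth_bytes_per_s_per_gpu=peak_bandwidth[name],
        )
        print(
            f"{name}: {token_latency_ms(tps):.3f} ms/token reported, "
            f"idealized ceiling {ceiling:.0f} tokens/s, "
            f"reported speed is {tps / ceiling:.0%} of that ceiling"
        )
```

Under these assumptions, 3,000 tokens per second leaves roughly a third of a millisecond per token, so kernel launch overhead and inter-GPU synchronization become significant alongside raw bandwidth. The ceiling is an idealization, not a measurement, and the real precision and parallelism scheme used by the KIE may differ.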

The development is significant because it suggests that standard datacenter GPUs can deliver single-request inference speeds previously thought achievable only with specialized hardware. That could lower barriers to deploying high-speed AI applications and enable faster AI agent responses. If the planned support for larger third-party MoE models reaches similar speeds, the approach would apply to a broader range of AI workloads; for now, the preview demonstrates these speeds only on a 2-billion-parameter model.

Kog AI plans to expand support to larger models in upcoming releases and continue refining the KIE’s performance. Observers should watch for updates on the integration of third-party MoE models and further benchmarks that may influence AI infrastructure choices across industries, as detailed on blog.kog.ai.

Editorial standards. Reported and edited at Startupniti's news desk from the sources listed in the right rail. Every fact traces to a citation. If something looks wrong, write to corrections.