iPhone 17 Pro runs 400B-parameter LLM on-device
Why it matters: A new Flash-MoE inference engine lets massive models (which typically need 200GB of RAM) run on a phone with just 12GB of RAM by streaming weights from SSD on demand. This fundamentally changes what's possible for edge AI and could democratize powerful local inference.
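The core idea, that only the weights a token actually activates need to be resident in RAM, can be illustrated with a small, hypothetical sketch (the expert counts, sizes, and file layout below are toy values, not details of the actual engine). Memory-mapping the checkpoint lets the OS page in only the slices that are read, so resident memory stays far below total model size:

```python
import mmap
import os
import tempfile

EXPERTS = 8          # toy expert count (illustrative only)
EXPERT_BYTES = 1024  # toy per-expert weight size

# Create a toy "checkpoint" on disk, standing in for a huge weight file.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    for e in range(EXPERTS):
        f.write(bytes([e]) * EXPERT_BYTES)

def load_expert(mm, idx):
    """Read one expert's weights on demand; only this slice gets paged in."""
    off = idx * EXPERT_BYTES
    return mm[off:off + EXPERT_BYTES]

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # In a real MoE model, a router picks the active experts per token;
    # here we hard-code two to show that the rest of the file is untouched.
    active = [1, 5]
    weights = [load_expert(mm, e) for e in active]
    mm.close()
```

With a mixture-of-experts model, each token touches only a handful of experts, so the working set per forward pass is a small fraction of the full parameter count — which is what makes SSD streaming viable at all.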