Fresh approaches to AI hardware are emerging, like the Groq chip, which uses software-defined memory and networking and has no caches. To simplify reasoning about the chip, Groq makes it fully synchronous, so the compiler can orchestrate data flows between memory and compute and plan network flows between chips. Every run becomes deterministic, removing the need to benchmark models: execution time can be calculated precisely at compile time. With these innovations, Groq achieved a state-of-the-art speed of 240 tokens/s on LLaMA 70B.
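To see why determinism removes the need for benchmarking, here is a minimal sketch: if every operation has a fixed, known cycle cost, the compiler can sum those costs and report the exact runtime before anything executes. The op names, cycle counts, and clock rate below are illustrative assumptions, not Groq's actual ISA or numbers.

```python
# Assumed per-op cycle costs (illustrative, not real hardware numbers)
CYCLE_COST = {
    "load":   4,
    "matmul": 128,
    "add":    1,
    "store":  4,
}

def compile_time_runtime(program, clock_hz):
    """Sum the fixed cycle costs of a straight-line program and convert to seconds."""
    total_cycles = sum(CYCLE_COST[op] for op in program)
    return total_cycles, total_cycles / clock_hz

program = ["load", "load", "matmul", "add", "store"]
cycles, seconds = compile_time_runtime(program, clock_hz=900e6)  # assumed 900 MHz clock
print(cycles)  # 4 + 4 + 128 + 1 + 4 = 141
```

On hardware with caches and dynamic scheduling, `CYCLE_COST` would be a distribution rather than a constant, and this calculation would be impossible.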
Fascinating stuff - a synchronous distributed system lets you treat 1000 chips as one: the compiler knows cycle-for-cycle exactly when data will arrive and which network paths are free, so it can balance loads statically. No more nondeterminism or complexity in optimizing for performance (high compute utilization). A handful of basic operations suffices, with the compiler handling optimization, instead of 100 kernel variants of CONV for every shape. And of course it integrates with PyTorch and other frameworks.
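The load-balancing point can be sketched too: since each op's exact cycle count is known at compile time, the compiler can statically partition work so that chips finish at nearly the same cycle. Below is a toy greedy longest-processing-time scheduler; the op costs and chip count are made-up assumptions, not Groq's scheduler.

```python
import heapq

def balance(op_cycles, num_chips):
    """Statically assign ops to chips: always give the next-largest op
    to the currently least-loaded chip (greedy LPT scheduling)."""
    heap = [(0, chip) for chip in range(num_chips)]  # (load_in_cycles, chip_id)
    heapq.heapify(heap)
    assignment = {chip: [] for chip in range(num_chips)}
    for cost in sorted(op_cycles, reverse=True):
        load, chip = heapq.heappop(heap)
        assignment[chip].append(cost)
        heapq.heappush(heap, (load + cost, chip))
    return assignment

ops = [128, 64, 64, 32, 32, 16, 16, 8]   # assumed per-op cycle counts
plan = balance(ops, num_chips=4)
# Each chip's finish cycle is known exactly before anything runs:
print({chip: sum(costs) for chip, costs in plan.items()})
```

The key contrast with a GPU cluster is not the algorithm (which is standard) but that the inputs to it - the cycle counts - are exact rather than estimates, so the computed schedule is the actual schedule.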
https://youtu.be/hk4QGpQAvSY?t=57