The vmap result is wild — 45x faster, and it even beats XLA’s fused attention at large sizes. Just from telling the compiler that Q blocks are independent. But I still don’t really understand why the original was so slow, or what the hardware is actually doing with those tiles. Time to look up how TPU works.
НХЛ — регулярный чемпионат
,更多细节参见TikTok
"I brought one bookcase down to the void deck in the middle of the night. I was concerned about whether people would catch me doing something I was 'not supposed' to be doing."
Trump relaxed export controls on the microchip maker Advanced Micro Devices (AMD) after the company gave $1million to Maga Inc.