The BLAS have historically relied on a wide range of microarchitecture-specific optimizations to get the most out of each processor generation. An ideal solution would be for the browser to provide those tuned routines to the application, in a way that makes it difficult to fingerprint the underlying hardware.
See also the history of ATLAS, GotoBLAS, Intel MKL, etc.
libflame/BLIS might be a good starting point: they've created a framework where you bring your own compute kernels and it turns them into a BLAS (plus some other nice functionality). I believe most of the framework itself is in C, so I guess it could somehow be made to spit out wasm (I know nothing about wasm). Getting the browser to be aware of the actual assembly kernels might be a pain, though.
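To make the "bring your own kernel" idea a bit more concrete, here is a rough sketch in C. To be clear, this is not the actual BLIS API: the names `microkernel_fn`, `microkernel_ref`, `gemm_with_kernel`, and the block sizes `MR`/`NR` are all made up for illustration. As far as I understand it, the real framework also does packing and cache blocking around the kernel, which this leaves out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register-block sizes; a real framework would choose
 * these per microarchitecture. */
enum { MR = 4, NR = 4 };

/* Hypothetical kernel signature (not the BLIS one):
 * computes C[MR x NR] += A[MR x k] * B[k x NR], row-major,
 * with explicit leading dimensions lda, ldb, ldc. */
typedef void (*microkernel_fn)(size_t k,
                               const double *a, size_t lda,
                               const double *b, size_t ldb,
                               double *c, size_t ldc);

/* Portable reference kernel. This is the only piece that would be
 * swapped for a tuned (assembly or SIMD) version per processor. */
void microkernel_ref(size_t k,
                     const double *a, size_t lda,
                     const double *b, size_t ldb,
                     double *c, size_t ldc)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * ldc + j] += a[i * lda + p] * b[p * ldb + j];
}

/* "Framework" side: C (m x n) += A (m x k) * B (k x n), row-major.
 * It only knows how to tile the problem and call whatever kernel it
 * is handed. For brevity, m and n must be multiples of MR and NR. */
void gemm_with_kernel(size_t m, size_t n, size_t k,
                      const double *a, const double *b, double *c,
                      microkernel_fn kernel)
{
    assert(m % MR == 0 && n % NR == 0);
    for (size_t i = 0; i < m; i += MR)
        for (size_t j = 0; j < n; j += NR)
            kernel(k, a + i * k, k, b + j, n, c + i * n + j, n);
}
```

The point of the split is that `gemm_with_kernel` is plain C and could presumably be compiled to wasm as-is, while the tuned replacement for `microkernel_ref` is the part the browser would somehow have to supply.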