I'm by no means an expert, but my understanding is that a lot of the performance of libraries like OpenBLAS comes from hand-tuned kernels that target specific architectures (e.g. the particular SIMD instruction sets available on a given series of processors). You can probably squeeze out some more performance by targeting the WebAssembly architecture specifically (assuming OpenBLAS hasn't started doing something similar itself).
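
To make "targeting the architecture" a bit more concrete, here is a rough sketch of what I have in mind. It is not taken from OpenBLAS, and `dot` is just a hypothetical helper name I made up. It assumes Rust with the `core::arch::wasm32` intrinsics, and the SIMD path is only compiled when building for `wasm32` with the `simd128` target feature enabled (e.g. `RUSTFLAGS="-C target-feature=+simd128"`); everywhere else it falls back to a plain scalar loop.

```rust
// Hypothetical helper (not an OpenBLAS API): a dot product with a
// WebAssembly SIMD path and a portable scalar fallback.

// Compiled only for wasm32 builds with the simd128 feature enabled.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::{
        f32x4_add, f32x4_extract_lane, f32x4_mul, f32x4_splat, v128, v128_load,
    };

    let n = a.len().min(b.len());
    let chunks = n / 4;
    let mut acc = f32x4_splat(0.0);

    for i in 0..chunks {
        // SAFETY: i * 4 + 3 < n, so both 16-byte loads stay in bounds.
        // v128_load does not require an aligned pointer.
        let (va, vb) = unsafe {
            (
                v128_load(a.as_ptr().add(i * 4) as *const v128),
                v128_load(b.as_ptr().add(i * 4) as *const v128),
            )
        };
        // Multiply four lanes at once and accumulate.
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }

    // Horizontal sum of the four accumulator lanes.
    let mut sum = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);

    // Handle the leftover elements that did not fill a full vector.
    for i in chunks * 4..n {
        sum += a[i] * b[i];
    }
    sum
}

// Portable fallback for every other target (or wasm32 without simd128).
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

A real BLAS kernel would go much further than this (blocking for cache, unrolling, several accumulators), so treat it only as an illustration of picking a code path per target. Note also that the SIMD path adds the products in a different order than the scalar one, so the two can differ slightly in floating-point rounding.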