Another technique is to transpose the left matrix so that each dot product is scanned in row order and hence is more cache-friendly.
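A minimal sketch of that idea in TypeScript, assuming flat row-major Float32Arrays (in a row-major layout it's the right-hand operand whose transpose makes the inner product contiguous; with column-major storage it would be the left matrix, as above). All names and sizes here are illustrative:

```ts
// Pre-transpose B so the inner loop walks both operands with stride 1.
function matmulTransposed(A: Float32Array, B: Float32Array, C: Float32Array, n: number) {
  const Bt = new Float32Array(n * n);
  for (let k = 0; k < n; k++)
    for (let j = 0; j < n; j++)
      Bt[j * n + k] = B[k * n + j];          // O(n^2) transpose, amortized over O(n^3) work
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++)
        acc += A[i * n + k] * Bt[j * n + k]; // both accesses are now sequential in memory
      C[i * n + j] = acc;
    }
}
```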
Another one I tried ages ago is to use a single loop counter and "de-interleave" the bits to get what would normally be 3 distinct loop variables. For this you need to modify the entry in the result matrix rather than having it write-only. It has the effect of accessing like a z-order curve but in 3 dimensions. It's a bit of overhead, but you can also unroll say 8 iterations (2x2x2) which helps make up for it. This ends up making good use of both caches and even virtual memory if things don't fit in RAM. OTOH it tends to prefer sizes that are a power of 2.
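A rough sketch of the bit de-interleaving, assuming square power-of-two matrices stored as flat row-major Float32Arrays and a zero-initialized C; the 2x2x2 unrolling is left out for brevity and the function names are made up:

```ts
// Pull every third bit out of t, starting at `offset`, to recover one loop index.
function deinterleave3(t: number, offset: number): number {
  let idx = 0;
  for (let bit = 0; offset + 3 * bit < 30; bit++)
    idx |= ((t >>> (offset + 3 * bit)) & 1) << bit;
  return idx;
}

// A single counter walks a 3-D Morton (z-order) curve over (i, j, k).
// C is read-modify-write because each entry is touched many times, out of order.
function matmulMorton(A: Float32Array, B: Float32Array, C: Float32Array, n: number) {
  for (let t = 0; t < n * n * n; t++) {
    const i = deinterleave3(t, 0);
    const j = deinterleave3(t, 1);
    const k = deinterleave3(t, 2);
    C[i * n + j] += A[i * n + k] * B[k * n + j];
  }
}
```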
Loop order and in-memory striding choice in general is pretty interesting. I worked on an old voxel-based game engine that used a very complicated on-disk storage format. It seemed unnecessary, so I experimented with a very simplified strategy of literally writing out voxel color bytes to a stream and gzipping it, which turned out to be both faster and smaller in my application. IIRC it was like 10x smaller, but only if I used the right striding order.
I knew the striding-order benefit would only hold for voxel data similar to what I was testing on, so I wanted to generalize it: each dataset could be configured dynamically to use the striding order that suited it best. I tried setting up numpy ndarrays so that the memory striding order could be configured independently of the array indices, so my code wouldn't have to change if the striding order changed, but I could never get it to work, and unfortunately I wasn't able to convince the numpy folks I interacted with that this feature would be helpful.
I find it surprising that, even after using all those tricks, they are still only able to achieve around 50% of the theoretical peak performance of the chip in terms of GFLOPS. And that's for matrix multiplication, which is a nearly ideal case for these techniques.
A polyhedral compiler wouldn't find this either - polyhedral compilation is about finding optimal schedules for loop nests, i.e., the order in which independent (wrt dataflow) iterations run. As far as I know, a transpose can't be expressed in the polyhedral model.
Hmm I thought GCC's polyhedral optimizations had a loop transposition, but it turned out I was remembering an old "-floop-transpose" flag that seems to be only in old Apple GCC to get a SPEC win…
The transpose is irrelevant after moving on to z-order access patterns and the 3D equivalent. You can't quite reach peak sustained performance, but it's well worth the trade for the simplicity IMO.
Matrix multiplication is O(n^3), while transposing is O(n^2). So for a large matrix, transposing will only take a tiny fraction of the time spent in multiplication. (But the loop reordering trick mentioned in @mynameismon's link is a very nice alternative.)
You don't transpose it before the matmul, you always have it transposed (i.e., when you print the weights of a linear layer in PyTorch, you're actually seeing (A^t)^t, and what's stored is A^t).
My question: will this kind of thing become more mainstream? I've seen the web emerge, go from static pages to entire apps being delivered and executed in the browser. The last bastion of native apps and libraries seems to be highly optimized algorithms but maybe those will also migrate to a deliver-from-the-web and execute in some kind of browser sandbox.
Java promised to deliver some version of native code execution but the Java app/applet idea never seemed to take off. In some ways it seems superior to what we have now but maybe the security concerns we had during that era held Java back too much. Or am I misunderstanding what WebAssembly can bring to the game?
The web always feels terrible compared to a native app, especially on a phone. Some of the difference is due to Safari being a bad browser (probably intentional), but a bigger part is that the threading model makes it really difficult to have a responsive UI. Not to mention the browser’s gestures often clash with the application’s gestures.
> The web always feels terrible compared to a native app
'always' is certainly not true. Yes, with modern frameworks it is very easy to build websites which are slow. But it is also possible to build websites with butter smooth animations and instant responses.
I hope that in the future we will get frameworks that make it easier to create lightweight web apps, so that we will see more high performance apps.
I made another comment further down, but basically web apps can't run as fast as native apps for a variety of reasons.
One reason is the threading model. There are two main threads that need to be synchronized (the browser main thread and the JS main thread), which will always be slower than a single main thread.
Another reason is that layout and measurement of elements in HTML is really complicated. Native apps heavily encourage deferred measurement which lets the app measure and lay itself out once per render pass. In JavaScript, layouts may need to happen immediately based on what properties of the dom you’re reading and setting.
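To illustrate that last point (my example, not the parent's; `#box` is a hypothetical element):

```ts
const el = document.getElementById("box")!;

el.style.width = "200px";
const w1 = el.offsetWidth;      // forces a synchronous layout right here
el.style.width = `${w1 / 2}px`;
const w2 = el.offsetWidth;      // and another one - two layouts in one handler instead of zero
```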
I think nobody will argue against the claim that the majority of native apps are faster than web apps. But the key point isn't which is faster, it's whether the web is fast enough.
In general, 60 fps is considered sufficient for smooth rendering, and even 5 years ago mobile hardware was fast enough for 60 fps web page rendering. However, many web pages are built in ways that prevent browsers from achieving that goal.
So yes, it is harder for developers to create a pleasant experience and as a result there are more bad apples in the web app basket.
I disagree. Yes, if you have a static webpage and all you need to do is scroll, then you can easily get 60 fps, notably because the scrolling is handled natively by the browser and basically is just a translation on the GPU. If the web app accepts user input, especially touch with dragging, then the page will not feel native with the current batch of browsers for the reasons I mentioned above.
You don’t have to always use the DOM. You can render in another thread, or even run compute in another thread and use the animation frame system to handle updates.
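A rough sketch of that split (all names are assumptions: a hypothetical `compute-worker.js` file, square roots standing in for the real work, and the document title standing in for the actual UI update):

```ts
// compute-worker.ts - runs off the main thread
self.onmessage = (e: MessageEvent<Float64Array>) => {
  const data = e.data;
  for (let i = 0; i < data.length; i++) data[i] = Math.sqrt(data[i]); // stand-in for real work
  (self as any).postMessage(data, [data.buffer]);                     // transfer back, no copy
};

// main.ts - only applies the latest result, once per animation frame
const worker = new Worker("compute-worker.js");
let latest: Float64Array | null = null;
worker.onmessage = (e: MessageEvent<Float64Array>) => { latest = e.data; };
worker.postMessage(new Float64Array(1_000_000));

function frame() {
  if (latest !== null) {
    document.title = `latest[0] = ${latest[0]}`;  // stand-in for the actual rendering
    latest = null;
  }
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```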
Having said that, maybe a little less than 10 years ago, we achieved the desired performance with touch-screen dragging of DOM elements. I don’t remember specifics, but we didn’t use any frameworks.
I would want to see an iOS app that renders 1000 items in a list, with each item pushing to the navigation stack when clicked. And then I would want to see the same thing implemented in the browser with similar responsiveness on iOS Safari.
Writing this out made me realize another limitation of the browser: the page unloads when navigating, so going “back” can be a pain / impossible depending on the circumstances.
Actually, the biggest issue here is writing a native iOS app, due to the limitations enforced by Apple. Browsers on iOS are also a big pain, because all iOS browsers use the Safari rendering engine and the users don't even have the choice to use anything else.
However, you wrote that 'the page unloads when navigating', which made me wonder how much you know about frontend development, because preventing the browser from navigating is such a common topic, and there are multiple solution patterns, all well understood by professional frontend developers:
- preventDefault
- return false
- use anchors
The 1000 item list is indeed a problem frontend developers have to be aware of. Modern flagship smartphones may have the capability to render such lists, but less powerful devices can't handle the load in a 60 fps fashion. So you have to build list views that render only the visible part of the list. In the end you have to do that anyway, because even for high-end devices there are limits to what they can show, but many web components are not built with such scenarios in mind. And that is certainly one of the reasons why there are so many web apps with slow rendering performance.
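A bare-bones sketch of that kind of windowed list (my own illustration; the element id, fixed row height and absolute-positioning scheme are all assumptions):

```ts
const rowHeight = 44;                                  // fixed row height keeps the math trivial
const container = document.getElementById("list")!;   // scrollable element with position: relative
const spacer = document.createElement("div");          // gives the scrollbar its full height
container.appendChild(spacer);

function renderVisible(items: string[]) {
  spacer.style.height = `${items.length * rowHeight}px`;
  const first = Math.floor(container.scrollTop / rowHeight);
  const last = Math.min(items.length, first + Math.ceil(container.clientHeight / rowHeight) + 1);

  // Throw away old rows and create only the ones currently in view.
  container.querySelectorAll(".row").forEach(r => r.remove());
  for (let i = first; i < last; i++) {
    const row = document.createElement("div");
    row.className = "row";
    row.style.cssText = `position:absolute; top:${i * rowHeight}px; height:${rowHeight}px;`;
    row.textContent = items[i];
    container.appendChild(row);
  }
}

const items = Array.from({ length: 1000 }, (_, i) => `Item ${i}`);
container.addEventListener("scroll", () => renderVisible(items));
renderVisible(items);
```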
I generally find Safari’s performance better than other browsers. Also, every mainstream JS runtime is multithreaded. There are limitations on what can be shared between threads, but you can optimize a lot despite those limitations (including using WASM on the web, and native extensions/FFI on Node/Deno).
On iOS, the perf of Safari is definitely better than every other browser, because every other browser is mandated to use WkWebView by Apple. They aren’t allowed to implement their own engine. Of course Apple isn’t subject to the same restriction.
On a Mac, Safari Javascript generally outperforms Chrome and Firefox (other browsing tasks are also generally better performing), but there are some workloads where Safari turns out slower, especially when the developer has put a lot of work into Chrome-specific optimization.
Safari also generally uses a lot less memory and CPU for the same websites. Chrome in particular burns through my battery very quickly, and is basically completely incapable of keeping up with my browser use style (it just crashes when I try to open a few hundred tabs). Presumably nobody with authority at Google is a heavy laptop web-user or prioritizes client-side resource use: Google’s websites are also among the biggest browser resource hogs, even when sitting idle in a background tab.
Safari often takes a couple years longer than other browsers to implement cutting-edge features. This seems to me like a perfectly reasonable design decision; some web developers love complaining about it though, and some sites that were only developed against the most recent versions of Chrome don’t work correctly in Safari.
In my experience, if the thing I’m working on runs in Safari, it’s buttery smooth. And if it doesn’t run, it completely shits the bed. Stuff like “we don’t support that way of doing shadows in SVG, so rather than simply not implement that property, we’ve turned your entire SVG element black”.
Every mainstream JS runtime has some sort of background garbage collection and parsing/compilation, but for any execution that has to interact with JavaScript heap or stack state you’re still in single threaded territory. SharedArrayBuffer can help if you’re willing to give up on JavaScript objects entirely, and for WebAssembly this is less of a burden, but that’s not going to help you perform rendering on most web apps concurrently. JSCore goes a little bit further and can run javascript in the same runtime instance on multiple threads with some significant limitations, but this isn’t exposed to developers on Safari.
It really depends on your use case and on how you design for threads. Quite a lot of data types can be sent by postMessage, and a decent amount of them are transferable. If your use case doesn’t require passing around functions (and if you can’t find a way to work around that), many workloads which would benefit from multithreading in another language will likely similarly benefit from JS threads (Web Workers, Node’s Worker Threads, etc). The biggest determining factor at that point is whether the overhead of instantiating them, and of message passing, is worth it. (And the latter is often a lot less than people realize; in my experience it’s almost universally faster than crossing a WASM/FFI-type of boundary.)
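To make the transferable point concrete (a sketch of my own, not from the parent; the worker script name is made up):

```ts
const worker = new Worker("crunch.js");            // hypothetical worker script
const samples = new Float64Array(10_000_000);      // ~80 MB

// Structured clone: the worker gets a copy, `samples` stays usable here.
worker.postMessage(samples);

// Transfer: ownership of the underlying buffer moves to the worker, no copy is made,
// and `samples` is detached (length 0) on this side afterwards.
// worker.postMessage(samples, [samples.buffer]);
```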
On what engine have you seen postmessage be faster than a WASM/JS boundary? It’s at least an order of magnitude slower on Chrome/V8, even when a call needs to pass significant strings into WASM (microseconds vs hundreds of ns). It’s also worth noting that postmessage is itself an FFI boundary that requires a JS to C++ transition on every engine I’m aware of.
I’ve seen it on Node 16 (also V8), trying both purpose-built and custom bindings to libraries built in Rust, C, C++ and Go. All of the purpose-built libraries acknowledge there’s an overhead to cross the native/WASM boundary. In my experience, measuring extensively, postMessage with native runtime objects outperforms any other ser/de solution that works with JS values.
The issue isn’t so much having additional threads, it’s needing two main threads. The browser has one main thread for accepting user input and then dispatches the relevant user events to the JS main thread.
The browser can either dispatch the events asynchronously, leading to the events being handled in a noticeably delayed way, or the browser can block its main thread until the JS dispatch finishes, leading to fewer UI events being handled. Either way is an inferior experience.
Technically yeah, but the threads could only communicate through message passing, which isn't ideal for performance. The more recent major improvement is that workers can share a single memory space, similar to how threading works in native applications.
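A minimal sketch of that shared-memory setup (assumes a cross-origin-isolated page, since SharedArrayBuffer is gated behind the COOP/COEP headers; the worker file name is made up):

```ts
// main.ts
const sab = new SharedArrayBuffer(1024 * Float64Array.BYTES_PER_ELEMENT);
const shared = new Float64Array(sab);
const worker = new Worker("worker.js");
worker.postMessage(sab);                 // only a handle crosses the boundary, not a copy
// `shared[0]` will eventually read 42 once the worker has written it.

// worker.ts
self.onmessage = (e: MessageEvent<SharedArrayBuffer>) => {
  const view = new Float64Array(e.data);
  view[0] = 42;                          // writes directly into the same memory the main thread sees
};
```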
> Some of the difference is due to Safari being a bad browser (probably intentional), but a bigger part is that only having one thread makes it really difficult to have a responsive UI
Interestingly Safari is actually generally better than other browsers for running a responsive smooth UI. Not sure how much of that is the Safari engine and how much of it is better CPU on iPhones. But even on a first generation iPad or an iPhone 4 it was possible to get 60fps rendering fairly easily. The same could not be said for even higher end android phones of the time.
The web solved software distribution. Full stop. There only remain the edge cases, and that's where webasm wants to help.
Sun/Java wanted badly to solve this problem, but tried to do too much too soon. Java gave devs in 1999 cutting edge OOP tools for doing GUIs (e.g. Swing) but distribution was always the problem. Installing and running browser plugins was always error prone, and it turned out the browser was itself just good enough to deliver value, so it won. (With the happy side-effect of giving the world-wide community of devs one of the gooeyist languages ever to express their fever dreams of what application code should be).
The question in my mind is whether there is enough demand for the kinds of software webasm enables, especially given that other routes of distribution (app stores) have filled in the gaps of what the web delivers, and are generally a lot more pleasant and predictable to native devs.
I'm not really equipped to predict anything, but I think recent surge in popularity of simple RISC-y[1] architectures like ARM will allow the WebAssembly standard to stay small yet efficient. I'm hopeful, but standards often have a way of not keeping up with the newest technology so we'll see.
I don't expect ARM to have much effect here. The wasm instruction set is basically a low-level compiler intermediate representation, and to go from that to machine code for x86 or ARM is about equally difficult in both cases.
I really like this writeup. Note that it may not be worth using the SIMD in this way (horizontal SIMD) if you know you will be multiplying many matrices that are the same size. It may be better to do vertical SIMD and simply perform the scalar algorithm on 4 or 8 matrices at a time, like GPUs would do for vertex shaders. This does mean that you may have to interleave your matrices in an odd way to optimize memory access, though.
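A sketch of that "vertical" layout, written as scalar TypeScript for clarity (my illustration, not the article's code): element (i, j) of four same-sized matrices sits in four adjacent floats, so the inner `lane` loop is exactly what a single 128-bit f32x4 multiply-add would cover per iteration under WASM SIMD.

```ts
const LANES = 4; // four matrices processed "vertically", one per SIMD lane

// A, B, C each hold 4 interleaved n x n matrices: index = (i*n + j)*LANES + lane
function matmulBatch4(A: Float32Array, B: Float32Array, C: Float32Array, n: number) {
  for (let i = 0; i < n; i++)
    for (let j = 0; j < n; j++)
      for (let k = 0; k < n; k++)
        for (let lane = 0; lane < LANES; lane++)   // one f32x4 mul/add when vectorized
          C[(i * n + j) * LANES + lane] +=
            A[(i * n + k) * LANES + lane] * B[(k * n + j) * LANES + lane];
}
```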
It's kind of bizarre how it's an accomplishment to get your code closer to what the hardware is capable of. In a sane world, that should be the default environment you're working in. Anything else is wasteful.
There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.
The real win here is when we can have both because of smart toolchains that can transform those high-level constructs and representations into the most efficient implementation for the hardware.
Posts like this demonstrate what's possible with the right optimizations, so that tools like compilers and assemblers can be made to take advantage of them when given the high-level code. That way we can achieve what you're hoping for: the default being optimal implementations.
> That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.
That's arguable at best. I for one am sick of 'developer productivity' being the excuse for why my goddamned supercomputer crawls when performing tasks that were trivial even on hardware 15 years older.
> The real win here is when we can have both because of smart toolchains that can transform those high-level constructs and representations into the most efficient implementation for the hardware.
That's been the promise for a long time and it still hasn't been realized. If anything things seem to be getting less and less optimal.
No, it's really not even arguable. Lots and lots of software is written in business contexts where the cost of developing reliable code is a lot more important than its performance. Not everything is a commercial product aimed at a wide audience.
What you're "sick of" is mostly irrelevant unless you represent a market that is willing to pay more for a more efficient product. I use commercial apps every day that clearly could work a lot better than they do. But... would I pay a lot for that? No. They are too small a factor in my workday.
People have been sick of slow programs and slow computers since literally forever. I think you live in a bubble or are complacent.
No one I know has anything good to say about Microsoft Teams, for instance. And that's just one of the recent "desktop apps" which are actually framed browsers.
> I for one am sick of 'developer productivity' being the excuse for why my goddamned supercomputer crawls when performing tasks that were trivial even on hardware 15 years older.
The problem here is developer salaries. So long as developers are as expensive as they are the incentive will be to optimise for developer productivity over runtime efficiency.
We've been making developer experience optimizations _long_ before they started demanding high salaries. The whole reason to go from assembly to C was to improve developer experience and efficiency.
It seems fairly reductive to dismiss the legitimate advantages of increased productivity. It's faster to iterate on ideas and products, we gain back time to focus on more complex concepts, and, more broadly, we further open up this field to more and more people. And those folks can then go on to invest in these kind of performance improvements.
> It seems fairly reductive to dismiss the legitimate advantages of increased productivity.
Certainly there are some, but I think we passed the point of diminishing returns long long ago and we're now well into the territory of regression. I would argue that we are actually experiencing negative productivity increases from a lot of the abstractions we employ, because we've built up giant abstraction stacks where each new abstraction has new leaks to deal with and everything is much more complicated than it needs to be because of it.
Hmm... I think our standards for application functionality are also a lot higher. For example, how many applications from the 90s dealt flawlessly with Unicode text?
How much added slowness do you think Unicode is responsible for? Because as much of a complex nightmarish standard as it is[0], there are plenty of applications that are fast that handle it just fine as far as I can tell. They're built with native widgets and written in (probably) C.
[0] plenty of slow as fuck modern software doesn't handle it even close to 'flawlessly'
If developers cost one fifth of what they do now, how many projects that let performance languish today would staff up to the extent that doing a perf pass would make its way to the top of the backlog queue?
Come on now. Let's be honest here. The answer for >90% of projects is either a faster pace on new features, or to pocket the payroll savings. They'd never prioritize something that they've already determined can be ignored.
Hmm, I don’t know about that. When I made a quarter of what I do now, I worked at companies that would happily let me spend two weeks on a task without expecting much in that time.
As far as I can tell, a slow computer is due to swapping from having a bunch of stuff open, or an inherently heavy background task and an OS that doesn't know how to allocate resources.
Sometimes there's some kind of network access bogging everything down, now that SaaS is everywhere (in which case, we need more abstractions and developer productivity to enable offline work!).
Sometimes things are slow because we are doing a lot of heavy media processing. That's not inefficiency, it's just inherently heavy work. In fact, simplicity might actively slow things even more, because what you really need for that stuff is to use the GPU, and the "bloated mega library" you're avoiding might already do that.
Android is kind of the proof. Android is always pretty fast. Anything 15yo hardware could do trivially, Android can probably do faster. And Android is not lightweight.
There may be some slowness out there caused by abstraction layers, but as far as I can tell without digging into the code of every slow app I've ever seen, the real problem is the keep it simple mindset that discourages even basic optimization tricks like caching, and the NIH that makes people write things themselves and assume it's faster just because it's smaller.
Plus, the UNIXy pipeline mentality that did a number on computing. GUI workflows are not pipelines and there's lots of things that are very messy to do in a pipe style model, like changing one thing and selectively recomputing only what needs changing.
The "Data processing" way of thinking leads to producing a batch of stuff, then passing it to something that computes with it, instead of working directly with a state.
> There's always been a tradeoff in writing code between developer experience and taking full advantage of what the hardware is capable of. That "waste" in execution efficiency is often worth it for the sake of representing helpful abstractions and generally helping developer productivity.
The GFLOP/s is 1/28th of what you'd get when using the native Accelerate framework on M1 Macs [1]. I am all in for powerful abstractions, but not using native APIs for this (even if it's just the browser calling Accelerate in some way) is just a huge waste of everyone's CPU cycles and electricity.
Wasteful of computing resources, yes, but for a long time we've been prioritizing developer time. That happens because you can get faster hardware cheaper than you can get more developer time (and not all developer time is equal; say, Carmack can do in a few hours things I couldn't do in months).
I do agree that we'd get fantastic performance out of our systems if we had the important layers optimized like this (or more), but it seems few (if any) have been pushing in that direction.
But we've got far more developers now, and by and large we aren't producing more capable software than we did 20 years ago. It's gotten very complex, but that is largely just a failure to keep things simple.
Spotify is slower to launch on my Ryzen 3900X than my operating system, and lacks many of the basic features that WinAmp had in 1998. Now you're thinking "Aha! But WinAmp just played mp3s, it didn't stream over the internet!" Yes, it could. It was also largely developed by one guy, over the course of a couple of months.
I don't know where this superior developer productivity is going, but it sure doesn't seem to be producing particularly spectacular results.
Back when I was younger we had to develop a few simulations in university, and we spent half the semester coding the building blocks in C. I was slightly good at it, and having seen my brother stumble a couple of years before, I knew I had to be careful with the data structures and general code layout to keep things simple and working.
As this was a group project, there were other hands on deck. One weekend I caught a nasty cold and I couldn't assist the weekly meeting to work on the project. Monday comes and we have to show our advances. The code was butchered. It took me a day to fix what had been broken (and keeping egos at bay, it would've been easier to just throw everything away and implement things from my last stable version).
Now I can fire up python, import numpy and scipy, and make far more complex simulations in a couple of minutes and a few lines of code. Sure, back then python and numpy did exist, I just didn't know about them. But you know what didn't exist 10-15 years ago? PyTorch, TensorFlow, JAX and all the deep learning ecosystem (probably scikit-learn did exist, it's been around forever!). Thanks to those, I can program/train deep learning algorithms for which I probably wouldn't be able to code the lower-level abstractions to make them work. JAX comes with a numpy that runs on hardware "for free" (and there was PyCUDA before that if you had a compatible GPU).
That's the programmer productivity you're not seeing. Sure, these are very specific examples, but we have many more building blocks available to make interesting things without having to worry about the lower layers.
You can also complain about Electron being a bloated layer, and that's OK. There's your comparison about how Spotify is slow and Winamp is/was fast.
That's kinda like saying Bob built a go-kart in his garage over a couple months, it moves you from A to B, I don't see why a Toyota Corolla needs a massive factory. Spotify's product isn't just a streaming media player. It's also all the infrastructure to produce the streams, at scale, for millions of users.
How often are you actually launching spotify? I start it once when I boot and that's it until my next reboot, weeks/months later.
Now you might of course ask, "why isn't the velocity 6554x that of winamp, even when correcting for non-eng staff, management, overhead, etc". Well, they probably simply aren't allocating that much to the client development.
Also often times one dev who knows exactly what he is doing can be way more effective than a team; no bikeshedding, communication, PRs, etc. What happens when they get hit by a bus?
But you can't get faster hardware cheaper anymore. Not naively faster hardware anyways. You are getting more and more optimization opportunities nowadays though. Vectorize your code, offload some work to the GPU or one of the countless other accelerators that are present on modern SOC, change your I/O stack so you can utilize SSDs efficiently, etc. I think it's a matter of time until someone puts FPGA onto mainstream SOC, and the gap between efficient and mainstream software will only widen from that point.
You are precisely telling me the ways in which I can get faster hardware: GPU, accelerators, the I/O stack and SSDs, etc.
I agree that the software layer has become slow, crufty, bloated, etc. But it's still cheaper to get faster hardware (or wait a bit for it; see M1, Alder Lake, Zen 3, to name a few, and those are getting successors later this year) than to get a good programmer to optimize your code.
And I know that we'd get much better performance out of current (and probably future) hardware if we had more optimized software, but it's rare to see companies and projects take on such optimization efforts.
But you can't get all these things in the browser. You don't just increase CPU frequency and get free performance anymore. You need conscious effort to use GPU computing, conscious effort to ditch current I/O stack for io_uring. Modern hardware gives performance to ones who are willing to fight for it. Disparity between naive approach and optimized approach grows every year.
I'm not arguing against your point at all. I'm just pointing at the fact that there are many who are happy to wait a year, buy a 10-20% faster CPU and call it a day, or buy more/larger instances on the cloud, or do anything but optimize the software they use. Some couldn't even if they wanted, because they buy software rather than develop in-house, and aren't effective in requiring faster performance from their vendors.
Everything has a cost. If the developer is a slave to the machine architecture, development is slow and error-prone. If the machine is a slave to an abstraction, everything will run slowly. Unsurprisingly, the real trick is finding the appropriate balance for your situation.
Of course you can make things worse, in both directions.
The real issue here is that the hardware isn't capable of sandboxing without introducing tons of side channel attacks. Lots of applications are willing to sacrifice a lot of performance in order to gain the distribution advantages from a safe, sandboxed execution environment.
This is the real issue. The other comments here are talking about dev-productivity, but they have the causal chain backwards. The browser overhead is about running in a sandbox, on arbitrary/unknown hardware. Web development had poorer DX than desktop app development for a long time, until the critical mass of web-devs (driven by the distribution advantages) finally drove the DX investment to heavier on the web side.
On the other hand, in your sane world, productivity would be a fraction of what it currently is, for developers and users. You favor computer time over developer time. While computer time can be a proxy for user time, it isn't always as developer time can be used to speed up user time too. A single-minded focus on computer time sounds like a case of throwing out metrics like developer time and user time because they are harder to measure than computer time. In any case, it sounds like a mistake to me.
In a sane world (which is the world that we live in), it's best to find a well-optimized library for common operations like matrix multiplication. But if you want to do something unusual (multiply large matrices inside a browser, quickly) you've exited the sane world so you'll have to work at it.
> Let's say I want to interactively plot some complicated function without slowing down the rest. Can I do this in WebAssembly now?
You've been able to do that for a while now and it would likely be fast enough even in pure JavaScript. The things which tend to slow the web down come back to developers not caring about performance (and thus not even measuring) or the cross-purposes of things like ad sales where a significant amount of JavaScript is deployed to do things the user doesn't care about.
Browsers will always lag behind desktop applications by a decade or two because everything is designed by committee and takes forever to arrive (see e.g. SSE, AVX, GPGPU compute). And even if everyone can eventually agree on and implement the smallest common denominator hardware will already have evolved beyond that.
In addition, browsers have to fight all kinds of nefarious attackers, so it is a very hostile environment to develop in. For example, we can't even measure time accurately (or do proper multithreading with shared memory) in the browser thanks to the Spectre and Meltdown vulnerabilities. https://meltdownattack.com/
That being said, WebGL implements extremely gimped shaders. Yet, they are still more than enough to render all kinds of functions. For example, see https://www.shadertoy.com/ or https://glslsandbox.com/ which are huge collections of functions which take screen coordinates and time as input and compute pixel colors from that, i.e. f(x, y, t) -> (r, g, b). This might sound not very impressive on first glance, but people have been amazingly creative within this extremely constrained environment, resulting in all kinds of fancy 3D renderings.
> Let's say I want to interactively plot some complicated function without slowing down the rest. Can I do this in WebAssembly now?
Yep, with a Web Worker for the secondary thread. However, the environment is still young and it's difficult to use heavy computation libraries. Besides, for some reason SIMD instructions are present only for 128-bit units (2 doubles or 4 floats). Another problem is that no matter what you do, it is a layer over the hardware, so it will be slower than specialized machine code (if what you do is not in the JS API).
What do you mean? What's the blocker? Something like numpy for JS would fill this role, calling WASM in the background. Just the missing SIMD instructions? Some quick googling shows that one can't compile BLAS for WASM yet. This might be due to wasm64 not being available yet, I think? So would this help to tap into the existing ecosystem of optimised mathematical routines?
Ideally... I would leave JS and just use Python ;) It has the whole ecosystem at hand with numpy, scipy, statsmodels etc. But nobody is doing it and idk why. I think it might be due to Fortran not compiling to WASM.
> Let's say I want to interactively plot some complicated function without slowing down the rest
If you can draw your plot in a shader then you can do it in WebGL very easily. You'd only need to update the input uniforms and everything else would happen on the GPU.
The BLAS have historically relied on a wide range of microarchitecture-specific optimizations to get the most out of each processor generation. An ideal solution would be for the browser to provide that to the application in such a way that it is difficult to fingerprint.
See also the history of Atlas, GotoBLAS, Intel MKL, etc.
libflame/BLIS might be a good starting point, they've created a framework where you bring your compute kernels, and they make them into a BLAS (plus some other nice functionality). I believe most of the framework itself is in C, so I guess that could somehow be made to spit out wasm (I know nothing about wasm). Then, getting the browser to be aware of the actual real assembly kernels might be a pain.
tl;dr, the M1 can do ~1300 GFLOP/s and the M1 Pro up to ~2700 GFLOP/s.
On the vanilla M1, that's 28 times faster than the best result in the post.
The difference (besides years of optimizing linear algebra libraries) is that Accelerate uses the AMX matrix multiplication co-processors through Apple's proprietary instructions.
People are asking about BLAS already in various threads, but if they know the size of their matrices beforehand, it might be interesting to try Eigen. Eigen also has the benefit that it is an all-template C++ library, so I guess it should be somehow possible to spit out WASM (I know nothing about WASM).
Of course BLAS usually beats Eigen, but for small, known-size matrices, it might have a chance.
I'm by no means an expert, but my understanding is that a lot of the performance from libraries like OpenBLAS comes from targeting specific architectures (e.g. particular instruction sets on a series of processors). You can probably milk some more performance by targeting the WebAssembly architecture specifically (assuming OpenBLAS hasn't started doing something similar themselves).
I tested it on my M1 Mac and it reached 46.78 Gigaflops, which is quite amazing for a CPU running at 3.2 GHz. Isn't that like an average of 14.6 floating point operations per clock cycle?
I hate to post this multiple times, but the M1 has a dedicated matrix co-processor, it can do matrix multiplication at >1300GFLOP/s if you use the native Accelerate framework (which uses the standard BLAS API). The M1 Pro/Max can even do double that (>2600 GFLOP/s) [1].
46.78 GFLOP/s is not even that great on non-specialized hardware. E.g., a Ryzen 5900X, can do ~150 GFLOP/s single-threaded with MKL.
I don't understand the naming and notation of this article because the author is assuming context that I don't have.
Section baseline:
What are N, M, K? 3 matrices or? Laid out as a flat array, or what? `c[m * N + n] += a[m * K + k] * b[k * N + n];`, ah, apparently a b and c are the matrices? How does this work?
Section body:
What is the mathy "C′=αC+A⋅B"? derivative of a constant is the angle times the constant plus the dot product of A and B???
Please, if you write a public blog post, use your head. Not everybody will understand your terse notes.
There are 3 matrices in question: A, B and C. They have dimensions (M * K), (K * N) and (M * N) respectively.
They are laid out, rather than nested arrays, as a single continuous collection of bytes that can be interpreted as having a matrix shape. That's where the `m * N + n` comes from (m rows down and n cols in)
C' = αC + A·B
This is the 'generalised matrix-matrix multiplication' (GEMM) operation. It multiplies the matrices A and B, adds the result to a scaled version of C, and stores that back into C. Setting alpha to 0 gets you basic matmul.
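In code, with the flat row-major layout described above, the whole operation is just this (a naive sketch with no optimizations, using a TypeScript Float32Array in place of the article's C arrays):

```ts
// C (M x N) <- alpha * C + A (M x K) * B (K x N), each matrix stored row-major in one flat array.
function gemm(alpha: number, A: Float32Array, B: Float32Array, C: Float32Array,
              M: number, N: number, K: number) {
  for (let m = 0; m < M; m++)
    for (let n = 0; n < N; n++) {
      let acc = alpha * C[m * N + n];       // "m rows down, n cols in"
      for (let k = 0; k < K; k++)
        acc += A[m * K + k] * B[k * N + n];
      C[m * N + n] = acc;
    }
}
```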
Is "the math stuff" all the optimizations being performed e.g. vectorizing multiplication?
Not trying to sound dismissive here but the core math the post is working with is actually a pretty straightforward matrix multiplication.
The bulk of the discussion focuses on optimizing the execution of that straightforward multiplication algorithm [triple-nested for loop; O(n^3)] rather than making algorithmic/mathematic optimizations.
And again, specific questions are easier to answer. :)
In an ideal world, absolutely! It's a hard problem and there are many attempts to make that happen automatically including polyhedral optimization (Polly[1]) and tensor compiler libraries (XLA[2] and TVM[3]). I work on a project called LoopTool[4] which is researching ways to dramatically reduce the representations of the other projects to simplify optimization scope.