Interesting post, but the last bit of logic pointing to the Neural Engine for MLX doesn’t hold up. MLX supports running on CPU, Apple GPU via Metal, and NVIDIA GPU via CUDA: https://github.com/ml-explore/mlx/tree/main/mlx/backend
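For example (a quick sketch against the mlx Python package on a Metal-capable Mac; note the Neural Engine is not one of the selectable devices):

    # Same matmul dispatched explicitly to the CPU backend and the Metal GPU backend.
    import mlx.core as mx

    a = mx.random.normal((1024, 1024))
    b = mx.random.normal((1024, 1024))

    c_cpu = mx.matmul(a, b, stream=mx.cpu)   # CPU backend
    c_gpu = mx.matmul(a, b, stream=mx.gpu)   # Metal GPU backend
    mx.eval(c_cpu, c_gpu)                    # MLX is lazy; force evaluation

    print(mx.allclose(c_cpu, c_gpu))         # same result from either backend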
More specific info on the reference card is available in the paper's supplemental information. https://ieeexplore.ieee.org/ielx8/83/10795784/11125864/supp1.... Basically, they used special paper with a pro-grade Canon inkjet printer, along with a special ICC color profile.
Verification is indeed the majority of the time spent. Unlike software, Verilog, VHDL, and higher-level languages like Chisel aren’t executed serially by the hardware they describe, the way a program runs on a von Neumann machine. Hello World for a chip isn’t designing the circuit, or simulating it, or synthesizing it to some set of physical primitives. No, it’s proving that the circuit will behave correctly under a bunch of different conditions. The less commoditized the product, the more important it is to know the real PDK, the real standard-cell performance, what to really trust from the foundry, and so on. Most of the algorithms that assist in this process are proprietary and locked behind NDAs. The open source tools are decades behind the commercial ones in both speed and correctness, despite heavy investment from companies like Google.
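To make that concrete in miniature (a toy sketch in plain Python rather than real RTL and a real testbench; the saturating adder, its bug, and the stimulus constraints are all made up for illustration), the “program” that matters is the checking loop, not the design:

    import random

    WIDTH = 8
    MAX = (1 << WIDTH) - 1

    def golden_sat_add(a, b):
        """Reference model: 8-bit saturating add."""
        return min(a + b, MAX)

    def dut_sat_add(a, b):
        """Stand-in 'design under test' with a deliberate corner-case bug:
        the saturation threshold is off by one, so a + b == 256 wraps to 0."""
        total = a + b
        return total & MAX if total <= MAX + 1 else MAX

    def run_constrained_random(trials=10_000, seed=0):
        rng = random.Random(seed)
        failures = []
        for _ in range(trials):
            # Constrained-random stimulus: bias operands toward the corners.
            a = rng.choice([0, 1, MAX - 1, MAX, rng.randrange(MAX + 1)])
            b = rng.choice([0, 1, MAX - 1, MAX, rng.randrange(MAX + 1)])
            expected, got = golden_sat_add(a, b), dut_sat_add(a, b)
            if expected != got:
                failures.append((a, b, expected, got))
        return failures

    if __name__ == "__main__":
        for a, b, exp, got in run_constrained_random()[:5]:
            print(f"MISMATCH a={a} b={b} expected={exp} got={got}")

Real flows do this against simulated or formally analyzed RTL with commercial tools, coverage metrics, and corner conditions from the foundry, which is exactly the part that stays locked behind NDAs.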
And so my point: the place where people best know how to make chips competitively in a cutthroat industry is NOT in schools, but in private companies that have signed all the NDAs. The information is literally locked away, unable to diffuse into the open where universities efficiently operate. Professors cannot teach what they don’t know or cannot legally share.
Chip design is a journeyman industry. Building fault-tolerant, fast, power-efficient, correct, debuggable, and manufacturable designs is table stakes, because if you can’t deliver all of that, there are already a ton of chip varieties to choose from instead. Don’t reinvent the wheel: the intersection of logic, supply chain logistics, circuit design, large-scale multi-objective optimization, chemistry, physics, materials science, and mathematical verification is unforgiving.
There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to capture similarity. The authors use it to compute something they call the Creativity Index [1].
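Roughly the flavor, as I understand it (this is not the authors’ actual pipeline; the toy vectors, the threshold, and the “unmatched fraction” below are all made up to illustrate soft n-gram matching with word vectors):

    import numpy as np

    # Toy 3-d "word vectors" standing in for real word2vec embeddings:
    # animals cluster together, floor coverings cluster together, etc.
    EMB = {
        "the": np.array([0.0, 0.0, 1.0]),
        "on":  np.array([0.1, 0.0, 1.0]),
        "cat": np.array([1.0, 0.1, 0.0]),
        "dog": np.array([0.9, 0.2, 0.0]),
        "sat": np.array([0.5, 0.9, 0.1]),
        "lay": np.array([0.6, 0.8, 0.1]),
        "mat": np.array([0.1, 1.0, 0.5]),
        "rug": np.array([0.2, 0.9, 0.5]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def soft_match(g1, g2, threshold=0.9):
        # Two n-grams "match" if every positionwise word similarity is high.
        return all(cosine(EMB[a], EMB[b]) >= threshold for a, b in zip(g1, g2))

    def unmatched_fraction(candidate, corpus, n=3):
        """Fraction of the candidate's n-grams with no soft match in the
        corpus: a crude, illustrative stand-in for 'how novel is this text'."""
        corpus_grams = ngrams(corpus, n)
        cand_grams = ngrams(candidate, n)
        novel = [g for g in cand_grams
                 if not any(soft_match(g, c) for c in corpus_grams)]
        return len(novel) / max(len(cand_grams), 1)

    corpus = "the cat sat on the mat".split()
    candidate = "the dog lay on the rug".split()
    print(unmatched_fraction(candidate, corpus, n=3))  # low: it's a near-paraphrase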
I take your example of multitasking operating systems as not being limited to making asynchronous I/O friendlier, but I do think a deeper consideration of Multics is coincidentally appropriate.
The telephone and electrical power networks were vast in scope (and still are), enabling interstate communication and power utilities, echoes of the transportation utilities enabled by the railroads. Multics was architected partly with the commercial goal of scaling up with users: a computing utility. But in a time of especially expensive memory, a large, always-resident kernel was a lot of overhead. The hardware needed a lot of memory and would be contending with a communication network whose latency could not be specified at OS design time. Ergo, asynchronous I/O was key.
Put differently, Multics bet that computing hardware would continue to be expensive enough to be centralized, thereby requiring a CPU to contend with time-sharing across various communication channels. The CPU would be used for compute and scheduling.
Unix relaxed the hardware requirements significantly at the cost of programmer complexity. This coincided roughly with lower hardware costs, favoring compute (in broad strokes) over scheduling duties. The OS should get out of the way as much as possible.
After a bunch of failed grand hardware experiments in the 1980s, the ascendant Intel rose with a dominant but relatively straightforward CPU design. Designs like the Connection Machine were distilled into out-of-order execution, a runtime system that could extract parallelism while contending with variable latency induced by the memory subsystem and variable instruction ordering. That limited asynchronous execution was mostly hidden away from the programmer until more recently, with HeartBleed.
Modern SoCs encompass many small cores, each running a process or maybe an RTOS, along with multiple CPU cores, many GPU cores, SIMD engines, signal-processing engines, NPU cores, storage engines, etc. A special compute engine for all seasons, ready to be configured and scheduled by the OS on the CPU, but whose asynchronous nature (a scheduling construct!) is no longer hidden from the programmer.
I think the article reflects how, even on a single computer, the duty of the CPU (and therefore the OS) has in some cases tilted toward scheduling over compute. And of course, this is without yet considering cloud providers, the spiritual realization of a centralized computing utility.
These are good points. I hadn't thought about the perspective that the central processor in a heterogeneous multicore system may spend a lot of its time orchestrating rather than computing—whether it's a GE 635 with its I/O controllers https://bitsavers.org/pdf/ge/GE-6xx/CPB-371A_GE-635_System_M..., an IBM 360 with its "channels" https://en.wikipedia.org/wiki/IBM_System/360_architecture#In..., or a SoC with DSP cores and DMA peripherals—but it's obviously true now that you say it. I've seen a number of SoCs like the S1 MP3 player and some DVD players where the "central processor" is something like a Z80 or 8051 core, many orders of magnitude less capable than the computational payload.
(One quibble: I think when you said "HeartBleed" you meant Meltdown and Spectre.)
I think there have always been significant workloads that mostly came down to routing data between peripherals, lightly processed if at all. Linux's first great success domains in the 90s were basically routing packets into PPP over banks of modems and running Apache to copy data between a disk and a network card. I don't think that's either novel or an especially actors-related thing.
To the extent that a computational workload doesn't have a major "scheduling" aspect, it might be a good candidate for taking it off the CPU and putting it into some kind of dedicated logic—either an ASIC or an FPGA. This was harder when the PDP-11 and 6502 were new, but now we're in the age of dark silicon, FPGAs, and trillion-transistor chips.
It's too bad that, while they took the physicality of the monkeys into account, they assumed the key probabilities were IID. It would have been nice to see the effect of keyboard layout on the overall probabilities. Key mashing would clearly make nearby keys much more likely to be pressed in sequence, implying that there might be an optimal keyboard layout for each phrase. And that’s before considering soft keyboards with autocomplete.
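A quick toy model of that intuition (the layout coordinates and the distance-decay transition model are made up, just to contrast a proximity-based Markov model with uniform IID keypresses):

    import math

    # Approximate QWERTY letter coordinates (x, y); a made-up but plausible
    # layout model, just to give "nearby keys" a concrete meaning.
    ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    POS = {ch: (col + 0.5 * row, float(row))
           for row, line in enumerate(ROWS) for col, ch in enumerate(line)}
    LETTERS = sorted(POS)

    def dist(a, b):
        (x1, y1), (x2, y2) = POS[a], POS[b]
        return math.hypot(x1 - x2, y1 - y2)

    def next_key_probs(current, temp=1.0):
        # Key-mashing model: the next key is exponentially more likely the
        # closer it sits to the current key.
        weights = {k: math.exp(-dist(current, k) / temp) for k in LETTERS}
        total = sum(weights.values())
        return {k: w / total for k, w in weights.items()}

    def prob_markov(word):
        p = 1.0 / len(LETTERS)                   # first key uniform
        for prev, nxt in zip(word, word[1:]):
            p *= next_key_probs(prev)[nxt]
        return p

    def prob_iid(word):
        return (1.0 / len(LETTERS)) ** len(word)  # every key equally likely

    for word in ["sad", "zip"]:   # clustered letters vs. spread-out letters
        print(f"{word}: markov={prob_markov(word):.2e}  iid={prob_iid(word):.2e}")

Under IID the two words are equally likely; under the proximity model the clustered-letter word comes out far more likely, which is why the layout (and any per-phrase optimal layout) would matter.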