Statically Recompiling NES Games into Native Executables with LLVM and Go (2013)

ZenPsycho · on March 18, 2016

I have actually been rather obsessed with this lately. You'll see that there's a few problems he runs into that he deems insurmountable, which sounds like a challenge! Specifically, there's the issue with an instruction which is effectively a computed goto:

A jump instruction that takes an address and then jumps to the address STORED at that address. Since there is no way to know at compile time what addresses are going to stored at a place, you're forced to then dynamically emulate the whole memory space of the actual NES to accurately calculate it, thus defeating the whole point.

Is that the only solution though? a head scratcher!

Further are issues with parts of code that, on some level seems to be taking inspiration from genetics: Jump to one alignment, and the instructions get interpreted one way, jump to a different alignment and the same sequence of bytes is interpreted by the CPU as an entirely different set of instructions. I wonder if that could be resolved by creating a different source code path for each alignment using flow analysis- A space saving technique effectively getting uncompressed.

stormbrew · on March 18, 2016

I suppose it's possible this could be considered "emulating the whole memory space," though I wouldn't consider it that, but you could just generate an offset table for all jumpable memory locations and use them to calculate the correct offset. It's quite likely you could even do that in less than O(n) space with some time trade-offs.

ZenPsycho · on March 18, 2016

I was commenting on my memory of the article.

Upon rereading it, it seems the real insurmountable challenge is interrupts! It always seems like at that level of detail with this stuff, it becomes a decision to go slower in order to simulate NES hardware accurately. But my question is: how much code depends on that accuracy, and how much can you compromise for speed? A further question I have is can you do a cross game analysis of the whole NES library and find common patterns, reused functions, that you decompile into a kind of high level conversion that would hopefully or gloss over the need for specific functions to have that low level accuracy.

jtolmar · on March 18, 2016

I'm pretty sure Nesticle had nothing resembling accurate timing, and it could run most games.

You can go well beyond several of the NES's limitations by fiddling with PPU parameters between scanlines, but that takes careful timing and almost nothing actually does that. Might only be democoders. If you're willing to ignore the possibility that someone did that, you should be able to just process a whole frame's worth of PPU at NMI time.

Super Mario Bros spends most of its time in an infinite loop, waiting for NMI to pull it out so it can start processing the next frame. If NMI triggers before this, you get a lag frame. Plenty of popular emulators get the wrong lag frames, but only speedrunners and TASers really care. So, for this game, you could possibly toss out the timing entirely and trigger NMI when you hit a busy loop. Your emulator will completely ignore all lag frames, but most people won't notice for at least this game.

(It's been over a decade since I was into ROM hacking so my memory could be faulty on any of this.)

khedoros · on March 18, 2016

SMB also relies on polling the sprite0 flag to know when to write to the x-scroll value. If the timing isn't right, most of the playfield jitters back and forth. Same thing happens with other ROMs that use the same method (like Balloon Fight's "Trip" mode and, bizarrely, Arkanoid). My emulator assumes that there won't be a lot of shenanigans (PPU writes) before the sprite0 flag, and pre-calculates the cycle that it'll hit on. That seems to work.

One of the Ninja Turtles games shifts the x scroll back and forth a couple times per line during some of the opening screens. Skate or Die changes which CHROM bank to use for background tiles a couple times per frame in its inter-level screens. I solve that by logging PPU writes (address, value, cycle it would be received). In practice, most of those mid-frame writes don't change anything that matters to the CPU, but I need the info for rendering.

Rendering the scene at NMI without logging writes to the PPU works for a lot of games, but there are plenty that do mid-frame PPU changes (and it starts getting more difficult with more-advanced mappers including their own interrupts).

Drakim · on March 18, 2016

Interrupts need to be fairly accurate. Lots of games use them to have unmoving status bars, and rely on them triggering on just the right scanline.

ZenPsycho · on March 18, 2016

I wonder if for things that happen between and among scan lines, how difficult it would be to statically analyse the code to find out what's running during that time, and precalculate when the writes would fire.

another approach could be getting runtime information from a running emulator, and record it onto a file the compiler could use to close the gaps you need but can't get from static analysis.

Drakim · on March 18, 2016

The issue is that nothing is stopping the developers from doing all kinds of crazy loops and jumps in the code that's invoked by the interrupt, making it into a halting problem.

Unless the interrupt handler is trivially simple, you can't know how many cycles it will take to finish.

ZenPsycho · on March 18, 2016

the advantage the compiler writer has in this case is it doesn't have to deal with any possible code, just this set of well known already released and frozen in time programs.

Drakim · on March 18, 2016

I don't think that really avoids the problem. I mean, what if you do a JMP to an address decided based on a value gained from player input? How is the compiler supposed to guess that in advance?

ZenPsycho · on March 18, 2016

the mathematical proof of the halting problem, as I understand it, would in this context be based on the notion that user input could somehow direct the program to write a program in ram and get it to execute.

This has in fact been achieved in games like super mario world or pokemon!

It seems to me sometimes that educated programmers call halting problem too early on situations like this where you don't actually need a rock solid "provably correct" compiler, just a compiler that does something reasonable in this bounded set of cases, and we don't mind that much if it crashes sometimes, since it's not like an NES game is a rocket or a therac.

Drakim · on March 18, 2016

Interrupts isn't something you can compromise on though, it would break all but the most trivial NES games (SMB and SMB3 both rely on interrupts).

I assume the trouble with interrupts is the "interrupting", as the location to jump to could be kept in a prefilled table like stormbrew suggested for JMPs.

If the target instruction-set doesn't have any way of "interrupting" that could be used, could it instead be solved by inserting a compact interrupt check between every "line" of assembly?

ZenPsycho · on March 18, 2016

My understanding is the trouble with interrupting is that from the point of view of mainstream code, the state of all the global variables just changed out from under you. as if by another thread with shared memory.

Which makes me naively wonder if you could just run interrupts on a parallel thread.

Drakim · on March 18, 2016

The NES just let developers deal with that, pretty much all interrupt code starts with the developer frantically saving the registers somewhere so that he can restore them at the end of the interrupt.

niccaluim · on March 18, 2016

Isn't that required of interrupt handlers on any architecture, even modern ones?

Drakim · on March 19, 2016

Is it? I was so sure that modern architectures could just push your entire set of registers away to a stack with a single instruction, since it's so common.

niccaluim · on March 20, 2016

I honestly don't know. I haven't done any assembly since the days of the 680x0.

nathancahill · on March 18, 2016

I had a fun time playing Prince of Persia on DOSBox. You can change the number of CPU cycles the emulator uses[0], which can cause interesting glitches when it executes too fast.

[0] http://www.dosbox.com/wiki/Performance

devbug · on March 18, 2016

TLB :)

gregpardo · on March 18, 2016

Back in the romhacking scene we had emulators that the longer you would play through the more they could map out the entire assembly/data. I also had a friend who wrote a disassembler that used this information as well as some tasty algorithms to get a complete disassembly of SNES games.

DannyBee · on March 18, 2016

" Since there is no way to know at compile time what addresses are going to stored at a place"

Why? This sounds like symbolic evaluation. You certainly can't know in all cases. But at worst, you can come up with the set of possible jump targets.

RodgerTheGreat · on March 18, 2016

Part of the issue is that even with pretty fine-grained symbolic execution there are simple programming patterns which can induce the entire memory space, or a very large portion of it, as a possible jump target.

theoh · on March 18, 2016

Are those patterns likely to be used, though? If one is detected, could the tool we're hypothesizing about inform a human who'd be able to understand what was going on?

I'm always confused between symbolic evaluation and abstract interpretation but, according to Wikipedia, abstract interpretation is the more general term. I'd love to have the time and the ability to work on this kind of thing, up to and including partial evaluation. Really a lot of potential in this area, I think.

RodgerTheGreat · on March 19, 2016

If you're really interested, the decompiler I built as part of this project performs a symbolic execution of programs for a very simple architecture and (conservatively) tracks the values which could reach registers at a given point in the program:

https://github.com/JohnEarnest/Octo/blob/gh-pages/js/decompi...

I call the technique "register smearing" and it's only remotely feasible because Chip8 has lots of registers (if an accumulator was constantly being clobbered you wouldn't get much useful information) and programs are exceedingly small (<3.5kb).

As you'd expect, for simple programs this works great and has even helped find bugs in some of the example Chip8 ROMs in the wild. However, it rapidly breaks down when you start working with memory-intensive programs or anything involving self-modifying code. There's still room for improvement, but at the end of the day you can't solve the halting problem and there is a diminishing return on greater complexity in your decompiler.

gulpahum · on March 18, 2016

One major problem are games which generate code into ram and then execute it. I can't remember if there were any NES games doing that, but I've seen other 6502 based games doing that.

jimsmart · on March 18, 2016

My anecdotal evidence ties with yours, to a greater degree. Not specifically generating code to run, but I certainly wrote plenty of self-modifying code when coding games on the C64, although it was never a trick I personally used on the NES - two reasons: by default your code runs from ROM anyhow (so there's just less of a tendency to consider self-mod code in the first place), plus the RAM on the NES is so very limited.

Drakim · on March 18, 2016

Maybe a tiny VM could be added just for runtime generated code.

(but to be honest I don't think there is a single NES game that does that)

TickleSteve · on March 18, 2016

dynamically generated code was quite a common technique in the era of 6502. Its frowned upon now, but it was a major optimisation technique back then.

Drakim · on March 18, 2016

Do you have any resources on such techniques and tricks? I'm very much into 6502 from a NES background but I know little about dynamically generated code. I understand how it works, I just can't imagine what I would potentially use for it. :)

jerf · on March 18, 2016

Speed. 6502s are 8-bit processors running in the neighborhood of 1MHz. Anything you can do to squeeze out a cycle is worth it. Rather than writing something that will repeatedly examine the data, and makes decisions, and then does things, write code that simply does things. Much faster, if you can spare the RAM for the code. Plus if you're really careful and clever, the code to "do things" can itself be the data! Sometimes, anyhow.

Also, the world has changed a lot since then. Interpreters have less penalty on a modern chip than an old stupid chip, because branch prediction, prefetching, and multiple pipelines can really help with them, so it's relatively speaking cheaper to examine data and make decisions and the CPU will spend more time "doing things" as long as the data required and the branches taken are predictable, which they often are in this sort of code. And on the flip side, modern processors really want your code to be static, precisely so that all those optimizations can work well, along with code caches, micro-op caches, etc... constantly changing the code isn't good for performance on modern chips. The 6502 doesn't care how much the code is changing, it just executes the next opcode at the same speed regardless.

Very different world.

gulpahum · on March 18, 2016

One case would be configurable code. The code is copied to ram and then tweaked based on some parameters. The modified code runs faster than having a bunch of branches to check/calculate result based on the parameters. It might even be easier to write that kind of code.

Another case is to access a large range of memory, for instance, to fetch data from a large table. You have the code in ram and increment the high-byte address (HH) of a 'lda $HH00,x' or 'sta $HH00,x' to access larger ram area (because x index can only access 256 bytes). I've seen that in Vic-20 games, I don't know if NES games used it.

One case is loop unrolling, a.k.a. speedcode. This is very common practice in modern 6502 demos. I don't think that the old NES games used it, but some modern NES demos may use it. See http://codebase64.org/doku.php?id=base:speedcode or http://csdb.dk/forums/index.php?roomid=11&topicid=96279&show...

gp2000 · on March 18, 2016

I don't have a 6502-specific example, but scaling an image with point sampling is one place I've seen it done. The inner-loop to scale a line might look like this:

  frac = 0;
  while (len > 0) {
    for (frac += scale; frac >= 1; frac--)
      *dst++ = *src;
    src++;
  }

Instead of repeating the work of the inner loop every time, generate the code that has the scale baked in. For instance, doubling the image would output this code fragment repeated for the width of the source image:

  *dst++ = *src; *dst++ = *src; src++;

And scaling down by half would be:

  *dst++ = *src; src++; src++;

Though whether this is faster or the best technique really depends on the processor. It might be just as easy to pre-compute a lookup table pointing to the offset of the source pixel for each destination one. That's something an 8086 could do fairly easily and maybe a 6502, but not so much for a Z-80.

woodman · on March 18, 2016

The only potential practical use that I can think of, outside of procedurally generated decompression engines, is this:

You find yourself on an uncharted desert island with two sailors, a movie star, some other lady, a millionaire and his wife... and a crate of 4k roms. For reasons that would take far too long to explain here - your only salvation is to recreate the Atari game catalog on your coconut game console.

jimsmart · on March 18, 2016

On the C64, when scrolling, the fastest way to update the screen colour RAM (which cannot be pointed to another address, unlike the screen character map) is to do it all with immediate LDA #$xx / STA $D800 / LDA #$xx / STA $D801 / etc (where xx are your colours, and $D800 is the base address of the colour RAM) - then during the 7 frames of pixel scrolling, you both offset-copy the screen characters to the second screen buffer, and also move all the character colour info (the values throughout the previous LDAs) through the colour RAM update code, and on the 8th frame you flip character screens and then call the colour RAM splat subroutine.

[edit] It's much faster than the obvious solution, which is to do LDA $D801 / STA $D800 / LDA $D802 / STA $D801 / etc. Or, even worse, a loop incrementing the X register with LDA $D801,X / STA $D800,X / etc. Or worse still: indirection via zero-page (though no-one should really ever consider that for colour RAM updates, even though at first glance it seems clever for moving the characters between screens, it eats too much time)

(This technique is pretty much only applicable if your scroll is fixed-direction and fixed-speed)

[edit #2] Credit for that goes to Jon Williams (Shadow Dancer, C64 - and others) for adding that optimisation to my scroll routines that he used in SD - and for then telling me the trick :)

peteri · on March 18, 2016

Sadly I don't have any at the moment but I recall using self modifying code to write graphics to an Apple ][ hires screen.

Thinking about it I wonder why I didn't copy the core of that down to the zero page and write the value into the instructions there. STA $12 is a cycle faster than STA $1234 maybe I couldn't find the space (or I was being kind to the OS + basic)

ris · on March 19, 2016

"Since there is no way to know at compile time what addresses are going to stored at a place"

Well, this is exactly what LLVM's (admittedly limited) mem2reg pass is for.

maplant · on March 18, 2016

Well clearly it is not impossible to recompile the program, perhaps by hand, into a different instruction set. Arguably this may be considered source to source translation. Sure it is hard, and there's no program that will EVER be able to do it automatically I assume. But that does not leave out the possibility of hand translation, which may prove an effective means by super skilled programmers of the future.

onnoonno · on March 18, 2016

Yes, it looks like the real problems occur (as also mentioned in his post) whenever encountering self-modifying code.

I wonder whether it would be possible to detect and then either pattern match or manually resolve those cases of self-modifying code, in case they are few and contained to a small section of code each?

orik · on March 18, 2016

here's the previous discussion: https://news.ycombinator.com/item?id=5838326

(i've realized; is this even necessary with the 'past' button?)

Splines · on March 18, 2016

I never knew there was a "past" button...

noobermin · on March 18, 2016

Oh dear, I'm coming up on two and a half years on HN (time flies!) and I didn't know either...

boulos · on March 18, 2016

Don't worry, it's recent and mostly just designed to help you figure out if you should flag an article as a dupe.

hias · on March 18, 2016

Never seen that button. Speaks for the UI designer, that a comment about a function is more visible than the function itself ;-) Just kidding, I am a fault for not looking right!

ZenoArrow · on March 18, 2016

Whilst we're on the subject, there's another function you may have missed. If you've got showdead on (via your account settings) and you see a dead post or story that you thought was worth sharing, click on the timestamp by the post and click on the 'vouch' link there to suggest that it shouldn't be hidden.

SixSigma · on March 18, 2016

I should really be a bit more visible

madez · on March 18, 2016

I can’t help it. Reading this feels a bit like hidden political propaganda. It’s ridden with subtle and not so subtle negative references to gcc, the fsf and ideals.

Probably it’s me reading too much into it but it makes it hard to enjoy.

tibbon · on March 18, 2016

Does anyone have information with how some of the "multi cart" games worked, like Mario/Duck Hunt/Track Meet? Surely, all 3 games didn't fit in 32k right?... right?

http://nintendo.wikia.com/wiki/3-in-1_Super_Mario_Bros._/_Du...

jimsmart · on March 18, 2016

An educated guess (having coded the NES) is that it's simply a bigger ROM, and the cart contained some kind of MMC chip [0] which allows different sections from the ROM to be paged-in - the menu screen contains the code to page-in the applicable bank from the ROM, and then the game runs per normal, not being aware of any of this.

[0] https://en.wikipedia.org/wiki/Memory_management_controller

andrewvijay · on March 18, 2016

Too low level for me. But since that I've started go as my first low level statically typed language, I think I'll just bookmark this article and may be read after a few years!

spriggan3 · on March 18, 2016

> Too low level for me. But since that I've started go as my first low level statically typed language

I don't think any garbage collected language can be called "low level". If you really want to go low level, learn C and ASM. Manual memory management is the real deal.

dkopi · on March 18, 2016

I don't think any compiled or assembled language can be called "low level". If you really want to go low level, learn Machine code and CPU architecture.

...

I don't think any executed language can be called low level. If you really want to go low level, learn hardware design, VHDL/VERLIOG

...

I don't think any hardware description language can be called low level. If you really want to go low level, build your own logic gates out of transistors

...

daodedickinson · on March 18, 2016

I'm with Nietzsche, the real low-level programming involves feeling new feelings and somehow communicating/philosophizing/programming them into other people.

lmm · on March 18, 2016

Go is not appreciably more low level than e.g. Haskell though. It leaves out high level constructs but it doesn't offer you any more control to make up for it.

vvanders · on March 18, 2016

Having done all 3 of those, anything that involves manual memory management counts as low level in my book.

andrewvijay · on March 18, 2016

Ha ha.

andrewvijay · on March 18, 2016

"for me" is something you all have to notice. I started learning programming while doing hard labor at day. I have been in this only for a couple of years and I'm not a computer science graduate. So that is why I said it's low level for me. Anyway your explanation seems to be reasonable. Think I learnt some.

spriggan3 · on March 18, 2016

Don't take things personally. My intent wasn't to belittle you or whatever. We are having a simple technical discussion.

andrewvijay · on March 19, 2016

Totally understand. No offense taken! Cheers!

sleepychu · on March 18, 2016

Everything is relative. Also I think your parent was referring to the NES reimplementation.

gcc_programmer · on March 18, 2016

gcc -Wall -fno-diagnostics-color

If you are going to list architectures supported, also list the ones gcc supports - it's, basically, all of them.

Llvm+clang is nice, but not better, drink the kool aid. Gcc performance is still higher, and gcc is free software.

techdragon · on March 18, 2016

Clang is also free... Just not your preferred definition of Free.

gcc_programmer · on March 18, 2016

Free as in free beer, not free as in free speech :P

over · on March 18, 2016

The FSF considers BSD a free software license, it's just non-copyleft. There was a lot of discussion on Groklaw about taking BSD code, modifying it, and slapping a GPL license on top. I believe the consensus was this is OK, since you are still respecting the BSD license terms.

sgt101 · on March 18, 2016

also from http://www.gnu.org/philosophy/free-sw.en.html...

>Freedom 3 includes the freedom to release your modified versions as free software. A free license may also permit other ways of releasing them; in other words, it does not have to be a copyleft license. However, a license that requires modified versions to be nonfree does not qualify as a free license.