
We have a production service running for years that just mmaps an entire SSD and casts the pointer to the desired C++ data structure.

That SSD doesn't even have a file system on it; instead it directly stores one monstrous struct array filled with data. There's also no recovery: if the SSD breaks, you need to restore all the data from a backup.

But it works, and it's mind-bogglingly fast and cheap.
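
For the curious, the whole trick is roughly this shape (a minimal sketch; the device path, Record layout, and sizes are all invented for illustration):

    // Sketch of the idea, not their actual code. Device path, Record layout,
    // and record count are hypothetical.
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    struct Record {                // hypothetical fixed-size record, 64 bytes
        uint64_t key;
        uint64_t value;
        char     payload[48];
    };

    int main() {
        const size_t kNumRecords = 1'000'000;           // sized to fit the device
        const size_t kBytes      = kNumRecords * sizeof(Record);

        int fd = open("/dev/nvme0n1", O_RDWR);          // raw block device, no filesystem
        if (fd < 0) { perror("open"); return 1; }

        void* base = mmap(nullptr, kBytes, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        // The whole SSD is now "just" an array of Records.
        Record* records = static_cast<Record*>(base);
        records[42].key = 123;                          // this write goes to the device
        msync(base, kBytes, MS_ASYNC);                  // lightweight flush, optional

        munmap(base, kBytes);
        close(fd);
    }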




I've always wanted a Smalltalk VM that did this.

Eternally persistent VM, without having to "save". It just "lives". Go ahead, map a 10GB or 100GB file to the VM and go at it. Imagine your entire email history (everyone seems to have large email histories) in the "email array", all as ST objects. Just as an example.

Is that "good"? I dunno. But, simply, there is no impedance mismatch. There's no persistence layer; your entire heap is simply mmap'd into a blob of storage with some lightweight flushing mechanic.

Obviously it's not that simple, there's all sorts of caveats.

It just feels like it should be that simple, and we've had the tech to do this since forever. It doesn't even have to be blistering fast, simply "usable".


That is so wonderfully fascinating to me. You could just download a file into a variable, and when that variable goes out of scope or has no more references, it'd just be automatically "deleted". Since there's no longer a concrete "thing" called a file, you can organize them however you want, with whatever "metadata" you want: just use a dict holding that metadata plus some convention like :file as the key that points to the body. Arbitrary indexes too; any number of data structures could all share a reference to the same variable.

Simple databases are just made up of collections of objects. Foreign key constraints? Just make the instance variable type a non-nullable type. Indexes? Lists of tuples that point to the objects. More complex databases and queries can provide a set of functions as an API. You can write queries in SQL or you can just provide a map/filter/reduce function with the predicate written in normal code. Graph databases too: you can just run Dijkstra’s algorithm or TSP or whatever directly on a rich persistent data structure.

Thanks for the neat idea to riff on. I like it! Thinking about it in practice makes me a little anxious, but the theory is beautiful.


So, I've occasionally played around with a language that pretty nearly does this.

Mumps is a language developed in 1967, and it is still in use in a few places including the company where I work.

The language is old enough that the first version of it has "if" but no "else". When they added "else" later on it was via a hack worthy of this post: the "if" statement simply set a global variable and the new "else" statement checked that. As a result, "if-else" worked fine but only so long as you don't use another "if" nested within the first "if" clause (since that would clobber the global variable). That was "good enough" and now 50 years later you still can't nest "if" statements without breaking "else".

But this very old language had one brilliant idea: persistence that works very much the way you describe. Any variable whose name begins with "^" is persisted -- it is like a global variable that is global not just to this routine but to every execution of the program.

It is typical to create single variables that contain a large structure (eg: a huge list with an entry for each customer, indexed by their ID, where the entry contains all sorts of data about the customer); we call these "tables" because they work very much like DB tables but are vastly simpler to access. There's no "loading" or impedance mismatch... just refer to a variable.

Interestingly, the actual implementation in modern day uses a database underneath, and we DO play with things like the commit policy on the database for performance optimization. So in practice the implementation isn't as simple as what you imply.


That global persistence model across executions is very fascinating. If you don't mind, could you explain what line of work this is and how it helps your use case? I have encountered similar concepts at my old job in a bank, where programs could save global variables in "containers" (predates Docker, IIRC) and then other programs could access them.



These days, Mumps is mostly used in elderly systems in the insurance and banking industries. In my case it is banking. I work at Capital One and one of the financial cores we use is Profile from FIS, which is (at least in older versions) built in Mumps.


This is what Intel Optane should have given us.

Non-volatile memory right in the CPU memory map. No "drives", no "controllers", no file allocation tables or lookup lists or inodes. Save to memory below, say, 16GB and it's volatile: that's for fast-changing variables. Save to memory above 16GB and it's still there through reboots.


Isn’t that sort of the original idea for how Forth would work? Everything is just one big memory space and you do whatever you need?

I’m going from very hazy memory here.


I think it is, although you have to manually save the current image if you want to keep the changes you made. Which I find entirely reasonable.

I also think that what gp is looking for is Scratch. IIRC it's a complete graphical Smalltalk environment where everything is saved in one big image file. You change some function, it stays changed.


Arguably you could use GemStone/S like that, though it's probably not quite the set of capabilities you want.


It's not Smalltalk, but you might find OS/400 interesting for its single-level store for object persistence.

Old HN discussion with Wikipedia pointers: https://news.ycombinator.com/item?id=18907798


LMDB has a mode designed to do something similar, if anyone wants something like this with just a bit more structure to it, such as transactional updates via CoW and garbage collection of old versions. It's single-writer via a lock, but readers are lock- and coordination-free. A long-running read transaction can delay garbage collection, however.
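
A rough sketch of that mode (assuming MDB_WRITEMAP is the one meant; the path and key/value contents are made up):

    // Writable-map mode: LMDB maps the data file read-write and updates it
    // via copy-on-write transactions. Build with -llmdb.
    #include <lmdb.h>
    #include <cstdio>

    int main() {
        MDB_env* env;
        MDB_txn* txn;
        MDB_dbi  dbi;

        mdb_env_create(&env);
        mdb_env_set_mapsize(env, 1UL << 30);        // 1 GiB memory map
        // MDB_WRITEMAP: closest to "just mmap it" while keeping transactions.
        // MDB_NOSUBDIR: use a single file instead of a directory.
        mdb_env_open(env, "./data.lmdb", MDB_WRITEMAP | MDB_NOSUBDIR, 0664);

        // Single writer: this transaction holds the write lock until commit.
        mdb_txn_begin(env, nullptr, 0, &txn);
        mdb_dbi_open(txn, nullptr, 0, &dbi);

        char k[] = "answer";
        char v[] = "42";
        MDB_val key{sizeof(k) - 1, k};
        MDB_val val{sizeof(v) - 1, v};
        mdb_put(txn, dbi, &key, &val, 0);
        mdb_txn_commit(txn);                        // CoW pages become the new root

        // Readers never block: a read txn sees a consistent snapshot, but
        // holding it open pins old pages and delays their reuse.
        mdb_txn_begin(env, nullptr, MDB_RDONLY, &txn);
        MDB_val out;
        if (mdb_get(txn, dbi, &key, &out) == 0)
            printf("%.*s\n", (int)out.mv_size, (char*)out.mv_data);
        mdb_txn_abort(txn);

        mdb_env_close(env);
    }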


Wow. How do design decisions get made that result in these types of situations in the first place?


Someone says "hey if we had 900GB of RAM we could make a lot of money" and then someone else says "that's ridiculous and impossib- hang on a minute" and scurries off to hack together some tech heresy over their lunch break.


By the way, you can find single servers with 32TB of RAM nowadays.


Honestly it's not too far off from what many databases do if they can. They manage one giant file as if it were their own personal drive and ignore the concept of a filesystem completely.

Obviously that breaks down when you need to span multiple disks, but conceptually it really is quite simple. A lot of the other stuff file systems do is there to help keep things consistent. But if there's only one "file" and you don't ever need metadata, then you don't really need any of that.

Very smart solution really.


Yeah, a lot of database storage engines use O_DIRECT because the OS's general-purpose cache heuristics are inferior to their own buffer pool management. That said, if you try this naively, you're likely to end up with something a lot worse than the Linux kernel.
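
For reference, the O_DIRECT pattern looks roughly like this (the file name and block size are placeholders):

    // The engine bypasses the kernel page cache and manages its own aligned
    // buffers ("buffer pool"). O_DIRECT requires block-aligned buffers,
    // offsets, and lengths.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdlib>
    #include <cstdio>

    int main() {
        const size_t kBlock = 4096;                   // must match device alignment
        int fd = open("tablespace.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, kBlock, kBlock) != 0) return 1;

        // Read one page straight from the device into "our" buffer pool slot;
        // the kernel page cache is not involved.
        ssize_t n = pread(fd, buf, kBlock, 0);
        printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
    }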


If I had to guess:

Doing it this way = $

Doing it that way = $$$


It's a very reasonable thing to do if you need performance.


I do similar things with mmap and dumping raw structs to get insane speeds one wouldn't expect from traditional databases.

Perhaps you could even pause operations, snapshot with dd, and then resume everything in order to get a backup.


Linux has had methods to avoid fsync on filesystems, and if you used an SSD and a UPS you would usually have no problems. Pixar used that to write GBs of renders and media, for instance.


Very interesting. Can you give a sense of the speed up factor?


I love this one.

If anyone from AWS reads your comment, they could have an idea for a new "product" xD


Sounds fragile; C++ compilers are permitted to do struct padding essentially as they please. A change in compiler could break the SSD<-->struct mapping (i.e. the member offsets).

C++ arrays, on the other hand, are guaranteed not to have padding between their elements, and a flat array is essentially what memory-mapped IO gives you out of the box.

https://stackoverflow.com/a/5398498

http://www.catb.org/esr/structure-packing/#_structure_alignm...
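
One common way to defuse that fragility (not necessarily what the parent does) is to pin the layout down so a compiler change breaks the build rather than the data; the Record struct here is hypothetical:

    // Freeze the on-disk layout with static_asserts: if a compiler, ABI, or
    // flag change ever shifts an offset, compilation fails loudly.
    #include <cstdint>
    #include <cstddef>
    #include <type_traits>

    struct Record {
        uint64_t key;        // offset 0
        uint32_t flags;      // offset 8
        uint32_t pad;        // explicit padding instead of implicit padding
        uint64_t value;      // offset 16
    };

    static_assert(std::is_trivially_copyable<Record>::value, "must be memcpy-able");
    static_assert(std::is_standard_layout<Record>::value,    "no layout surprises");
    static_assert(sizeof(Record) == 24,                      "on-disk size changed");
    static_assert(offsetof(Record, key)   == 0,  "layout changed");
    static_assert(offsetof(Record, flags) == 8,  "layout changed");
    static_assert(offsetof(Record, value) == 16, "layout changed");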


I'm not familiar with C++ rules (the linked answer seems very suspect to me, their argument being "if you change the struct, the struct changes!"), but they could absolutely just be declaring them as extern "C" to use C's layout rules.





