
> Most of Nim’s standard library collections are based on hashing and only offer O(1) behavior on the average case. Usually this is not good enough for a hard real-time setting and so these have to be avoided. In fact, throughout Nim’s history people found cases of pathologically bad performance for these data structures. Instead containers based on BTrees can be used that offer O(log N) operations for lookup/insertion/deletion.

I'm surprised that hash maps can stray far enough from O(1) to become unusable in hard real-time. Is it because RT systems have to use lower-quality PRNGs?



Probably a better way of looking at it is that hard real time is hard, in several senses of the term. You don't really want any operations with variable time at all. If that feels like a really difficult specification to work with, it is.

Soft real time is more common. It theoretically has the same considerations, but in general you have enough performance headroom that you can wave away many of these difficulties: yes, technically my hash table is not 100% reliable every time, but I can show it'll never have more than 128 entries, and I've got 25ms targets for this task I'm doing, and I've got 32MB when my code only uses about 4 max, and these numbers are all large enough that I can just not care about the resizing operation.

I've got many systems I've written that are just straight-up accidentally soft-real-time in practice simply because running them on even a moderately-powerful modern PC is a hundred times more power than they really need, and they're already 10-50 times faster than they "need" to be even before optimizing them, so barring pathological network performance that would trash any system, they're essentially soft real time already. You don't back your way into hard real time, though.


>I've got many systems I've written that are just straight-up accidentally soft-real-time in practice simply because running them on even a moderately-powerful modern PC is a hundred times more power than they really need, and they're already 10-50 times faster than they "need" to be even before optimizing them, so barring pathological network performance that would trash any system, they're essentially soft real time already.

But the categorization of real-time depends on the requirements of the problem, not how difficult it is to meet them. If your problem tolerates a few missed deadlines every now and then, and your hardware is such that a naive implementation misses deadlines so frequently that it's useless, so you need to optimize it thoroughly to get the misses under control, it's still soft real-time. It doesn't only become soft real-time once technology advances enough that even an inefficient implementation misses a deadline only occasionally.


When targeting hard real time you want strong worst-case guarantees, not weak probabilistic arguments like "we hope the hashes are distributed well enough to achieve O(1) in practice".

The malicious version of this is hash DoS, where an attacker chooses the values inserted into the hash table so they have the same hash (or at least same bucket). Randomized hashing using a secure hash mitigates this somewhat, but the attack is still possible to some degree, especially if the hash secret is long lived.
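The collision-degradation effect is easy to demonstrate (Python here purely as an illustration, since the thread is about Nim): a key type whose hash is constant forces every insert to probe past, and compare against, all existing entries, so n inserts cost on the order of n²/2 equality checks instead of n.

```python
class Colliding:
    """Every instance hashes to the same bucket, simulating a hash-DoS
    payload. eq_calls counts how much work the dict does on collisions."""
    eq_calls = 0

    def __init__(self, v):
        self.v = v

    def __hash__(self):
        return 0  # all keys collide into one probe chain

    def __eq__(self, other):
        Colliding.eq_calls += 1
        return self.v == other.v


def insert_n(n):
    """Insert n distinct-but-colliding keys; return total comparisons."""
    Colliding.eq_calls = 0
    d = {}
    for i in range(n):
        d[Colliding(i)] = i
    return Colliding.eq_calls
```

With well-distributed hashes the comparison count grows linearly in n; here doubling n roughly quadruples it, which is exactly the degradation an attacker is after.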

Another issue is that inserting into hash tables is usually only amortized O(1), with the collection growing exponentially when it's full. So most insert operations are cheap, but occasionally you get an O(n) insertion. However it doesn't sound like this is what the article is talking about.
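The amortized-growth effect can be sketched with a doubling dynamic array; the same shape of analysis applies to hash table resizing. Total work stays under two element-moves per append, but one unlucky append triggers an O(n) copy, which is the kind of latency spike hard real-time can't absorb:

```python
def doubling_appends(n):
    """Simulate n appends into an array that doubles when full.
    Returns (total element moves, largest single-append cost)."""
    cap, size = 1, 0
    moves, worst = 0, 0
    for _ in range(n):
        if size == cap:          # full: copy everything to new storage
            moves += size
            worst = max(worst, size)
            cap *= 2
        size += 1
        moves += 1               # the append itself
    return moves, worst
```

For n = 1024 this gives 2047 total moves (under 2 per append, amortized O(1)) but a worst single append of 512 moves: cheap on average, O(n) in the worst case.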


While I agree with all you said, one can expand a bit on it. There are concrete max-search-depth expectations for well-randomizing hash functions, such as:

    https://www.sciencedirect.com/science/article/abs/pii/019667748790040X
which is notably logarithmic - not unlike a B-Tree.

When these expectations are exceeded, you can at least detect a DoS attack. If you wait until such depths are seen, you can activate a "more random" mitigation on the fly at about the same cost as the next resize/re-org/whatnot.

All you need to do is instrument your search to track the depth. There is an example of such a strategy in Nim at https://github.com/c-blake/adix for simple Robin Hood linear-probed tables, and a formula-based one from the above paper inside https://github.com/c-blake/suggest
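A toy sketch of the detect-then-mitigate idea (in Python for illustration; this is not adix's actual code): an open-addressing table whose insert counts probe depth and raises a flag once a fixed threshold is exceeded, at which point a real implementation would rehash with a keyed hash.

```python
class ProbeTable:
    """Toy linear-probing table that instruments probe depth.
    depth_limit is a fixed threshold here; a real implementation
    might scale it with log2(table size), per the paper above.
    Assumes the table is never filled completely."""

    def __init__(self, cap=16, depth_limit=8):
        self.slots = [None] * cap
        self.depth_limit = depth_limit
        self.suspicious = False   # set when a probe chain gets too long

    def insert(self, key, value):
        cap = len(self.slots)
        i = hash(key) % cap
        depth = 0
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % cap
            depth += 1
            if depth > self.depth_limit:
                # Possible hash flood: switch to a keyed/"more random"
                # hash at the next resize, as described above.
                self.suspicious = True
        self.slots[i] = (key, value)
        return depth
```

Well-spread keys never trip the flag; keys crafted to share a bucket (e.g. all multiples of the table size) push the probe depth past the limit and get detected.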

This is all more about the "attack" scenario than the "truly hard" real-time scenario, of course.


Having a better PRNG shouldn't help worst case should it?

And worst case is the only thing that matters in a hard real time system.


The thing is that in this scenario you need to consider the worst case access, not the average. So instead of O(1) accesses you're stuck with O(n).


I did some work on Nim's hash tables back in 2020, specifically with OrderedTable, comparable to a Python dict where insertion order is preserved. I stumbled on this table module in a roundabout way, via Nim's database module, db_sqlite. The db_sqlite module was much slower than Python for simple tests, and on investigation, I found that it didn't automatically handle prepared statement caching like Python's sqlite3 module. There were some other issues with db_sqlite, like blob handling and null handling, which led me to a different SQLite interface, tiny_sqlite. This was a big improvement, handling both nulls and blobs, and the developer was great to work with. But it also didn't support prepared statement caching. I filed an issue and he implemented it, using Nim's OrderedTable to simulate an LRU cache by adding a new prepared statement and deleting the oldest one if the cache was too big:

https://github.com/GULPF/tiny_sqlite/issues/3
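That evict-the-oldest scheme maps naturally onto Python's OrderedDict, where both ends of the insertion order are reachable in O(1). A minimal sketch (the names `StatementCache` and `prepare` are illustrative, not tiny_sqlite's API):

```python
from collections import OrderedDict

class StatementCache:
    """Bounded cache of prepared statements, evicting the least
    recently used entry. Illustrative sketch, not tiny_sqlite's code."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self.cache = OrderedDict()

    def get(self, sql, prepare):
        if sql in self.cache:
            self.cache.move_to_end(sql)      # mark as most recently used
            return self.cache[sql]
        stmt = prepare(sql)                  # cache miss: prepare it
        self.cache[sql] = stmt
        if len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)   # drop the oldest entry, O(1)
        return stmt
```

The point of contrast with the story below: in Python both `move_to_end` and `popitem` are O(1), whereas Nim's OrderedTable made the equivalent delete O(n).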

Performance was hugely improved. There was another LRUCache implementation I played with, and when using that for the statement cache, performance was 25% faster than OrderedTable. That didn't make much sense to me for a 100-entry hash table, so I started running some tests comparing LRUCache and OrderedTable. What I discovered is that OrderedTable delete operations created an entirely new copy of the table, minus the entry being deleted, on every delete. That seemed pretty crazy, especially since it was already showing up as performance problems in a 100-entry table.

The tiny_sqlite developer switched to LRUCache, and I did some work on the OrderedTable implementation to make deletes O(1) as expected with hash table operations:

https://github.com/nim-lang/Nim/pull/14995

After spending a lot of time on this, I finally gave up. The problems were:

- the JSON implementation used OrderedTables and never did deletes. JSON benchmark performance was rather sacred, so changing OrderedTables to be slightly slower/larger (I used a doubly-linked list) was not desirable, even if it changed delete performance from O(n) to O(1)

- the Nim compiler also used OrderedTables and never did deletes

- Nim tables allowed multiple values for the same key (I did help get that deprecated).

- alternatives were proposed by others that maintained insertion order until a delete occurred, but then it could become unordered. That made no sense to me.

The TLDR is, if you use Nim tables, don't use OrderedTable unless you can afford to make a copy of the table on every delete.

Current Nim OrderedTable delete code: https://github.com/nim-lang/Nim/blob/15bffc20ed8da26e68c88bb...

Issue for db_sqlite not handling nulls, blobs, statement cache: https://github.com/nim-lang/Nim/issues/13559




