Hacker News
When LIMIT 9 works but LIMIT 10 hangs (neon.tech)
289 points by gmac on May 31, 2023 | 80 comments



What I really like about this story: free and unfettered access to the standard/specs and the implementation, the ability to spin up a reproducer, to see all the moving parts with X-ray vision, to come to a solution and have said solution accepted upstream, all in one go, at your own pace, AND being able to share the whole story without needing permission - the power of Open. No working around bugs in whatever proprietary thing. Merveilleux! Excuse my French. #OpenFTW


Absolutely.

Once upon a time I encountered some small issue (not a bug) with Postgres. I was able to get to the very source of it in the C code, understand what's happening, and adjust my code accordingly. It took maybe an hour.

Upon doing that, I shuddered, as I remembered troubleshooting an issue (a real, confirmed bug) inside Oracle 9. That meant poking in the dark, trying to trigger a condition that made a server process crash, then trying to help Oracle engineers reproduce it. It took a week.


> I was able to get to the very source of it in the C code, understand what's happening, and adjust my code accordingly.

The ability to do this has saved my bacon on more than one occasion. It's why (outside of work I do for an employer) I won't develop using libraries/frameworks that I don't have the source code to.


Indeed. Adding a significant closed-source component to the tech stack would require a ton of strong and unique advantages to justify it. Very, very few things pass this test.


Yup. Ages ago, I had quite a bit of trouble getting Cisco to admit a problem with a router. It shouldn't matter where the traffic is coming from, and if anything it would have been easier to figure out if I could write code to try to pick the problem out of the firehose. Nope, I had to demonstrate it with no code at all before they would look. Finally got enough workstations running dir . /s > nul to cause it to mess up. (I knew it was load related.) And HP wasn't even interested in my file that plotted upside-down. (I suspect even if they had been willing to look they wouldn't have wanted to fix it--there was probably enough buggy hardware out there that they would have taken the position that the behavior was correct.)


I originally thought this would be a story about query optimizers and their stupidity in the face of the banal. But it was infinitely better! I’ve never had the chance to decrypt TLS with Wireshark, so I’m bookmarking this for future reference.

I’m a little surprised at the statement that ~10%+ performance on a transport protocol isn’t worth it. 1 million frames doesn’t seem like that much, and some workloads see these things scaled out as front-end servers mostly managing stuff like websockets / JSON / UTF nonsense. I’d argue that if it’s deep in the kernel of execution and well abstracted in the higher-level logic, a cryptic bit fiddle with a comment, perhaps even with its more common equivalent in the comment, might be worthwhile.

That leads me to the ultimate question: what was the Pokémon table used for?


> I’m a little surprised at the statement that ~10%+ performance on a transport protocol isn’t worth it.

Well, depending on which contrast we're making, the performance difference on that one line is either 10ms (by which bit-twiddling is faster than Uint8Array.writeUInt16BE) or 100ms (by which Uint8Array.writeUInt16BE is faster than DataView.setUint16) per million frames. I don't think that's going to be anything like 10% of the total CPU time per frame. Unless I've misunderstood your point here?
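For anyone following along, the three approaches in question look roughly like this (a quick sketch for illustration, not the benchmark code from the post):

    import { Buffer } from "node:buffer";

    const frame = Buffer.alloc(4);
    const payloadLength = 1234;

    // DataView.setUint16 (big-endian by default), anchored to this Buffer's slice
    new DataView(frame.buffer, frame.byteOffset, frame.length).setUint16(2, payloadLength);

    // Buffer#writeUInt16BE
    frame.writeUInt16BE(payloadLength, 2);

    // Plain bit-twiddling
    frame[2] = payloadLength >>> 8;
    frame[3] = payloadLength & 0xff;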

> That leads me to the ultimate question: what was the Pokémon table used for?

You know what, I just assumed it was a test data table, but I never asked!


I guess my time doing low-latency work and Rust programming makes me see those numbers and want to shave that time off, since it’s inside a protocol which you expect to be called millions of times in tight succession, potentially dominating the machine’s performance on a front-end node doing nothing but shuttling data. There will definitely be a lot of other surrounding stuff contributing - but a march of a million miles starts with a single step ;-) Certainly in a higher-level business-logic world I could see not squeezing every CPU cycle and optimizing for clarity of reading, but at very low levels of tight loops (at least in my world) time spent here is time not spent on stuff the developer cares about :-)


Yeah, I don't really understand why they didn't just comment out the high level function call (option 3) and inline option 4 below. Then they could have clarity and performance.


Sure, that was also an option. Feel free to submit another PR to the maintainers!


What is being talked about is a single 16-bit write inside a byte buffer. The benchmark delta is 10ns per request (90ns for bit-twiddling vs 100ns for writeUInt16BE).

90ns/100ns is a lot for this kind of operation, but that comes from JS' handling of numbers and not a badly optimised routine.


I thought it might have to do with "order by random()"

In a declarative language, random() would get executed once, so it'd be like "order by 4" (nod to xkcd)


This is ultimately a story about a WebSockets bug and a classic JS footgun in `new DataView()`. I'm the post author: AMA.


It seems like NodeJS' Buffer.buffer (->ArrayBuffer) contains a footgun as well, in that you can't rely on the ArrayBuffer layout being identical to what the user of Buffer sees.
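For example (a quick sketch; the exact numbers depend on Node's pooling behaviour):

    import { Buffer } from "node:buffer";

    const buf = Buffer.allocUnsafe(16);   // small allocations are carved out of a shared pool
    console.log(buf.length);              // 16
    console.log(buf.buffer.byteLength);   // typically 8192 (Buffer.poolSize), i.e. the whole pool
    console.log(buf.byteOffset);          // often non-zero: where this slice starts in the pool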


I think the source of the problem is the Uint8Array (or other typed array) constructor. It has the wrong level of abstraction.

Instead of Uint8Array taking an ArrayBuffer plus an offset and a length, there should be something like an ArrayBufferView that takes the offset and length. Then Uint8Array would just be constructed with the view, and uint8Array.buffer would just return the view. Nothing like forgetting the offset or length could happen again.

Besides this, the missing intermediate level is also a security issue: it means there isn't a way to safely pass a partial view of an ArrayBuffer to another function, because typedArray.buffer always gives you the full ArrayBuffer.


Wait, what?

  order by random()
Does that even make sense? Why would anyone ever need this? Better to order by a stored value and randomize the presentation order, IMO.

Seems completely dependent on when random() gets evaluated. If it gets evaluated at every inspection of the row during the sort, I'm hardly amazed that it hangs sometimes.


The `ORDER BY RANDOM() LIMIT n` idiom is fairly well established for taking a random sample of uniformly distributed rows. It is also easily augmented to take weighted samples: `WHERE weight > 0 ORDER BY -LN(RANDOM())/weight LIMIT n`.

Queries of this form are easily parallelized. Most systems that support distributed query processing, for example, optimize ORDER/LIMIT n queries such that each worker returns just the top n rows from its partition. These top rows are then merged in a central worker and limited one last time to produce the final result.

Even so, if your system supports the TABLESAMPLE clause and offers a formulation that lets you specify your actual distribution of interest (TABLESAMPLE SYSTEM often does not), you're probably better off using it.


For anyone who is curious, the weighted sampling logic is the SQL translation of Algorithm A from Efraimidis and Spirakis's 2006 paper [1]:

    Algorithm A
    Input: A population V of weighted items
    Output: A WRS of size n
    1: For each v[i] in V, u[i] = random(0, 1) and k[i] = u[i]^(1/w[i])
    2: Select the n items with the largest keys k[i] as a WRS
Instead of the more direct translation of `ORDER BY POW(RANDOM(), (1/weight)) DESC`, we use the convenient fact that the logarithm is a monotone transform over the positive reals to take the log of the ORDER BY argument without changing the sort order:

       ORDER BY POW(RANDOM(), (1/weight)) DESC
    => ORDER BY LN(POW(RANDOM(), (1/weight))) DESC  -- LN preserves order.
    => ORDER BY LN(RANDOM()) / weight DESC          -- LN(x^a) = LN(x) * a.
    => ORDER BY -LN(RANDOM()) / weight              -- Order by `x DESC` is same as by `-x`.
   
[1] Pavlos S Efraimidis and Paul G Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006


Thanks for this cool bit of information.


You'd use order by random() when you want to get a random sample of rows in a table. Without that, you'd end up with rows clustered according to some internal value, probably ordered by the primary key.

Of course, this is a bad idea if you have a table with lots of rows, and you'd be better off rolling an N-sided die M times, where N is the number of rows in the table.


But ORDER BY random() will scan the whole table and assign values.

This is the plan I get:

    Limit  (cost=56.39..56.41 rows=10 width=40)
      ->  Sort  (cost=56.39..59.79 rows=1360 width=40)
            Sort Key: (random())
            ->  Seq Scan on test  (cost=0.00..27.00 rows=1360 width=40)
One should use "FROM table TABLESAMPLE SYSTEM (size)", this is the plan:

    Sample Scan on test  (cost=0.00..5.36 rows=136 width=32)
      Sampling: system ('10'::real)
You can also use a different sampling method than "SYSTEM", and you can make it reproducible with a seed.

https://www.postgresql.org/docs/current/sql-select.html#SQL-...

ORDER BY random() is a really unoptimized, bad practice for getting a random sample in PostgreSQL. If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".


Note that `FROM table TABLESAMPLE SYSTEM (n/N)` is a safe substitution for `ORDER BY random() LIMIT n` only when you don't need a sample that's guaranteed to be from a uniform distribution of rows. That's because TABLESAMPLE SYSTEM samples blocks, not individual rows:

> The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned.


> If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".

What if I want a different random order tomorrow?


> > If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".

> What if I want a different random order tomorrow?

If your data is large and you want different samples, you can always "select * from table where pk > gen_random_uuid() order by pk limit 10".

You have a very small chance of getting ffffffff-... and getting zero elements.

This is not perfect, I admit, but ORDER BY random() is one of the most wasteful things.

If getting random samples is very important, the best you can do is migrate to Postgres and use TABLESAMPLE.


Then you sort descending of course /s


order by stored uuidv4 will get you the same rows the next time you run it. If you sample using where random() < 0.001 (or something like that) you'll get different rows each time.


> But ORDER BY random() will scan the whole table and assign values.

Er, no, it will assign values to the result set, which may not involve a full table scan.


Good point, but in the OP's case, with no condition, it will do a full table scan.

Also, any case of "WHERE filter ORDER BY random() LIMIT n" means "I want a tiny sample from a broad filter". So it might not be a full table scan, but it will still assign values to a lot of rows.


> Good point, but in the OP's case, with no condition, it will do a full table scan.

And that has nothing to do with the ORDER BY random().


Yes it does, because of LIMIT N. An unordered result set will stop after the first N matches.


Now that is correct.


> If you use another SQL database

Just a side note: TABLESAMPLE is part of the SQL standard, and e.g. Oracle supports this as well.


Agreed. (And thanks; never came across this before... ever. in 20 years)

I guess some projects need to randomly select a few rows from a table? I was thinking of the wikipedia random topic, but that's only one-at-a-time, not a set of records.


It does make sense. It gets you a non-repeating sample of the table.

It is probably not very efficient, but semantically it is fairly straightforward. In fact I thought that "hang" was going to mean "took so long I thought it was hung", but that turned out not to be the issue here.


Yeah. It's odd that you got down-voted for this comment. "ORDER BY random() LIMIT 10" is known to be slow on large tables. The correct way to get a random sample is to use "SELECT ... FROM table TABLESAMPLE SYSTEM (10)" https://www.postgresql.org/docs/current/sql-select.html#SQL-...

You can even use a repeatable seed with this method.


Note that TABLESAMPLE SYSTEM is not guaranteed to give you a uniform sample, as it samples blocks, whereas ORDER BY RANDOM() LIMIT samples uniformly distributed rows. Some database systems support TABLESAMPLE RESERVOIR to guarantee uniform sampling behavior.


The hanging query has nothing at all to do with the random ordering. RTFA


Surprised no one has pointed it out, but the queries involve Pokémon. According to a quick Google search there are a grand total of ~800 different Pokémon. You aren't wrong, but a table scan over 800 rows isn't exactly worth worrying about.


1008 numbered Pokémon as of the latest game (excluding the forthcoming DLC), although the true count is possibly implementation dependent - at the very least some Pokémon have different 'forms' which can have different stats, and so will be stronger/faster/healthier.

The world of Pokémon can be crazily complicated if you let it ;-).


What interests me is why the developers originally thought that interposing a DataView() was a speed improvement, and why the benchmark results at https://github.com/nodejs/performance/issues/2 are at odds with the author's benchmarks in the headlined article.


That link seems to be doing a bunch of writes on the same buffer, but in the blog it's just doing a single write on each buffer.


Rediscovering memory unsafety through performance chasing. Ah well.


> That 16-bit payload length is actually written somewhere else, as bytes 3 and 4 of Buffer.allocUnsafe()’s global pool. These bytes may well be part of another active buffer, and thus we also stand a good chance of corrupting data elsewhere.

No kidding.


Indeed!

> the use of DataView here was an optimization that, though intended to speed things up, actually slowed things down … and broke them utterly in the process.


Seriously. Any design that encourages aliasing of mutable memory like this is just asking for trouble. This is a lesson that the functional programming community has already internalized, but the rest of the world sadly has not.


Reading about that, and the JS runtime just handing out a view into a shared mutable buffer, floored me.

That’s so clearly dangerous from both a bug and a security perspective.
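To make the aliasing concrete, it's something like this (a sketch; whether two particular allocations actually share a pool chunk depends on allocation order and sizes):

    import { Buffer } from "node:buffer";

    const a = Buffer.allocUnsafe(8).fill(0);  // both typically carved from the same pool
    const b = Buffer.allocUnsafe(8).fill(0);

    // An unanchored DataView sees the pool from byte 0, not buffer `a`.
    const view = new DataView(a.buffer);

    // Intended for a[2..3], but it actually lands at pool bytes 2..3 -- which may
    // belong to a different, earlier allocation entirely.
    view.setUint16(2, 0xbeef);
    console.log(a, b);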


First thing I do when I join a group that has code that writes and reads bytes from the wire is write really big stress tests and keep fixing things till it has 100% reliability: short messages, long messages, slow messages, pipelined messages, etc. There are so many bugs where the code fails if len > N, len % N == 0, len < small N, len < what is read, etc. Like, does this mean the driver had never worked for len > 126 or whatever?
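Something like sweeping the lengths where the wire format changes representation (a toy sketch - the encoder below follows RFC 6455's 7/16/64-bit length forms, but it's mine, not undici's):

    import { Buffer } from "node:buffer";

    // Toy length-prefix encoder following the WebSocket rules, just so the sweep
    // below has something to exercise.
    function encodeLength(len: number): Buffer {
      if (len < 126) return Buffer.from([len]);
      if (len < 65536) {
        const b = Buffer.alloc(3);
        b[0] = 126;
        b.writeUInt16BE(len, 1);
        return b;
      }
      const b = Buffer.alloc(9);
      b[0] = 127;
      b.writeBigUInt64BE(BigInt(len), 1);
      return b;
    }

    function decodeLength(b: Buffer): number {
      if (b[0] < 126) return b[0];
      if (b[0] === 126) return b.readUInt16BE(1);
      return Number(b.readBigUInt64BE(1));
    }

    // The boundaries are exactly where "LIMIT 9 works but LIMIT 10 hangs"-style bugs hide.
    for (const len of [0, 1, 125, 126, 127, 65534, 65535, 65536, 65537]) {
      const out = decodeLength(encodeLength(len));
      if (out !== len) throw new Error(`length ${len} round-tripped as ${out}`);
    }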


Yeah: interestingly, they had a test for the biggest category of frame, but not for the two other categories: https://github.com/nodejs/undici/blob/main/test/websocket/se...

The test I contributed is very specific to the frame fix I made, but I should probably go back and contribute more tests in send.js that test other lengths too.


Don't forget the perennial classic: len > N && len % N == 1

Due to an off-by-one error, the last byte of a message is never sent if the message is larger than the buffer or the packet size (or not sent until the next message arrives, for stream protocols).
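In TypeScript-ish pseudocode it looks something like this (hypothetical sendAll, not any particular library):

    const BUF_SIZE = 1024;

    function sendAll(socket: { write(chunk: Uint8Array): void }, data: Uint8Array): void {
      let offset = 0;
      // Buggy condition: should be `offset < data.length`. With the `- 1`, the final
      // byte is silently dropped whenever it lands alone in the last chunk, i.e.
      // data.length > BUF_SIZE && data.length % BUF_SIZE === 1.
      while (offset < data.length - 1) {
        const end = Math.min(offset + BUF_SIZE, data.length);
        socket.write(data.subarray(offset, end));
        offset = end;
      }
    }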


Yeah, I had a djb net parser my (work) friends wrote that would end up dropping the connection if the last byte read was the byte before the separating comma. https://en.wikipedia.org/wiki/Netstring

It only happened with very large database reads, so that the 1440-byte segments hit that spot. A new manager started mocking me during team meetings about why I was so obsessed with these load tests succeeding, and then was amazed at the 2-character fix (adding +1 to the read-till-length call).


This. And profile guided fuzz testing. Ideally every basis path that serializes/deserializes should be tested


Oh hell I think I encountered almost this same bug a few weeks ago, but in python code. A postgres query suddenly stopped working and while fiddling with it I figured out it was only when over the network (local psql worked), and only when the query was long enough. The "fix" I settled on without figuring out the root cause was to change an "AS foobar" to "AS foo" in the SELECT, reducing the total query size.


For the performance footnote...

I think the misunderstanding comes from the fact that calling .write() on sockets is pretty expensive. Specifically, if you write a byte or two, then the OS might send a whole packet out onto the network with just those two bytes.

If you did that for every tiny WebSocket message, then you would be sending double the number of packets necessary, with matching CPU overhead. Double the chance of network losses causing head-of-line blocking too.

So... Calling .write() should be done for large buffers and as rarely as possible.

However, in this case, .writeUInt16BE() doesn't send anything onto the network, but presumably the original author has been burned by .write() being expensive before...
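The usual mitigations look something like this (a generic sketch, not what undici actually does):

    import { Socket } from "node:net";
    import { Buffer } from "node:buffer";

    // Batch header + payload into a single write so they can go out as one packet.
    function sendFrame(socket: Socket, header: Buffer, payload: Buffer): void {
      socket.write(Buffer.concat([header, payload]));
    }

    // Or cork the stream so several small writes get flushed together.
    function sendFrameCorked(socket: Socket, header: Buffer, payload: Buffer): void {
      socket.cork();
      socket.write(header);
      socket.write(payload);
      process.nextTick(() => socket.uncork());
    }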


I think the author expected that creating a u16 view over a buffer up front and then writing to it would perform better than writing one-off to a single pair as a u16.

If they're gonna optimize further I think they may as well go all the way through. This'll be pretty readable in the context of a binary serializer codebase

    b[2] = value >>> 8;   // high byte first (big-endian)
    b[3] = value & 0xff;  // low byte


Sounds like you've re-invented Nagle's algorithm.


I don't get this whole format of going slowly through one person's rambling.

Some software which converts some TCP connections to WebSocket didn't work because it didn't do the WebSocket framing (which is just prepending the size of each message) correctly.

The bug in question wasn't even well elaborated on as most of the article is spent on looking at what might be going wrong, even though basic unit testing of the framing for its various cases (WebSocket has different framing for different payload sizes) would have triggered the bug immediately.


DataView is JIT optimized by both V8 and JSC. Buffer’s write* functions are not, and therefore are slower. The faster but even more complicated approach would likely be to create a single DataView per ArrayBuffer and then set the byteOffset accordingly. Internally, Bun’s Buffer implementation does something like this (it was faster than a C++ implementation as well as a bit-shifting in JS implementation)
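Roughly the shape of that approach (an illustration of the idea, not Bun's actual code):

    import { Buffer } from "node:buffer";

    // One cached DataView per backing ArrayBuffer, reused for every write.
    const views = new WeakMap<ArrayBufferLike, DataView>();

    function viewFor(buf: Buffer): DataView {
      let dv = views.get(buf.buffer);
      if (dv === undefined) {
        dv = new DataView(buf.buffer);  // spans the whole pool; offsets are applied per call
        views.set(buf.buffer, dv);
      }
      return dv;
    }

    function writeUInt16BE(buf: Buffer, value: number, offset: number): void {
      // buf.byteOffset keeps the write inside this Buffer's slice of the pool.
      viewFor(buf).setUint16(buf.byteOffset + offset, value, false);
    }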


The first thing I would have done is run the query on the database itself. If LIMIT 10 didn't encounter any issues, then that would immediately make me assume that the amount of data was causing problems on the web layer.

Then it's just a matter of tracing the response back from the call to the database backwards until I find the break.


You can take away learnings about WebSockets, about DataView, about buffers and allocation, but…

To me, the root cause is “optional arguments are evil”. I don’t blame the undici people; I blame the API design of DataView, which makes it easier for the dev not to think about these critical parameters than to think about them.

Optional arguments are much more often than not the wrong API.


Why do we seem to like torturing ourselves?..


Neon's open source version is too "unstable" for me, though.


How do you run it?


I use the docker-compose provided via GitHub.


I'm surprised there was no mention of "query planner" in the article. That would be the first thing I would check instead of assuming the driver had a bug.


A database performance issue was ruled out earlier in the process:

> Node 17 and below worked, Node 18 and above didn’t.

Given that the same queries worked fine with a different client, that would point to a client issue rather than a server issue.


If it was the query planner, then I suppose the protocol wouldn't have been the way to make it fail.


Sometimes when you think you only changed one variable during a time span, multiple variables may have changed.


What I read in the story is this twice:

> SELECT * FROM pokemon LIMIT [copy code button]


Neon CEO here. Happy to answer questions about the serverless driver and beyond.


How do the serverless driver and connection pooling impact pricing? Will they keep instances active and hence running for the whole month?


No, the compute shuts down after 5 minutes of inactivity, regardless of active connections.


AWS released Aurora I/O-Optimized, which, for a higher storage cost, no longer charges for I/O (quoted 40% savings).

Is such a mode being considered for Neon? If not, will there be future adjustments in pricing?


I see you support timescaledb extension.

How does that work? Is a "chunk" in timescaledb split into "pages" and just streamed from S3?


Neon works at the storage level: https://neon.tech/docs/introduction/architecture-overview

So all extensions including Timescale work. Postgres does the heavy lifting


Does this affect ACID guarantees at all? Could Postgres ever expect to read back a page with currently-changing data from the Pageserver before those changes get propagated there? Will the Pageserver cluster block I/O if it’s waiting on a page to reach a consistent state?


Neon is as consistent as Postgres. B/c it's Postgres. All we did was swap the storage.


tldr?


This is the first time I’ve ever seen anyone describe themselves as a “typescript developer”. Though, I do recall hearing people call themselves Java developers in the past. I guess I’m more used to terms like senior developer, backend developer, web developer, etc. rather than a particular language developer.


Even worse, I’m a typescript consultant


A web/frontend developer is always a typescript developer if the job you are applying for is looking for someone with typescript experience.



