Hacker News
When LIMIT 9 works but LIMIT 10 hangs (neon.tech)
289 points by gmac on May 31, 2023 | 80 comments



What I really like about this story: free and unfettered access to the standard/specs and the implementation, the ability to spin up a reproducer, to see all the moving parts with X-ray vision, to come to a solution and have said solution accepted upstream, all in one go, at your own pace, AND being able to share the whole story without needing permission - the power of Open. No working around bugs in whatever proprietary thing. Merveilleux! Excuse my French. #OpenFTW


Absolutely.

Once upon a time I encountered some small issue (not a bug) with Postgres. I was able to get to the very source of it in the C code, understand what's happening, and adjust my code accordingly. It took maybe an hour.

Upon doing that, I shuddered, as I remembered troubleshooting an issue (a real, confirmed bug) inside Oracle 9. That meant poking in the dark, trying to trigger a condition that made a server process crash, then trying to help Oracle engineers reproduce it. It took a week.


> I was able to get to the very source of it in the C code, understand what's happening, and adjust my code accordingly.

The ability to do this has saved my bacon on more than one occasion. It's why (outside of work I do for an employer) I won't develop using libraries/frameworks that I don't have the source code to.


Indeed. Adding a significant closed-source component to the tech stack would require a ton of strong and unique advantages to justify it. Very, very few things pass this test.


Yup. Ages ago, I had quite a bit of trouble getting Cisco to admit a problem with a router. It shouldn't matter where the traffic is coming from, and if anything it would have been easier to figure out if I could write code to try to pick the problem out of the firehose. Nope, I had to demonstrate it with no code at all before they would look. Finally got enough workstations running dir . /s > nul to cause it to mess up. (I knew it was load related.) And HP wasn't even interested in my file that plotted upside-down. (I suspect even if they had been willing to look they wouldn't have wanted to fix it--there was probably enough buggy hardware out there that they would have taken the position that the behavior was correct.)


I originally thought this would be a story about query optimizers and their stupidity in the face of the banal. But it was infinitely better! I’ve never had the chance to decrypt TLS with Wireshark, so I’m bookmarking this for future reference.

I’m a little surprised at the statement that ~10%+ performance on a transport protocol isn’t worth it. 1 million frames doesn’t seem like that much, and some workloads see these things scaled out as front-end servers mostly managing stuff like websockets / JSON / UTF nonsense. I’d argue that if it’s deep in the kernel of execution and well abstracted in the higher-level logic, a cryptic bit fiddle with a comment, perhaps even with its more common equivalent in the comment, might be worthwhile.

That leads me to the ultimate question: what was the Pokémon table used for?


> I’m a little surprised at the statement that ~10%+ performance on a transport protocol isn’t worth it.

Well, depending on which contrast we're making, the performance difference on that one line is either 10ms (by which bit-twiddling is faster than Uint8Array.writeUInt16BE) or 100ms (by which Uint8Array.writeUInt16BE is faster than DataView.setUint16) per million frames. I don't think that's going to be anything like 10% of the total CPU time per frame. Unless I've misunderstood your point here?
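For anyone following along, the three approaches in question look roughly like this (a quick sketch for illustration, not the benchmark code from the post):

    import { Buffer } from "node:buffer";

    const frame = Buffer.alloc(4);
    const payloadLength = 1234;

    // DataView.setUint16 (big-endian by default), anchored to this Buffer's slice
    new DataView(frame.buffer, frame.byteOffset, frame.length).setUint16(2, payloadLength);

    // Buffer#writeUInt16BE
    frame.writeUInt16BE(payloadLength, 2);

    // Plain bit-twiddling
    frame[2] = payloadLength >>> 8;
    frame[3] = payloadLength & 0xff;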

> That leads me to the ultimate question: what was the Pokémon table used for?

You know what, I just assumed it was a test data table, but I never asked!


I guess my time doing low-latency work and Rust programming makes me see those numbers and want to shave that time off, since it’s inside a protocol which you expect to be called millions of times in tight succession, potentially dominating the machine’s performance on a front-end node doing nothing but shuttling data. There will definitely be a lot of other surrounding stuff contributing - but a march of a million miles starts with a single step ;-) Certainly in a higher-level business-logic world I could see not squeezing every CPU cycle and optimizing for clarity of reading, but at very low levels of tight loops (at least in my world) time spent here is time not spent on stuff the developer cares about :-)


Yeah, I don't really understand why they didn't just comment out the high level function call (option 3) and inline option 4 below. Then they could have clarity and performance.


Sure, that was also an option. Feel free to submit another PR to the maintainers!


What is being talked about is a single 16-bit write inside a byte buffer. The benchmark delta is 10ns per request (90ns for bit-twiddling vs 100ns for writeUInt16BE).

90ns/100ns is a lot for this kind of operation, but that comes from JS' handling of numbers and not a badly optimised routine.


I thought it might have to do with "order by random()"

In a declarative language, random() would get executed once, so it'd be like "order by 4" (nod to xkcd)


This is ultimately a story about a WebSockets bug and a classic JS footgun in `new DataView()`. I'm the post author: AMA.


It seems like NodeJS' Buffer.buffer (->ArrayBuffer) contains a footgun as well, in that you can't rely on the ArrayBuffer layout being identical to what the user of Buffer sees.
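For example (a quick sketch; the exact numbers depend on Node's pooling behaviour):

    import { Buffer } from "node:buffer";

    const buf = Buffer.allocUnsafe(16);   // small allocations are carved out of a shared pool
    console.log(buf.length);              // 16
    console.log(buf.buffer.byteLength);   // typically 8192 (Buffer.poolSize), i.e. the whole pool
    console.log(buf.byteOffset);          // often non-zero: where this slice starts in the pool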


I think the source of the problem is the Uint8Array (or other typed array) constructor. It has the wrong level of abstraction.

Instead of Uint8Array taking an ArrayBuffer plus an offset and a length, there should be something like an ArrayBufferView that takes the offset and length. Then Uint8Array would just be constructed with the view, and uint8Array.buffer would just return the view. Nothing like forgetting the offset or length could happen again.

Besides this, the missing intermediate level is also a security issue: it means there isn't a way to safely pass a partial view of an ArrayBuffer to another function, because typedArray.buffer always gives you the full ArrayBuffer.


Wait, what?

  order by random()
Does that even make sense? Why would anyone ever need this? Better to order by a stored value and randomize the presentation order, IMO.

Seems completely dependent on when random() gets evaluated. If it gets evaluated at every inspection of the row during the sort, I'm hardly amazed that it hangs sometimes.


The `ORDER BY RANDOM() LIMIT n` idiom is fairly well established for taking a random sample of uniformly distributed rows. It is also easily augmented to take weighted samples: `WHERE weight > 0 ORDER BY -LN(RANDOM())/weight LIMIT n`.

Queries of this form are easily parallelized. Most systems that support distributed query processing, for example, optimize ORDER/LIMIT n queries such that each worker returns just the top n rows from its partition. These top rows are then merged in a central worker and limited one last time to produce the final result.

Even so, if your system supports the TABLESAMPLE clause and offers a formulation that lets you specify your actual distribution of interest (TABLESAMPLE SYSTEM often does not), you're probably better off using it.


For anyone who is curious, the weighted sampling logic is the SQL translation of Algorithm A from Efraimidis and Spirakis's 2006 paper [1]:

    Algorithm A
    Input: A population V of weighted items
    Output: A WRS of size n
    1: For each v[i] in V, u[i] = random(0, 1) and k[i] = u[i]^(1/w[i])
    2: Select the n items with the largest keys k[i] as a WRS
Instead of the more direct translation of `ORDER BY POW(RANDOM(), (1/weight)) DESC`, we use the convenient fact that the logarithm is a monotone transform over the positive reals to take the log of the ORDER BY argument without changing the sort order:

       ORDER BY POW(RANDOM(), (1/weight)) DESC
    => ORDER BY LN(POW(RANDOM(), (1/weight))) DESC  -- LN preserves order.
    => ORDER BY LN(RANDOM()) / weight DESC          -- LN(x^a) = LN(x) * a.
    => ORDER BY -LN(RANDOM()) / weight              -- Order by `x DESC` is same as by `-x`.
   
[1] Pavlos S Efraimidis and Paul G Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006


Thanks for this cool bit of information.


You'd use order by random() when you want to get a random sample of rows in a table. Without that, you'd end up with rows clustered according to some internal value, probably ordered by the primary key.

Of course, this is a bad idea if you have a table with lots of rows, and you'd be better off rolling an N-sided die M times, where N is the number of rows in the table.


But ORDER BY random() will scan the whole table and assign values.

This is the plan I get:

    Limit  (cost=56.39..56.41 rows=10 width=40)
      ->  Sort  (cost=56.39..59.79 rows=1360 width=40)
            Sort Key: (random())
            ->  Seq Scan on test  (cost=0.00..27.00 rows=1360 width=40)
One should use "FROM table TABLESAMPLE SYSTEM (size)", this is the plan:

    Sample Scan on test  (cost=0.00..5.36 rows=136 width=32)
      Sampling: system ('10'::real)
You can also use a different sampling method than "SYSTEM", and you can make it reproducible with a seed.

https://www.postgresql.org/docs/current/sql-select.html#SQL-...

ORDER BY random() is a really unoptimized, bad practice for getting a random sample in PostgreSQL. If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".


Note that `FROM table TABLESAMPLE SYSTEM (n/N)` is a safe substitution for `ORDER BY random() LIMIT n` only when you don't need a sample that's guaranteed to be from a uniform distribution of rows. That's because TABLESAMPLE SYSTEM samples blocks, not individual rows:

> The SYSTEM method does block-level sampling with each block having the specified chance of being selected; all rows in each selected block are returned.


> If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".

What if I want a different random order tomorrow?


> > If you use another SQL database, the best thing to do is to use UUIDv4 as primary key, and use "ORDER BY pk LIMIT size".

> What if I want a different random order tomorrow?

If your data is large and you want different samples, you can always "select * from table where pk > gen_random_uuid() order by pk limit 10".

You have a very small chance of getting ffffffff-... and getting zero elements.

This is not perfect, I admit, but ORDER BY random() is one of the most wasteful things.

If getting random samples is very important, the best you can do is migrate to Postgres and use TABLESAMPLE.


Then you sort descending of course /s


order by stored uuidv4 will get you the same rows the next time you run it. If you sample using where random() < 0.001 (or something like that) you'll get different rows each time.


> But ORDER BY random() will scan the whole table and assign values.

Er, no, it will assign values to the result set, which may not involve a full table scan.


Good point, but in the OP's case, with no condition, it will do a full table scan.

Also, any case of "WHERE filter ORDER BY random() LIMIT n" means "I want a tiny sample from a broad filter". So it might not be a full table scan, but it will still assign values to a lot of rows.


> Good point, but in the OP's case, with no condition, it will do a full table scan.

And that has nothing to do with the ORDER BY random().


Yes it does, because of LIMIT N. An unordered result set will stop after the first N matches.


Now that is correct.


> If you use another SQL database

Just a side note: TABLESAMPLE is part of the SQL standard, and e.g. Oracle supports this as well.


Agreed. (And thanks; never came across this before... ever. in 20 years)

I guess some projects need to randomly select a few rows from a table? I was thinking of the wikipedia random topic, but that's only one-at-a-time, not a set of records.


It does make sense. It gets you a non-repeating sample of the table.

It is probably not very efficient, but semantically it is fairly straightforward. In fact I thought that "hang" was going to mean "took so long I thought it was hung", but that turned out not to be the issue here.


Yeah. It's odd that you got down-voted for this comment. "ORDER BY random() LIMIT 10" is known to be slow on large tables. The correct way to get a random sample is to use "SELECT ... FROM table TABLESAMPLE SYSTEM (10)" https://www.postgresql.org/docs/current/sql-select.html#SQL-...

You can even use a repeatable seed with this method.


Note that TABLESAMPLE SYSTEM is not guaranteed to give you a uniform sample, as it samples blocks, whereas ORDER BY RANDOM() LIMIT samples uniformly distributed rows. Some database systems support TABLESAMPLE RESERVOIR to guarantee uniform sampling behavior.


The hanging query has nothing at all to do with the random ordering. RTFA


Surprised no one has pointed it out, but the queries involve Pokémon. According to a quick Google search there are a grand total of ~800 different Pokémon. You aren't wrong, but a table scan over 800 rows isn't exactly worth worrying about.


1008 numbered Pokémon as of the latest game (excluding the forthcoming DLC), although the true count is possibly implementation dependent - at the very least some Pokémon have different 'forms' which can have different stats, and so will be stronger/faster/healthier.

The world of Pokémon can be crazily complicated if you let it ;-).


What interests me is why the developers originally thought that interposing a DataView() was a speed improvement, and why the benchmark results at https://github.com/nodejs/performance/issues/2 are at odds with the author's benchmarks in the headlined article.


That link seems to be doing a bunch of writes on the same buffer, but in the blog it's just doing a single write on each buffer.


Rediscovering memory unsafety through performance chasing. Ah well.


> That 16-bit payload length is actually written somewhere else, as bytes 3 and 4 of Buffer.allocUnsafe()’s global pool. These bytes may well be part of another active buffer, and thus we also stand a good chance of corrupting data elsewhere.

No kidding.


Indeed!

> the use of DataView here was an optimization that, though intended to speed things up, actually slowed things down … and broke them utterly in the process.


Seriously. Any design that encourages aliasing of mutable memory like this is just asking for trouble. This is a lesson that the functional programming community has already internalized, but the rest of the world sadly has not.


Reading about that, and the JS runtime just handing out a view into a shared mutable buffer, floored me.

That’s so clearly dangerous from both a bug and a security perspective.
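To make the aliasing concrete, it's something like this (a sketch; whether two particular allocations actually share a pool chunk depends on allocation order and sizes):

    import { Buffer } from "node:buffer";

    const a = Buffer.allocUnsafe(8).fill(0);  // both typically carved from the same pool
    const b = Buffer.allocUnsafe(8).fill(0);

    // An unanchored DataView sees the pool from byte 0, not buffer `a`.
    const view = new DataView(a.buffer);

    // Intended for a[2..3], but it actually lands at pool bytes 2..3 -- which may
    // belong to a different, earlier allocation entirely.
    view.setUint16(2, 0xbeef);
    console.log(a, b);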


First thing I do when I join a group that has code that writes and reads bytes from the wire is write really big stress tests and keep fixing things till it has 100% reliability: short messages, long messages, slow messages, pipelined messages, etc. There are so many bugs where the code fails if len > N, len % N == 0, len < small N, len < what is read, etc. Like, does this mean the driver had never worked for len > 126 or whatever?
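Something like sweeping the lengths where the wire format changes representation (a toy sketch - the encoder below follows RFC 6455's 7/16/64-bit length forms, but it's mine, not undici's):

    import { Buffer } from "node:buffer";

    // Toy length-prefix encoder following the WebSocket rules, just so the sweep
    // below has something to exercise.
    function encodeLength(len: number): Buffer {
      if (len < 126) return Buffer.from([len]);
      if (len < 65536) {
        const b = Buffer.alloc(3);
        b[0] = 126;
        b.writeUInt16BE(len, 1);
        return b;
      }
      const b = Buffer.alloc(9);
      b[0] = 127;
      b.writeBigUInt64BE(BigInt(len), 1);
      return b;
    }

    function decodeLength(b: Buffer): number {
      if (b[0] < 126) return b[0];
      if (b[0] === 126) return b.readUInt16BE(1);
      return Number(b.readBigUInt64BE(1));
    }

    // The boundaries are exactly where "LIMIT 9 works but LIMIT 10 hangs"-style bugs hide.
    for (const len of [0, 1, 125, 126, 127, 65534, 65535, 65536, 65537]) {
      const out = decodeLength(encodeLength(len));
      if (out !== len) throw new Error(`length ${len} round-tripped as ${out}`);
    }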


Yeah: interestingly, they had a test for the biggest category of frame, but not for the two other categories: https://github.com/nodejs/undici/blob/main/test/websocket/se...

The test I contributed is very specific to the frame fix I made, but I should probably go back and contribute more tests in send.js that test other lengths too.


Don't forget the perennial classic: len > N && len % N == 1

Due to an off-by-one error, the last byte of a message is never sent if the message is larger than the buffer or the packet size (or not sent until the next message arrives, for stream protocols).
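In TypeScript-ish pseudocode it looks something like this (hypothetical sendAll, not any particular library):

    const BUF_SIZE = 1024;

    function sendAll(socket: { write(chunk: Uint8Array): void }, data: Uint8Array): void {
      let offset = 0;
      // Buggy condition: should be `offset < data.length`. With the `- 1`, the final
      // byte is silently dropped whenever it lands alone in the last chunk, i.e.
      // data.length > BUF_SIZE && data.length % BUF_SIZE === 1.
      while (offset < data.length - 1) {
        const end = Math.min(offset + BUF_SIZE, data.length);
        socket.write(data.subarray(offset, end));
        offset = end;
      }
    }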


Yeah, I had a djb net parser my (work) friends wrote that would end up dropping the connection if the last byte read was the byte before the separating comma. https://en.wikipedia.org/wiki/Netstring

It only happened with very large database reads, so that the 1440-byte segments hit that spot. A new manager started mocking me during team meetings about why I was so obsessed with these load tests succeeding, and then was amazed at the 2-character fix (adding +1 to the read-till-length call).


This. And profile guided fuzz testing. Ideally every basis path that serializes/deserializes should be tested


Oh hell I think I encountered almost this same bug a few weeks ago, but in python code. A postgres query suddenly stopped working and while fiddling with it I figured out it was only when over the network (local psql worked), and only when the query was long enough. The "fix" I settled on without figuring out the root cause was to change an "AS foobar" to "AS foo" in the SELECT, reducing the total query size.


For the performance footnote...

I think the misunderstanding comes from the fact that calling .write() on sockets is pretty expensive. Specifically, if you write a byte or two, then the OS might send a whole packet out onto the network with just those two bytes.

If you did that for every tiny WebSocket message, then you would be sending double the number of packets necessary, with matching CPU overhead. Double the chance of network losses causing head-of-line blocking too.

So... Calling .write() should be done for large buffers and as rarely as possible.

However, in this case, .writeUInt16BE() doesn't send anything onto the network, but presumably the original author has been burned by .write() being expensive before...
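The usual mitigations look something like this (a generic sketch, not what undici actually does):

    import { Socket } from "node:net";
    import { Buffer } from "node:buffer";

    // Batch header + payload into a single write so they can go out as one packet.
    function sendFrame(socket: Socket, header: Buffer, payload: Buffer): void {
      socket.write(Buffer.concat([header, payload]));
    }

    // Or cork the stream so several small writes get flushed together.
    function sendFrameCorked(socket: Socket, header: Buffer, payload: Buffer): void {
      socket.cork();
      socket.write(header);
      socket.write(payload);
      process.nextTick(() => socket.uncork());
    }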


I think the author expected that creating a u16 view over a buffer up front and then writing to it would perform better than writing one-off to a single pair as a u16.

If they're gonna optimize further I think they may as well go all the way through. This'll be pretty readable in the context of a binary serializer codebase

    b[2] = value >>> 8;   // high byte first (big-endian)
    b[3] = value & 0xff;  // low byte


Sounds like you've re-invented Nagle's algorithm.


I don't get this whole format of going slowly through one person's rambling.

Some software which converts some TCP connections to WebSocket didn't work because it didn't do the WebSocket framing (which is just prepending the size of each message) correctly.

The bug in question wasn't even well elaborated on as most of the article is spent on looking at what might be going wrong, even though basic unit testing of the framing for its various cases (WebSocket has different framing for different payload sizes) would have triggered the bug immediately.


DataView is JIT optimized by both V8 and JSC. Buffer’s write* functions are not, and therefore are slower. The faster but even more complicated approach would likely be to create a single DataView per ArrayBuffer and then set the byteOffset accordingly. Internally, Bun’s Buffer implementation does something like this (it was faster than a C++ implementation as well as a bit-shifting in JS implementation)
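Roughly the shape of that approach (an illustration of the idea, not Bun's actual code):

    import { Buffer } from "node:buffer";

    // One cached DataView per backing ArrayBuffer, reused for every write.
    const views = new WeakMap<ArrayBufferLike, DataView>();

    function viewFor(buf: Buffer): DataView {
      let dv = views.get(buf.buffer);
      if (dv === undefined) {
        dv = new DataView(buf.buffer);  // spans the whole pool; offsets are applied per call
        views.set(buf.buffer, dv);
      }
      return dv;
    }

    function writeUInt16BE(buf: Buffer, value: number, offset: number): void {
      // buf.byteOffset keeps the write inside this Buffer's slice of the pool.
      viewFor(buf).setUint16(buf.byteOffset + offset, value, false);
    }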


The first thing I would have done is run the query on the database itself. If LIMIT 10 didn't encounter any issues, then that would immediately make me assume that the amount of data was causing problems on the web layer.

Then it's just a matter of tracing the response back from the call to the database backwards until I find the break.


You can take away learnings about WebSockets, about DataView, about buffers and allocation, but…

To me, the root cause is “optional arguments are evil”. I don’t blame the undici people; I blame the API design of DataView, which makes it easier for the dev not to think about these critical parameters than to think about them.

Optional arguments are much more often than not the wrong API.


Why do we seem to like torturing ourselves?..


Neon's open source version is too "unstable" for me, though.


How do you run it?


I use the docker-compose provided via GitHub.


I'm surprised there was no mention of "query planner" in the article. That would be the first thing I would check instead of assuming the driver had a bug.


A database performance issue was ruled out earlier in the process:

> Node 17 and below worked, Node 18 and above didn’t.

Given that the same queries worked fine with a different client, that would point to a client issue rather than a server issue.


If it was the query planner, then I suppose the protocol wouldn't have been the way to make it fail.


Sometimes when you think you only changed one variable during a time span, multiple variables may have changed.


What I read in the story is this twice:

> SELECT * FROM pokemon LIMIT [copy code button]


Neon CEO here. Happy to answer questions about the serverless driver and beyond.


How do the serverless driver and connection pooling impact pricing? Will they keep instances active and hence running for the whole month?


No, the compute shuts down after 5 minutes of inactivity, regardless of active connections.


AWS released Aurora I/O-Optimized, which, for a higher storage cost, no longer charges for I/O (quoted 40% savings).

Is such a mode being considered for Neon? If not, will there be future adjustments in pricing?


I see you support timescaledb extension.

How does that work? Is a "chunk" in timescaledb split into "pages" and just streamed from S3?


Neon works at the storage level: https://neon.tech/docs/introduction/architecture-overview

So all extensions including Timescale work. Postgres does the heavy lifting


Does this affect ACID guarantees at all? Could Postgres ever expect to read back a page with currently-changing data from the Pageserver before those changes get propagated there? Will the Pageserver cluster block I/O if it’s waiting on a page to reach a consistent state?


Neon is as consistent as Postgres. B/c it's Postgres. All we did was swap the storage.


tldr?


This is the first time I’ve ever seen anyone describe themselves as a “typescript developer”. Though, I do recall hearing people call themselves Java developers in the past. I guess I’m more used to terms like senior developer, backend developer, web developer, etc. rather than a particular language developer.


Even worse, I’m a typescript consultant


A web/frontend developer is always a typescript developer if the job you are applying for is looking for someone with typescript experience.



