Erasure Coding versus Tail Latency

ot · 2024-03-28T15:37:06 1711640226

It is worth noting that this does not come for free, and it would have been nice for the article to mention the trade-off: reconstruction is not cheap on CPU, if you use something like Reed-Solomon.

Usually the codes used for erasure coding are in systematic form: there are k "preferential" parts out of M that are just literal fragments of the original blob, so if you get those you can just concatenate them to get the original data. If you get any other k-subset, you need to perform expensive reconstruction.

mjb · 2024-03-28T15:42:39 1711640559

It's true that Reed-Solomon isn't free.

The first two codes (N of N+1 and N of N+2) are nearly trivial and can be done very fast indeed. On my hardware, the N of N+1 code (which is an XOR) can be arranged to be nearly as fast a memcpy (which obviously isn't free either). They can also be done in a streaming way which can save the memcpy if you're feeding a stream into a parser (e.g. JSON or something) or decryption.

> Usually the codes used for erasure coding are in systematic form: there are k "preferential" parts out of M that are just literal fragments of the original blob, so if you get those you can just concatenate them to get the original data.

Yeah, that's true. If you're CPU bound, it may be worth waiting a little longer for these 'diagonal' components to come back.

dragontamer · 2024-03-28T18:17:00 1711649820

Reed Solomon is closer to "perfect" but is unnecessary.

IIRC, Turbo Codes and LDPCs are less-perfect (they cannot offer strict guarantees like Reed-Solomon can), but as XOR-based simple operations, they are extremely extremely fast to implement.

LDPC has high-probabilities of fixing errors (near Reed-Solomon level), which is good enough in practice. Especially since LDPC's simple XOR-based operation is far faster and like O(n) instead of Reed-Solomon's matrix-multiplication (O(n^2)) algorithm.

The state of the art has moved forward. Reed Solomon is great for proving the practice and providing strict assurances (likely better for storage where you have strict size limits and need strong guarantees for MTBF or other such statistics). But for a "faster" algorithm (ie: trying to prevent repeated packets in a communication stream like TCP or similar protocol), LDPC and/or Turbo codes are likely a better solution.

-----

Reed Solomon is probably best for "smaller" codes where the matrix is smaller and O(n^2) hasn't gotten out of hand yet. But as codes increase in size, the O(n) "less than perfect" codes (such as Turbo codes or LDPC codes) become better-and-better ideas.

That being said: I can imagine some crazy GPU / SIMD algorithm where we have such cheap compute and low bandwidth where the O(n^2) operation might serve as a better basis than the cheap XOR operation. The future of computers is going to be more compute and less relative memory bandwidth after all, so the pendulum may swing the other way depending on how future machines end up.

aero_code · 2024-03-28T21:06:51 1711660011

I'm interested in using LDPC (or Turbo Codes) in software for error correction, but most of the resources I've found only cover soft-decision LDPC. When I've found LDPC papers, it's hard for me to know how efficient the algorithms are and whether it's worth spending time on them. Reed-Solomon has more learning resources that are often more approachable (plus open source libraries). Do you have more information on how to implement LDPC decoding using XOR-based operations?

dragontamer · 2024-03-28T21:38:17 1711661897

Alas, no. I'm mostly spitballing here since I know that XOR-based codes are obviously much faster than Reed-solomon erasure code / matrix solving methodology.

There's a lot of names thrown out for practical LDPC erasure codes. Raptor Codes, Tornado Codes, and the like. Hopefully those names can give you a good starting point?

EDIT: I also remember reading a paper on a LDPC Fountain Code (ex: keep sending data + LDPC checkbits until the other side got enough to reconstruct the data), as a kind of "might as well keep sending data while waiting for the ACK", kind of thing, which should cut down on latency.

--------

I'm personally on the "Finished reading my book on Reed-Solomon codes. Figuring out what to study next" phase. There's a lot of codes out there, and LDPC is a huge class...

Then again, the project at work that I had that benefited from these error (erm... erasure) correcting codes was complete and my Reed-solomon implementation was good enough and doesn't really need to be touched anymore. So its not like I have a real reason to study this stuff anymore. Just a little bit of extra data that allowed the protocol to cut off some latency and reduce the number of resends in a very noisy channel we had. The good ol' "MVP into shelved code" situation, lol. Enough code to prove it works, made a nice demo that impressed the higher-ups, and then no one ended up caring for the idea.

If I were to productize the concept, I'd research these more modern, faster codes (like LDPC, Raptor, Tornado, etc. etc.) and implement a state-of-the-art erasure correction solution, ya know? But at this point, the projects just dead.

But honestly, the blog-post's situation (cut down on latency with forward error correction) is seemingly a common problem that's solved again and again in our industry. But at the same time, there's so much to learn in the world of Comp. Sci that sometimes its important to "be lazy" and "learn it when its clearly going to be useful" (and not learning it to hypothetically improve a dead project, lol).

aero_code · 2024-03-30T01:13:01 1711761181

All right, I'll search for those and read about them.

It's unfortunate when a project I've worked on for a while is never used, but at least nice when I learned something from writing it.

Thanks!

ajb · 2024-03-28T23:08:52 1711667332

The search term you want for that is "binary erasure channel" + LDPC decoding .

aero_code · 2024-03-30T01:14:11 1711761251

I hadn't heard that term before. I'll search for it. Thanks!

nsguy · 2024-03-28T20:50:02 1711659002

I haven't heard about LDPCs before. Thanks!

Do they serve the same use case though? With Reed-Solomon the idea is to recover from complete loss of a fragment of data (erasure coding), isn't LPDC strictly for error correction/"noise" (e.g. certain bits flipping but the data overall exists)?

dragontamer · 2024-03-28T20:54:24 1711659264

I admit that I haven't thought it all the way through, but in general, all error-correction codes I'm aware of have a simpler erasure-code version available too.

Reed Solomon traditionally is an error-correction code, for example. But has common implementations in its simplified erasure-only code. (Ex: fixing "lost data" is far easier than fixing "contradictory data").

I'm fairly certain that LDPC erasure codes is as simple as "Is there only one missing erasure in this particular code??" and "answer is LDPC XOR (other data) == missing-data".

EDIT: The "hard part" is the exact composition of (other data), of which there's many styles and different methodologies with tons of different tradeoffs.

mjb · 2024-03-28T18:26:20 1711650380

That's true too, this approach isn't limited to Reed-Solomon (or MDS codes). For non-MDS codes the fetch logic becomes a little more complicated (you need to wait for a subset you can reconstruct from rather than just the first k), but that's not a huge increase in complexity.

pjdesno · 2024-03-28T16:52:46 1711644766

There are new Intel instructions (GFNI) which accelerate things a lot, as well as various hacks to make it go fast. See https://www.reddit.com/r/ceph/comments/17z1w08/but_is_my_cpu... for some quick and dirty benchmarks on jerasure, one of the EC plugins for Ceph, IIRC not using GFNI. (TLDR: 25GB/s on a Ryzen 7700X)

gizmo686 · 2024-03-28T23:38:22 1711669102

That is true if you want a "perfect" algorithm, that can provide arbitrary M-of-N guarantees. But, if you are a bit more flexible in your requirements you can get some very cheep reconstruction.

I worked on a system that uses a variant of parity packet encoding. Basic parity packet encoding is very simple. You divide you data into N blocks, then send the XOR of all the blocks as an extra packet. Both sender and receiver maintain a running XOR of packets. As soon as the Nth packet has been received, they immidietly reconstruct the N+1th packet without any additional work. This ammounts to 1 extra XOR operation per unit of data, which is a trivial amount of overhead in almost any workload.

Of course, the above scheme is limited to N/N+1 recovery (and is probably as good as you can do for that particular use case).

However, it has a fairly simple extension to N/N+M recovery. Arrange the data in an NxM grid, and construct M sets of "extra" packets". The first set is constructed row wise, (effectivly devolving into the above case). For the second set, rotate each of the columns by their column index. So if R(x,y) is a redundant packet, and D(x,y) is a data packet at location (x,y) in the grid, you would have

  * R(0,0) = D(0,0) ^ D(1,0) ^ D(2,0) ^ ... D(N,0)
  * R(0,1) = D(0,1) ^ D(1,1) ^ D(2,1) ^ ... D(N,1)
  ...
  * R(0,M-1) = D(0,M-1) ^ D(1,M-1) ^ D(2,M-1) ^ ... D(N,M-1)
  * R(1,0) = D(0,0) ^ D(1,1) ^ D(2,2) ^ ... D(N,N%M)
  * R(1,1) = D(0,1) ^ D(1,2) ^ D(2,3) ^ ... D(N,(N+1)%M)
  * R(1,M-1) = D(0,M-1) ^ D(1,0) ^ D(2,1) ^ ... D(N,(N+1)%M)
  ...
  * R(2,0) = D(0,0) ^ D(1,2) ^ D(2,4) ^ ... D(N,2N%M)
  * R(3,0) = D(0,0) ^ D(1,3) ^ D(2,6) ^ ... D(N,3N%M)

Your overhead is now M XOR operations per unit of real data, which is still trivial for reasonable values of M. The downside of this scheme is that if the first redundancy packet is not enough to reconstruct the dropped packet, you need to wait for the entire NxM table to be sent, which could cause a significant long-tail spike in latency if you are not careful. (The upside of this downside, is it provides even stronger burst protection that a traditional K-of-M erasure coding. If you get even more creative with how you group packets for the extra redundancy packets, you can get even stronger burst protection). The other downside is you end up being less space efficient than Reed-Solomon error correction.

Interestingly, the recovery algorithm I described is not optimal in the sense that there are times where it fails to recover data that is theoretically recoverable. Recovering data in all theoretically possible cases probably would be quite intensive.

dmw_ng · 2024-03-28T15:28:26 1711639706

Nice to see this idea written about in detail. I had thought about it in the context of terrible availability bargain bucket storage (iDrive E2), where the cost of (3,2) erasure coding an object and distributing each segment to one of 3 regions would still be dramatically lower than paying for more expensive and more reliable storage.

Say 1 chunk lives in Germany, Ireland and the US each. Client races GETs to all 3 regions and cancels the request to the slowest to respond (which may also be down). Final client latency is equivalent to that of the 2nd slowest region, with substantially better availability due to the ability to tolerate any single region being down

Still wouldn't recommend using E2 for anything important, but ^ was one potential approach to dealing with its terribleness. It still doesn't address the reality of when E2 regions go down, it is often for days and reportedly sometimes weeks at a time. So reliable writing in this scenario would necessitate some kind of queue with capacity for weeks of storage

There are variants of this scheme where you could potentially balance the horrible reliability storage with some expensive reliable storage as part of the same system, but I never got that far in thinking about how it would work

derefr · 2024-03-28T16:10:50 1711642250

Having not heard of E2 before (it never seems to come up in the comparisons the other object-storage providers do to make themselves look good) — do you know why it has such bad availability? "Weeks of downtime" sounds crazy for a business focused on storage.

If they aren't also failing at durability, then it wouldn't be any of the classical problems associated with running a storage cluster. Do they just... not bother with online maintenance / upgrades / hardware transitions?

dmw_ng · 2024-03-28T16:24:58 1711643098

My best guess is their primary product is a Windows (I think) backup client that has a place for smarts allowing them to paper over problems, or something along those lines. Feels like an "oh Germany is down again, when is Frank back from holiday so he can catch a plane to replace the switch?" type affair. If you google around the Reddit data hoarder communities you'll find plenty of war stories about E2

ddorian43 · 2024-03-28T15:53:55 1711641235

> Client races GETs to all 3 regions and cancels the request to the slowest to respond

Dropbox does this internally in magic pocket (see their eng blog)

sujayakar · 2024-03-28T16:00:47 1711641647

this is a really cool idea.

one followup I was thinking of is whether this can generalize to queries other than key value point lookups. if I'm understanding correctly, the article is suggesting to take a key value store, and for every `(key, value)` in the system, split `value` into fragments that are stored on different shards with some `k` of `M` code. then at query time, we can split a query for `key` into `k` subqueries that we send to the relevant shards and reassemble the query results into `value`.

so, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. any ideas?

ddorian43 · 2024-03-28T16:22:57 1711642977

There https://ydb.tech/ open source db that uses erasure coding for replication in single zone/region.

eivanov89 · 2024-03-28T16:41:24 1711644084

In YDB with block 4+2 erasure coding, you need half the disk space compared to mirror-3-dc schema. Meanwhile CPU usage is just a little bit higher, thus in high throughput tests mirror-3-dc wins. Indeed as mentioned in the post there might be a tail latency win in latency runs, but if your task is throughput with a reasonable latencies, replication might be a better choice.

pjdesno · 2024-03-28T16:56:18 1711644978

If you only care about throughput, just fetch the data and read should be the same speed as triple replicated.

For writing, triple-rep has to write 2x as much data or more, so it's going to be slower unless your CPUs are horribly slow compared to your drives.

ddorian43 · 2024-03-28T21:27:30 1711661250

I expect it to save a lot of CPU by only needing 1/3x of compactions. You might want to do a benchmark on that ;). An example is quickwit (building inverted indexes is very expensive).

ddorian43 · 2024-03-28T16:22:13 1711642933

> so, if we were to do the same business for an ordered map with range queries, we'd need to find a way to turn a query for `interval: [start, end]` into some number of subqueries that we could send to the different shards and reassemble into the final result. any ideas?

Dbs that are backed by s3-like-storage, the storage does this for you, but for blocks of, say, 1MB, and not per-kv (high overhead).

Think you use rocksdb in your db, and erasure-code the sstables.

loeg · 2024-03-28T15:49:33 1711640973

Yeah. And you get the storage for free if your distributed design also uses the erasure-encoded chunks for durability. Facebook's Warm Storage infrastructure does something very similar to what this article describes.

benlivengood · 2024-03-28T17:43:33 1711647813

The next level of efficiency is using nested erasure codes. The outer code can be across regions/zones/machines/disks while the inner code is across chunks of a stripe. Chunk unavailability is fast to correct with an extra outer chunk and bit rot or corruption can be fixed by the inner code without an extra fetch. In the fast path only data chunks need to be fetched.

dragontamer · 2024-03-28T18:20:49 1711650049

That's just LDPC but not as extreme.

If you're going to "nest" erasure codes, might as well make them XOR-based (fastest operation on modern CPUs and/or hardware), calculate a randomization scheme that has very-very high probability (99.9%+) of fixing errors, and other such benefits.

To provide assurances that you have enough LDPC, you then run LDPC on your LDPC-check bits. Then you run LDPC on your LDPC-LDPC-check bits. Then you run LDPC on your LPDC-LDPC-LDPC-check bits, until you've got the target probability / chance of fixing errors that you desire.

--------

LDPC's brilliance is that XOR-as-erasure-code is exceptionally fast, and this "repeated hierarchy" of error-correction codes leads to high-probabilities (99.9999%+) of successful correction of erasures.

siscia · 2024-03-28T15:23:56 1711639436

Nice to see this talked about here and Marc being public about it.

AWS is such a big place that even after a bit of tenure you still got place to look to find interesting technical approaches and when I was introduced to this schema for Lambda storage I was surprised.

As Marc mentions it is such a simple and powerful idea that is definitely not mentioned enough.

mjb · 2024-03-28T15:37:46 1711640266

If folks are interested in more details about Lambda's container storage scheme, they can check out our ATC'23 paper here: https://www.usenix.org/conference/atc23/presentation/brooker

pjdesno · 2024-03-28T17:09:02 1711645742

Note that the fast_read option on erasure-coded Ceph pools uses the same technique. I'm not sure which release it was in, but I see commits referencing it from 2015.

ghusbands · 2024-03-29T13:31:47 1711719107

The first graph is incredibly misleading. The text talks about fetching from 5 servers and needing 4 results vs fetching from 1 server and needing 1 result. Then the graph compares 4-of-5 to 4-of-4 latency, which is just meaningless. It should compare 4-of-5 with 1-of-1.

jeffbee · 2024-03-28T16:22:45 1711642965

I do not follow. How is it possible that the latency is lower in a 4-of-5 read of a coded stripe compared to a 1-of-4 replicated stripe?

benlivengood · 2024-03-28T17:37:55 1711647475

Full replication is probably always lower latency but costs N times as many copies that you want to store whereas erasure coding costs N/M. The cost of replicating write traffic is similar.

jeffbee · 2024-03-28T18:22:57 1711650177

Certainly, but there are other tradeoffs. You can write a byte to a replicated stripe, but encoded stripes either demand full-stripe writes or writing a byte is hilariously expensive. It's not clear in the blog if we are talking about sealed read-only data or what.

benlivengood · 2024-03-28T21:03:54 1711659834

Practically all modern storage has a minimum sector write size; now commonly 4096 bytes for HDD and megabytes for NAND. Most systems with fancy erasure codes address this by gathering multiple writes into single stripes and are effectively append-only filesystems with garbage collection to support snapshots.

A large number of RS erasure codes are over GF(2^8) which could allow individual byte updates to a stripe for e.g. Octane as storage. Generally, of course, most filesystems checksum at a block level and so byte-writes are rarely going to be implemented as such.

vlovich123 · 2024-03-28T21:37:40 1711661860

Small correction I think. The 4KiB for HDDs is fairly new and afaik most NAND is going to also be 4KiB with the exception of QLC which last I saw had a 64KiB minimum sector write size. If you have a source that NAND flash is using MiB minimum sector size, can you point me to that link?

toast0 · 2024-03-29T00:01:14 1711670474

4k sectors for hard drives has been around since roughly 2010, depending on your specific drive. 512e drives are very common; if you do a write that's not 4k aligned, the drive will turn it into a read / modify / write cycle.

ec109685 · 2024-03-28T16:42:52 1711644172

Aren’t they comparing 4 of 5 with 4 of 4?

jeffbee · 2024-03-28T17:19:24 1711646364

If so that doesn't make a great deal of sense because the alternative to a coded stripe isn't a unique unreplicated stripe, it's a replicated one. So the alternative would be to race multiple reads of all known replicas, or to hedge them.

ec109685 · 2024-03-28T18:58:51 1711652331

Good point. I guess it’s really 2 out of 2 versus 4 out of 6.