More

aphyr · 2025-01-10T17:46:23 1736531183

The fundamental definition of consensus already requires that some proposal will eventually (given sufficient communicating, non-faulty, non-malicious nodes) win, so that doesn't seem particularly novel: https://lamport.azurewebsites.net/pubs/lower-bound.pdf

The core problem here that it's impossible for any replica to tell whether its speculatively-executed operations were legal or not; it might have to admit "sorry, I lied to you", and back up at any point. That seems to be what they mean by "partial progress". You can guess that things might happen, but you will often be wrong.

This idea's been around for a while--Fekete et al outlined speculative execution at local replicas with eventual convergence on a Serializable history in their 1996 PODC paper "Eventually Serializable Data Services": https://groups.csail.mit.edu/tds/papers/Lynch/podc96-esds.pd.... These systems have essentially two tiers of operations: weak operations, which the system might re-order (which could cause totally different results, including turning successes to failures and vice-versa!), and strong operations, which are Serializable. I assume Cassandra does the same. Assuming it's correct, the strong operations work like any other consensus system: they must block (or abort) when there isn't sufficient communication with other replicas. The weak ones might give arbitrarily weird results. In CAP's terms, the strong operations are CP, the weak ones are AP.

You see similar dynamics at play in probabilistic consensus systems, like Bitcoin, by the way. In Bitcoin technically all operations are weak ones, but the probability of non-Serializable outcomes should decrease quickly over time.

Having a consensus system that merges conflicting proposals is a nice idea, but I don't think Cassandra is novel here either. I don't have a citation handy, but I recall a conversation with Heidi Howard at HPTS (maybe 2017?) where she explained that one of the advantages of leaderless Paxos is that when you're building a replicated state machine, you can treat what would normally be conflicting proposals from multiple replicas as sets of transitions. Instead of rejecting all but one proposal, you can union them, then apply them to the state machine in some (deterministic) order--say, by lexicographically sorting the proposals.

aphyr · 2024-11-13T05:08:32 1731474512

Kafka actually does call these transactions! However (and this is a loooong discussion I can't really dig into right now) there's sort of two ways to look at "exactly once". One is in the sense that DB transactions are "exactly once"; a transaction's effects shouldn't be duplicated or lost. But in another sense "exactly once" is a sort of dataflow graph property that relates messages across topic-partitions. That's a little more akin to ACID "consistency".

You can use transactions to get to that dataflow property, in the same sort of way that Serializable transaction systems guarantee certain kinds of domain-level consistency. For example, Serializability guarantees that any invariant preserved by a set of transactions, considered purely in isolation, is also preserved by concurrent histories of those transactions. I think you can argue Kafka intends to reach "exactly-once semantics" through transactions in that way.

aphyr · 2024-11-12T23:29:35 1731454175

Both are true, but we use "transactions" for clarity, since the semantics of consumers outside transactions is even murkier. Every read in this workload takes place in the context of a transaction, and goes through the transactional offset commit path.

zbentley · 2024-11-13T02:22:23 1731464543

Ah, got it; I was assuming that “transactions” was referring to the transactions mentioned as the subject of the previous sentence, not the transactions active in consumers observing those. My mistake!

aphyr · 2024-11-12T22:57:44 1731452264

Nope. You'll find a full list of analyses here: https://jepsen.io/analyses

aphyr · 2024-11-12T16:47:42 1731430062

I have not yet, though you're not the first to ask. Some folks have suggested it might be... how do you say... fun? :-)

speedgoose · 2024-11-12T18:18:02 1731435482

If you are looking for fun targets, may I suggest KubeMQ too? Its author claims that it’s better than Kafka, Redis and RabbitMQ. It’s also "kubernetes native" but the open source version refuses to start if it detects kubernetes.

SahAssar · 2024-11-12T22:04:31 1731449071

> It’s also "kubernetes native" but the open source version refuses to start if it detects kubernetes.

I thought you were kidding, but this is crazy. https://github.com/kubemq-io/kubemq-community/issues/32

And it seems like you cannot even see pricing without signing up or contacting their sales: https://kubemq.io/product-pricing/

mathfailure · 2024-11-12T22:50:29 1731451829

This is just pure gold of an anecdote :-)

aphyr · 2024-11-12T16:42:13 1731429733

I would love to do a Kafka analysis. :-)

diggan · 2024-11-12T17:36:41 1731433001

Not that I'm a Kafka user, but I greatly appreciate your posts, so thank you :)

Maybe Kafka users should do a crowdfund for it if the companies aren't willing. Realistically, what would the goal of the crowdfund have to be for you to consider it?

jwr · 2024-11-12T17:33:43 1731432823

I'm still hoping Apple (or Snowflake) will pay you to do an analysis of FoundationDB…

tptacek · 2024-11-12T20:53:22 1731444802

I do too, but doesn't FDB already do a lot of the same kind of testing?

kasey_junk · 2024-11-12T21:03:17 1731445397

They are famous for doing simulation testing. https://antithesis.com/ Have recently brought to market a simulation testing product.

SahAssar · 2024-11-12T21:11:24 1731445884

I think they do similar testing, and therefore it might be even more interesting to read what Kyle thinks of their different approaches to it.

jwr · 2024-11-12T22:25:10 1731450310

Yes. But going through Jepsen and surviving is different. Gives an entirely new reputation to a database.

kasey_junk · 2024-11-12T22:37:56 1731451076

I don’t think ayphr would disagree with me when I say that FDB’s testing regime is the gold standard and Jepsen is trying to get there, not the other way around.

aphyr · 2024-11-12T22:55:53 1731452153

I'm not sure. I've worked on a few projects now which employed simulation testing and passed, only to discover serious bugs using Jepsen. State space exploration and oracle design are hard problems, and I'm not convinced there's a single, ideal path for DB testing that subsumes all others. I prefer more of a "complete breakfast" approach.

On another axis: Jepsen isn't "trying to get there [to FDB's testing]" because Jepsen and FDB's tests are solving different problems. Jepsen exists to test arbitrary, third-party databases without their cooperation, or even access to the source. FoundationDB's test suite is designed to test FoundationDB, and they have political and engineering buy-in to design the database from the ground up to cooperate with a deterministic (and, I suspect, protocol-aware) simulation framework.

To some extent Antithesis may be able to bridge the gap by rendering arbitrary distributed binaries deterministic. Something I'd like to explore!

tptacek · 2024-11-13T04:11:07 1731471067

This is a super interesting distinction and I'm glad I wrote my superficial drive-by comment about FDB's testing to prompt you to make it. :)

kasey_junk · 2024-11-12T22:58:46 1731452326

Has your opinion changed on that in the last few years? I could have sworn you were on record as saying this about foundation in the past but I couldn’t find it in my links.

aphyr · 2024-11-12T23:09:43 1731452983

I don't think so, but I've said a lot about databases in the last fifteen years haha.

Sometimes I look at what people say about FDB and it feels like... folks are putting words in my mouth that I don't recognize. I was very impressed by a short phone conversation with their engineers ~12 years ago. That's good, but that's not, like, a substantive experimental evaluation. That's "I focus my unpaid efforts on databases which seem more likely to yield fun, interesting results".

jwr · 2024-11-13T13:55:08 1731506108

So, will we get an evaluation of FDB one day? Pretty please? :-)

aphyr · 2024-11-13T14:27:33 1731508053

Apple is positively swimming in money! They could pay me! (Hi, Apple ;-))

kasey_junk · 2024-11-12T23:13:13 1731453193

Fair enough.

DylanSp · 2024-11-13T00:29:30 1731457770

Looks like it was an offhand tweet from 2013: https://web.archive.org/web/20220805112242/https://twitter.c.... I got that from a comment on the first Antithesis post on HN, https://news.ycombinator.com/item?id=39376195.

EdwardDiego · 2024-11-13T02:49:21 1731466161

Hey mate, think we interacted briefly on the Confluent Slack while you were working on this, something about outstanding TXes potentially interfering with consumption in the same process IIRC?

This isn't the first time you've discussed how parlous the Kafka tx spec is - not that that's even really a spec as such. I think this came up in your Redpanda analysis.

(And totally agree with you btw, some of the worst ever customer Kafka issues I dealt with at RH involved transactions.)

So was wondering what your ideal spec would look like, because I'd be interested in trying to capture the tx semantics in something like TLA+ as a learning experience - and because it would only help FOSS Kafka and FOSS clients improve, especially now that Confluent has withdrawn so much from Apache Kafka development.

aphyr · 2024-11-13T05:01:30 1731474090

I'm not really sure how to answer this question, but even a few chapters worth of clear prose would go a long way. We lay out a bunch of questions in the discussion section that would be really helpful in firming up intended txn semantics.

EdwardDiego · 2024-11-14T02:19:50 1731550790

Cheers, good place for me to start digging :)

monksy · 2024-11-12T19:04:46 1731438286

I would love to read your Kafka analysis

aphyr · 2024-11-12T16:06:11 1731427571

Ack, pardon me. That should be fixed now!

aphyr · 2024-11-12T15:47:05 1731426425

It is a little surprising, and I agree, the docs here are not doing a particularly good job of explaining it. It might help to ask: if you don't explicitly commit, how does Kafka know when you've processed the messages it gave you? It doesn't! It assumes any message it hands you is instantaneously processed.

Auto-commit is a bit like handing someone an ice cream cone, then immediately walking away and assuming they ate it. Sometimes people drop their ice cream immediately after you hand it to them, and never get a bite.

justinsaccount · 2024-11-12T16:19:31 1731428371

Weird, I would have guessed that it auto commits the previous batch when it polls for the next batch, meaning it would be like

  loop:
    messages = poll() # poll returns new messages and commits previous batch
    process(messages)

but it sounds like it "poll returns new messages and immediately commits them."

williamdclt · 2024-11-12T16:33:43 1731429223

Information on the internet about this seems unreliable, confusing and contradictory... It's crazy for something so critical, especially when it's enabled by default.

dangoodmanUT · 2024-11-12T15:50:17 1731426617

This, it has no idea that you processed the message. It assumes processing is successful by default which is cosmically stupid.

frant-hartm · 2024-11-13T10:22:43 1731493363

> how does Kafka know when you've processed the messages it gave you?

By calling `poll()` again. It doesn't commit the records returned from poll until auto commit interval expires AND you call poll again.

At least this is what the javadoc says quite clearly: https://kafka.apache.org/39/javadoc/org/apache/kafka/clients...

Note: Using automatic offset commits can also give you "at-least-once" delivery, but the requirement is that you must consume all data returned from each call to poll(Duration) before any subsequent calls, or before closing the consumer.

E.g. the following commits every 10s - on each call to `poll`, it doesn't automagically commit every 5 s.

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
            Thread.sleep(10_000);
        }

frant-hartm · 2024-11-13T10:57:00 1731495420

Just a note: I am not claiming it is working correctly, only saying there is a clear and documented way how the client knows when to commit, and that it works as expected in a simple scenario.

williamdclt · 2024-11-12T16:10:42 1731427842

> if you don't explicitly commit, how does Kafka know when you've processed the messages it gave you?

I did expect that auto-commit still involved an explicit commit. I expected that it meant that the consumer side would commit _after_ processing a message/batch _if_ it had been >= autocommit_interval since the last commit. In other words, that it was a functionality baked into the Kafka client library (which does know when a message has been processed by the application). I don't know if it really makes sense, I never really thought hard about it before!

I'm still a bit skeptical... I'm pretty sure (although not positive) that I've seen consumers with autocommit being stuck because of timeouts that were much greater than the autocommit interval, and yet retrying the same message in a loop

aphyr · 2024-11-12T16:35:42 1731429342

Here's a good article from New Relic on the problem, if you'd like more detail: https://newrelic.com/blog/best-practices/kafka-consumer-conf...

Or here, you can reproduce it yourself using the Bufstream or Redpanda/Kafka test suite. Here's a real quick run I just dashed off. You can watch it skip over writes: https://gist.github.com/aphyr/1af2c4eef9aacde7f08f1582304908...

lein run test --enable-auto-commit --bin bufstream-0.1.3-rc.12 --time-limit 30 --txn --final-time-limit 1/10000

jakewins · 2024-11-12T20:10:51 1731442251

Auto commit has always seemed super shady. Manual commit I have assumed is safe though - something something vector clocks - and it’d be really interesting to know if that trust is misplaced.

What is the process and cost for having you do a Jepsen test for something like that?

aphyr · 2024-11-12T22:13:35 1731449615

You'll find lots about the Jepsen analysis process here: https://jepsen.io/services/analysis

aphyr · 2024-11-13T16:52:19 1731516739

I (and apparently the Confluent docs?) may be wrong about this. I've added an update to the report.

aphyr · 2024-08-08T20:15:35 1723148135

For this test, you can do it on pretty much any reasonable Linux machine. Longer histories can churn through more CPU and RAM--some of the more aggressive tests I ran for this work involved 20 GB heaps and 50 cores--but you can tune all that lower.

aphyr · 2024-08-08T16:50:07 1723135807

> strict serializability doesn't imply idempotency

I think we're probably getting at the same thing, but I do want to clarify a bit. A Strict Serializable history, like a Serializable one, requires equivalence to a total order of transactions. That's clearly not true for etcd+jetcd: no possible order of transactions can allow (e.g.) a transaction to read from its own future. It's totally fine to submit non-idempotent transactions against a Serializable system: systems which actually provide Serializable will execute known-committed transactions exactly once.

Plenty of other databases pass this test; etcd+jetcd does not. This system is simply not Serializable.

mjb · 2024-08-08T16:55:37 1723136137

Maybe what I should have said is "you can't just retry transactions against a strict serializable database and expect to still get strict serializability (from the applications's perspective)". This is true of distributed system APIs more generally, too.

aphyr · 2024-08-08T17:05:14 1723136714

Yeah, that's a good way of phrasing it! :-)