It's a great library, but it largely doesn't solve inter-process communication. This library looks like it is trying to make the inter-process/inter-thread choice a configuration change, which is a nice property to have.
What will the reader do while polling? If it spins/polls, that consumes CPU resources unnecessarily. And even if the reader polls frequently, the OS will still preempt the process eventually, and it will pay that price.
Typically it takes around 50 ns to 80 ns to access RAM via the north bridge. Most probably the CPU cache is skewing his measurements.
What can be done is to isolate part of the machine (CPU cores, interrupts) as much as possible and run the latency-sensitive workload on that part, spinning the CPU and using something like the timestamp counter (the rdtsc or rdtscp instruction) for timing. That wastes power and prevents the CPU cores from going to sleep, but it makes good latencies achievable.
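Roughly what that looks like from Java (a sketch with my own names and constants, nothing from this library): pin the reader to an isolated core outside the JVM (taskset/isolcpus), spin on a shared slot and timestamp on arrival. Java can't issue rdtsc directly without JNI, so System.nanoTime() stands in for the TSC here, and single-shot numbers will be noisy:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: a busy-spinning reader timing how long it takes to
    // observe a value written by another thread. Assumes the threads are pinned
    // to isolated cores externally (taskset/isolcpus).
    public class SpinLatencyDemo {
        public static void main(String[] args) throws InterruptedException {
            final AtomicLong publishedAt = new AtomicLong(0);

            Thread reader = new Thread(() -> {
                long seen = 0;
                while (true) {
                    long ts = publishedAt.get();          // poll the shared slot
                    if (ts != 0 && ts != seen) {
                        long latencyNs = System.nanoTime() - ts;
                        System.out.println("observed one-way latency ~" + latencyNs + " ns");
                        seen = ts;
                    }
                    Thread.onSpinWait();                  // PAUSE hint (Java 9+), stays on-core
                }
            });
            reader.setDaemon(true);
            reader.start();

            Thread.sleep(100);                            // let the reader settle into its spin loop
            for (int i = 0; i < 5; i++) {
                publishedAt.set(System.nanoTime());       // "send" a message: publish a timestamp
                Thread.sleep(10);
            }
        }
    }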
As for 50-80ns access, you are right. And I would guess this workload probably represents the best case scenario and is already in cache.
In low-latency use cases you do everything in your power to avoid RAM access. Preventing preemption by "wasting" CPU resources is a common technique.
Depending on the requirements and dataflows of the system, an even smarter event dispatcher can dynamically switch between busy waiting and normal event notification to reduce unnecessary power usage.
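Something along these lines, as a sketch (my own names, not this library's API): spin for a bounded number of tries, then yield, then fall back to parking with a timeout so an idle reader stops burning a core:

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    // Sketch of an adaptive wait strategy: escalate from busy waiting to
    // yielding to parking as the queue stays empty.
    public class AdaptiveWaiter {
        private static final int SPIN_TRIES  = 1_000;
        private static final int YIELD_TRIES = 100;

        private final AtomicBoolean dataReady = new AtomicBoolean(false);
        private volatile Thread waiter;

        /** Reader side: wait for the next message with escalating patience. */
        public void awaitData() {
            int spins = 0, yields = 0;
            waiter = Thread.currentThread();
            while (!dataReady.compareAndSet(true, false)) {
                if (spins < SPIN_TRIES) {
                    spins++;
                    Thread.onSpinWait();              // busy wait: lowest latency, 100% of a core
                } else if (yields < YIELD_TRIES) {
                    yields++;
                    Thread.yield();                   // give the scheduler a chance
                } else {
                    LockSupport.parkNanos(100_000);   // idle: timeout avoids a lost-wakeup hang
                }
            }
            waiter = null;
        }

        /** Writer side: publish and, if the reader may be parked, wake it. */
        public void signalData() {
            dataReady.set(true);
            Thread w = waiter;
            if (w != null) {
                LockSupport.unpark(w);
            }
        }
    }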
So ARMv7, you mean? This depends on the CPU's hardware memory model. I'm not absolutely sure, I didn't look that hard, but I think this doesn't implement the ARMv7 memory model correctly, so maybe it doesn't always function right. It might be possible for a message to appear committed while some cache lines containing the message bytes are still dirty.
This implementation is using sun.misc.Unsafe, after all.
Different ARM CPUs implement different memory models, so what applies to one ARM design might not apply to another. ARMv8 is probably the easiest to support thanks to its new load-acquire and store-release instructions.
Again, not sure without further analysis. And no time to do it.
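For what it's worth, on the JVM side the required ordering can be expressed without dropping to Unsafe: publish the commit flag with a store-release and read it with a load-acquire, so a reader can never see the flag before the message bytes. A sketch with VarHandle (Java 9+), not the library's actual code:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Sketch: release/acquire publication of a "commit" flag so a reader can
    // never observe commit == 1 while the message bytes are still in flight.
    // On ARMv8 these map to stlr/ldar; on older ARM the JIT must emit barriers.
    public class CommittedSlot {
        private static final VarHandle COMMIT;
        static {
            try {
                COMMIT = MethodHandles.lookup()
                        .findVarHandle(CommittedSlot.class, "commit", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private final byte[] payload = new byte[64];
        private int commit;  // 0 = empty, 1 = committed

        /** Writer: fill the payload, then publish with release semantics. */
        public void write(byte[] message) {
            System.arraycopy(message, 0, payload, 0, message.length);  // assumes it fits in 64 bytes
            COMMIT.setRelease(this, 1);   // store-release: payload writes ordered before the flag
        }

        /** Reader: only touch the payload after an acquire load sees the flag. */
        public byte[] tryRead() {
            if ((int) COMMIT.getAcquire(this) == 1) {   // load-acquire pairs with the release above
                return payload.clone();
            }
            return null;
        }
    }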
You should make sure you truly understand what the word "latency" means.
Your story is not unique[1]. The basic concept is: latency != 1/throughput.
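To make it concrete: if a writer batches 1,000 messages and the reader drains them all 100 microseconds later, the throughput figure says 10 million msg/s, but every one of those messages still sat there for up to 100 microseconds. High throughput, poor latency.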
I have no interest in your code. Just from your README, I am sure you have made mistakes.
Did you ask yourself a few questions before publishing such an eye-catching headline? Such as:
1. What is the cost of one CAS operation (and maybe of a volatile access)?
2. What is the timing accuracy in Java?
3. What is the latency of one main-memory access?
4. OK, and more pitfalls from Java, the OS and micro-benchmarking on top of that...
If you had a basic understanding of questions #1-3, you would not claim "20 ns is possible. The best I've gotten with a bare bones optimized test was about 16 ns" yourself.
Seriously. Saying things like that makes you look really bad - no one will take you seriously.
I got that 16 ns with cache-line-aligned, carefully tuned assembler. No false sharing. And it most definitely didn't use CAS, but FAA. I used the CPU's TSC timers.
That said, I'm pretty sure you can achieve close to that in Java as well. 20-40 ns latency between two CPU cores on same socket wouldn't surprise me at all. Throughput can be even higher, if you start to batch messages.
Anyways:
1) Cost of CAS depends on contention. About 15 ns in any case without contention.
2) Timing accuracy in Java... Well, not that I really use Java much, but I'd imagine it's exactly the same as in C++, if you can access the TSC somehow. Nothing prevents you from ping-ponging messages between two threads a few million times either -- that way even a low-resolution timer is more than enough (see the sketch after this list). Yes, 20-ish ns latency is doable.
3) Who cares about main-memory latency? The point is to communicate via cache-line sharing; going to main memory would defeat that.
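To spell out what I mean by ping-ponging (my own sketch, the numbers and names are mine): bounce a counter between two threads a few million times and divide the elapsed time by the number of hops, so even a coarse timer gives a solid average:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of a ping-pong latency test: two threads bounce a counter back
    // and forth; total time / hops gives the average one-way hand-off latency.
    public class PingPong {
        static final int ROUNDS = 5_000_000;
        static final AtomicLong ball = new AtomicLong(0);

        public static void main(String[] args) throws InterruptedException {
            Thread ponger = new Thread(() -> {
                long expect = 1;
                while (expect <= 2L * ROUNDS) {
                    while (ball.get() != expect) { Thread.onSpinWait(); }  // wait for ping
                    ball.set(expect + 1);                                  // pong back
                    expect += 2;
                }
            });
            ponger.start();

            long start = System.nanoTime();
            long expect = 0;
            for (int i = 0; i < ROUNDS; i++) {
                ball.set(expect + 1);                                      // ping
                while (ball.get() != expect + 2) { Thread.onSpinWait(); }  // wait for pong
                expect += 2;
            }
            long elapsed = System.nanoTime() - start;
            ponger.join();
            System.out.printf("avg one-way hop: %.1f ns%n", elapsed / (2.0 * ROUNDS));
        }
    }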
Hi vardump, my real point here is simple: latency != 1/throughput. Can you accept this viewpoint?
1. The context here is Java. Assembler has nothing to do with it. I assume you are the author of this library. If not, let me know.
2. I spent a little time looking at your perf source[1]. Tell us: what ping-pongs did you actually do?
3. Even if your micro-test threads busy-wait on their cores, the OS still has the right to schedule other threads onto them. Who guarantees you don't take a cache-line miss the next time your thread runs? Avoiding that takes relatively complex steps in Java, and it is obviously not shown anywhere in your code or notes.
4. Cache lines are not shared directly; they are kept coherent by MESI or its variants[2]. Your busy waiting causes a storm of coherence traffic (RFO requests), which is expensive in theory. There are techniques that relieve this pain, but that is another topic. You need to show these latencies, or at least explain why you ignore them when claiming nanosecond-level numbers.
5. If "15 ns in any case without contention" is right, does that leave 1 ns for everything else? The cost of the cache coherence above? The cost of function-call prologues/epilogues in HotSpot-JITed code? The overhead of volatile forbidding some JIT optimizations? OK, and shall we count GC as well?
6. Ping-pong is still not good enough for nanosecond-level latency measurement, because you have the "endpoint effect". At the very least you need to copy the message, and what is the latency of that memory copy? (Note again: latency != 1/throughput.)
7. Finally, your code just shows the throughput of batch writes/reads to and from memory via Java's mmap facilities. That is not point-to-point latency in any common sense of the term.
I'll just leave HN readers to make their own judgement: by the same logic, I could announce a picosecond message-passing library, because I have achieved TB-level L1 cache throughput on my Haswell core.
Nanosecond? I don't think so... Unless by having nanosecond latency you mean that it has a latency that is representable in nanoseconds... but that's not what it means to me. Tens or hundreds of microseconds I would buy, but not nanosecond.
If you busy poll, getting to 50 ns latency from producer to consumer is easy, 20 ns is possible. The best I've gotten with a bare bones optimized test was about 16 ns. You tend to get best results with high frequency dual core CPUs with hyper-threading disabled.
A bigger issue is that the reader needs to busy poll, consuming 100% CPU time on one core. Maybe energy consumption could be reduced by using monitor/mwait - not sure whether that's possible from user mode or only in the kernel.
Another issue in this implementation is the use of compare-and-swap. For this purpose, fetch-and-add (LOCK XADD on x86) would be more efficient; multiple-writer contention is much worse with CAS.
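Concretely, the difference I mean, as a sketch (not the library's actual code): claiming the next slot with getAndAdd maps to a single LOCK XADD that always succeeds, whereas a CAS loop makes losing writers retry under contention:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch contrasting two ways to claim the next record slot.
    public class SlotClaim {
        private final AtomicLong writeSequence = new AtomicLong(0);

        /** Fetch-and-add: one LOCK XADD on x86, succeeds exactly once per writer. */
        long claimWithFaa() {
            return writeSequence.getAndAdd(1);
        }

        /** Compare-and-swap loop: under contention, losing writers spin and retry. */
        long claimWithCas() {
            while (true) {
                long current = writeSequence.get();
                if (writeSequence.compareAndSet(current, current + 1)) {
                    return current;
                }
                // lost the race: reload and try again
            }
        }
    }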
Record size is also fixed; you can't have messages of different sizes.
Overhead per message is 12 bytes: two ints that can only ever be 0 or 1, plus an int called Metadata. Seeing that the commit and rollback fields only need a single byte each, 2 bytes of overhead for commit and rollback should have been enough. You're going to get false sharing whether it's an int or a byte anyway.
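Purely to illustrate the 2-byte idea (this is NOT the library's actual layout, and plain ByteBuffer accesses don't give you cross-thread ordering by themselves - you'd still want release/acquire publication like the VarHandle sketch above):

    import java.nio.ByteBuffer;

    // Illustration of a 2-byte record header: one byte for commit, one byte
    // for rollback, payload follows.
    public class CompactRecord {
        static final int COMMIT_OFFSET   = 0;
        static final int ROLLBACK_OFFSET = 1;
        static final int PAYLOAD_OFFSET  = 2;

        static void writeRecord(ByteBuffer buf, int recordStart, byte[] payload) {
            buf.position(recordStart + PAYLOAD_OFFSET);
            buf.put(payload);
            buf.put(recordStart + ROLLBACK_OFFSET, (byte) 0);
            buf.put(recordStart + COMMIT_OFFSET, (byte) 1);   // commit last
        }

        static boolean isCommitted(ByteBuffer buf, int recordStart) {
            return buf.get(recordStart + COMMIT_OFFSET) == 1;
        }
    }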
Just to throw it out there, I thought communication across 2 cores on different sockets is generally regarded as impossible in less than 100ns regardless of language? If you're staying in L3 on a single socket, I could believe it, maybe, but at a certain point you're tuning the benchmark rather than the application.
For low latency messaging like this you'd typically want to avoid cross socket data transfers. However, IIRC cacheline snooping across sockets is a bit quicker (Intel QPI at least) than remote dram access so if the lines stay in cache the communication may be under 100ns. Also, depending on access pattern, hardware prefetch may hide some of the latency.
> You better have serious throughput to justify that....
It is often about latency not just throughput. Although sometimes they go hand in hand. For example you can achieve pretty high throughput if you take the whole network stack outside the kernel and talk directly to the network card.
Do you have a particular reason to claim sub-microsecond IPC isn't possible? Because I've written/used systems that performed IPC measured in tens of nanoseconds, to the best of my ability to measure it.
High-speed polling and/or disabling interrupts is totally a valid solution, especially at throughputs high enough that there will always be messages waiting; in that case there is actually no reason to be interrupt-driven. No sleeping, no waiting, no waste.
Or your thread has something else it could be doing while messages are unavailable. It sounds like you haven't thought much about message passing at all and could probably be a little more humble.
I have written a number of message queues, none of which had anywhere near the number of events required to justify polling like this. All of the message queues I've written would have been using tons of extra CPU if I had used a polling solution...
I realize that there probably are use cases out there that justify this, but I have not been tasked with one...
For sure if you can avoid it. But sometimes you can't. In those cases you need inter-thread communication to be as fast as possible.
A reason inter-process communication is interesting, especially in Java, is that there is a reasonably common pattern of isolating low-latency code in one JVM and less sensitive code in another, to prevent system-wide pauses from bleeding across.
This can be super painful for development iteration, though, so one thing that is interesting about this library is that it seems to make that a configuration-time concern.
It may also be beneficial to move some processing to a different core (perhaps even a different socket) to avoid contention for execution resources (e.g. d/i-cache, execution ports, etc.).