It's a great library, but it largely doesn't solve inter-process communication. This library looks like it is trying to make the inter-process/inter-thread choice a configuration change, which is a nice property to have.
What will the reader do while polling? If it spins/polls, that consumes CPU resources unnecessarily. And even if the reader polls frequently, the OS will still preempt the process eventually, and it will pay that price.
Typically it takes around 50 ns to 80 ns to access RAM via the north bridge. Most probably the CPU cache is skewing his measurements.
What can be done is to isolate part of the machine (CPU cores, interrupts) as much as possible and run the latency-sensitive workload on that part, spinning the CPU and using something like the timestamp counter (the rdtsc or rdtscp instruction) for timing. That wastes power and prevents the CPU cores from going to sleep, but it makes good latencies achievable.
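Roughly what that looks like from Java (a sketch with my own names and constants, nothing from this library): pin the reader to an isolated core outside the JVM (taskset/isolcpus), spin on a shared slot and timestamp on arrival. Java can't issue rdtsc directly without JNI, so System.nanoTime() stands in for the TSC here, and single-shot numbers will be noisy:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch: a busy-spinning reader timing how long it takes to
    // observe a value written by another thread. Assumes the threads are pinned
    // to isolated cores externally (taskset/isolcpus).
    public class SpinLatencyDemo {
        public static void main(String[] args) throws InterruptedException {
            final AtomicLong publishedAt = new AtomicLong(0);

            Thread reader = new Thread(() -> {
                long seen = 0;
                while (true) {
                    long ts = publishedAt.get();          // poll the shared slot
                    if (ts != 0 && ts != seen) {
                        long latencyNs = System.nanoTime() - ts;
                        System.out.println("observed one-way latency ~" + latencyNs + " ns");
                        seen = ts;
                    }
                    Thread.onSpinWait();                  // PAUSE hint (Java 9+), stays on-core
                }
            });
            reader.setDaemon(true);
            reader.start();

            Thread.sleep(100);                            // let the reader settle into its spin loop
            for (int i = 0; i < 5; i++) {
                publishedAt.set(System.nanoTime());       // "send" a message: publish a timestamp
                Thread.sleep(10);
            }
        }
    }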
As for 50-80ns access, you are right. And I would guess this workload probably represents the best case scenario and is already in cache.
In low-latency use cases you do everything in your power to avoid RAM access. Preventing preemption by "wasting" CPU resources is a common technique.
Depending on the requirements and dataflows of the system, an even smarter event dispatcher can dynamically switch between busy waiting and normal event notification to reduce unnecessary power usage.
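Something along these lines, as a sketch (my own names, not this library's API): spin for a bounded number of tries, then yield, then fall back to parking with a timeout so an idle reader stops burning a core:

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    // Sketch of an adaptive wait strategy: escalate from busy waiting to
    // yielding to parking as the queue stays empty.
    public class AdaptiveWaiter {
        private static final int SPIN_TRIES  = 1_000;
        private static final int YIELD_TRIES = 100;

        private final AtomicBoolean dataReady = new AtomicBoolean(false);
        private volatile Thread waiter;

        /** Reader side: wait for the next message with escalating patience. */
        public void awaitData() {
            int spins = 0, yields = 0;
            waiter = Thread.currentThread();
            while (!dataReady.compareAndSet(true, false)) {
                if (spins < SPIN_TRIES) {
                    spins++;
                    Thread.onSpinWait();              // busy wait: lowest latency, 100% of a core
                } else if (yields < YIELD_TRIES) {
                    yields++;
                    Thread.yield();                   // give the scheduler a chance
                } else {
                    LockSupport.parkNanos(100_000);   // idle: timeout avoids a lost-wakeup hang
                }
            }
            waiter = null;
        }

        /** Writer side: publish and, if the reader may be parked, wake it. */
        public void signalData() {
            dataReady.set(true);
            Thread w = waiter;
            if (w != null) {
                LockSupport.unpark(w);
            }
        }
    }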
So ARMv7, you mean? This depends on the CPU's hardware memory model. I'm not absolutely sure, I didn't look that hard, but I think this doesn't implement the ARMv7 memory model correctly, so maybe it doesn't always function right. It might be possible for a message to appear committed while some cache lines containing the message bytes are still dirty.
This implementation is using sun.misc.Unsafe, after all.
Different ARM CPUs implement different memory models, so what applies to one ARM design might not apply to another. ARMv8 is probably the easiest to support thanks to its new load-acquire and store-release instructions.
Again, not sure without further analysis. And no time to do it.
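For what it's worth, on the JVM side the required ordering can be expressed without dropping to Unsafe: publish the commit flag with a store-release and read it with a load-acquire, so a reader can never see the flag before the message bytes. A sketch with VarHandle (Java 9+), not the library's actual code:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    // Sketch: release/acquire publication of a "commit" flag so a reader can
    // never observe commit == 1 while the message bytes are still in flight.
    // On ARMv8 these map to stlr/ldar; on older ARM the JIT must emit barriers.
    public class CommittedSlot {
        private static final VarHandle COMMIT;
        static {
            try {
                COMMIT = MethodHandles.lookup()
                        .findVarHandle(CommittedSlot.class, "commit", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        private final byte[] payload = new byte[64];
        private int commit;  // 0 = empty, 1 = committed

        /** Writer: fill the payload, then publish with release semantics. */
        public void write(byte[] message) {
            System.arraycopy(message, 0, payload, 0, message.length);  // assumes it fits in 64 bytes
            COMMIT.setRelease(this, 1);   // store-release: payload writes ordered before the flag
        }

        /** Reader: only touch the payload after an acquire load sees the flag. */
        public byte[] tryRead() {
            if ((int) COMMIT.getAcquire(this) == 1) {   // load-acquire pairs with the release above
                return payload.clone();
            }
            return null;
        }
    }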
You should make sure you truly understand what the word "latency" means.
Your story is not unique[1]. The basic concept is: latency != 1/throughput.
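To make it concrete: if a writer batches 1,000 messages and the reader drains them all 100 microseconds later, the throughput figure says 10 million msg/s, but every one of those messages still sat there for up to 100 microseconds. High throughput, poor latency.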
I have no interest in your code. Just from your README, I am sure you have made mistakes.
Did you ask yourself a few questions before publishing such an eye-catching headline? Such as:
1. What is the cost of one CAS operation (and maybe of a volatile access)?
2. What is the timing accuracy in Java?
3. What is the latency of one main-memory access?
4. OK, and more pitfalls from Java, the OS and micro-benchmarking on top of that...
If you had a basic understanding of questions #1-3, you would not claim "20 ns is possible. The best I've gotten with a bare bones optimized test was about 16 ns" yourself.
Seriously. Saying things like that makes you look really bad - no one will take you seriously.
I got that 16 ns with cache-line-aligned, carefully tuned assembler. No false sharing. And it most definitely didn't use CAS, but FAA. I used the CPU's TSC timers.
That said, I'm pretty sure you can achieve close to that in Java as well. 20-40 ns latency between two CPU cores on same socket wouldn't surprise me at all. Throughput can be even higher, if you start to batch messages.
Anyways:
1) Cost of CAS depends on contention. About 15 ns in any case without contention.
2) Timing accuracy in Java... Well, not that I really use Java much, but I'd imagine it's exactly the same as in C++, if you can access the TSC somehow. Nothing prevents you from ping-ponging messages between two threads a few million times either -- that way even a low-resolution timer is more than enough (see the sketch after this list). Yes, 20-ish ns latency is doable.
3) Who cares about main-memory latency? The point is to communicate via cache-line sharing; going to main memory would defeat that.
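To spell out what I mean by ping-ponging (my own sketch, the numbers and names are mine): bounce a counter between two threads a few million times and divide the elapsed time by the number of hops, so even a coarse timer gives a solid average:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of a ping-pong latency test: two threads bounce a counter back
    // and forth; total time / hops gives the average one-way hand-off latency.
    public class PingPong {
        static final int ROUNDS = 5_000_000;
        static final AtomicLong ball = new AtomicLong(0);

        public static void main(String[] args) throws InterruptedException {
            Thread ponger = new Thread(() -> {
                long expect = 1;
                while (expect <= 2L * ROUNDS) {
                    while (ball.get() != expect) { Thread.onSpinWait(); }  // wait for ping
                    ball.set(expect + 1);                                  // pong back
                    expect += 2;
                }
            });
            ponger.start();

            long start = System.nanoTime();
            long expect = 0;
            for (int i = 0; i < ROUNDS; i++) {
                ball.set(expect + 1);                                      // ping
                while (ball.get() != expect + 2) { Thread.onSpinWait(); }  // wait for pong
                expect += 2;
            }
            long elapsed = System.nanoTime() - start;
            ponger.join();
            System.out.printf("avg one-way hop: %.1f ns%n", elapsed / (2.0 * ROUNDS));
        }
    }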
Hi vardump, my real point here is simple: latency != 1/throughput. Can you accept this viewpoint?
1. The context here is Java. Assembler has nothing to do with it. I assume you are the author of this library. If not, let me know.
2. I spent a little time looking at your perf source[1]. Tell us: what ping-pongs did you actually do?
3. Even if your micro-test threads busy-wait on their cores, the OS still has the right to schedule other threads onto them. Who guarantees you don't take a cache-line miss the next time your thread runs? Avoiding that takes relatively complex steps in Java, and it is obviously not shown anywhere in your code or notes.
4. Cache lines are not shared directly; they are kept coherent by MESI or its variants[2]. Your busy waiting causes a storm of coherence traffic (RFO requests), which is expensive in theory. There are techniques that relieve this pain, but that is another topic. You need to show these latencies, or at least explain why you ignore them when claiming nanosecond-level numbers.
5. If "15 ns in any case without contention" is right, does that leave 1 ns for everything else? The cost of the cache coherence above? The cost of function-call prologues/epilogues in HotSpot-JITed code? The overhead of volatile forbidding some JIT optimizations? OK, and shall we count GC as well?
6. Ping-pong is still not good enough for nanosecond-level latency measurement, because you have the "endpoint effect". At the very least you need to copy the message, and what is the latency of that memory copy? (Note again: latency != 1/throughput.)
7. Finally, your code just shows the throughput of batch writes/reads to and from memory via Java's mmap facilities. That is not point-to-point latency in any common sense of the term.
I'll just leave HN readers to make their own judgement: by the same logic, I could announce a picosecond message-passing library, because I have achieved TB-level L1 cache throughput on my Haswell core.
Nanosecond? I don't think so... Unless by having nanosecond latency you mean that it has a latency that is representable in nanoseconds... but that's not what it means to me. Tens or hundreds of microseconds I would buy, but not nanosecond.
If you busy poll, getting to 50 ns latency from producer to consumer is easy, 20 ns is possible. The best I've gotten with a bare bones optimized test was about 16 ns. You tend to get best results with high frequency dual core CPUs with hyper-threading disabled.
A bigger issue is that the reader needs to busy poll, consuming 100% CPU time on one core. Maybe energy consumption could be reduced by using monitor/mwait - not sure whether that's possible from user mode or only in the kernel.
Another issue in this implementation is the use of compare-and-swap. For this purpose, fetch-and-add (LOCK XADD on x86) would be more efficient; multiple-writer contention is much worse with CAS.
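Concretely, the difference I mean, as a sketch (not the library's actual code): claiming the next slot with getAndAdd maps to a single LOCK XADD that always succeeds, whereas a CAS loop makes losing writers retry under contention:

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch contrasting two ways to claim the next record slot.
    public class SlotClaim {
        private final AtomicLong writeSequence = new AtomicLong(0);

        /** Fetch-and-add: one LOCK XADD on x86, succeeds exactly once per writer. */
        long claimWithFaa() {
            return writeSequence.getAndAdd(1);
        }

        /** Compare-and-swap loop: under contention, losing writers spin and retry. */
        long claimWithCas() {
            while (true) {
                long current = writeSequence.get();
                if (writeSequence.compareAndSet(current, current + 1)) {
                    return current;
                }
                // lost the race: reload and try again
            }
        }
    }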
Record size is also fixed; you can't have messages of different sizes.
Overhead per message is 12 bytes: two ints that can only ever be 0 or 1, plus an int called Metadata. Seeing that the commit and rollback fields only need a single byte each, 2 bytes of overhead for commit and rollback should have been enough. You're going to get false sharing whether it's an int or a byte anyway.
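Purely to illustrate the 2-byte idea (this is NOT the library's actual layout, and plain ByteBuffer accesses don't give you cross-thread ordering by themselves - you'd still want release/acquire publication like the VarHandle sketch above):

    import java.nio.ByteBuffer;

    // Illustration of a 2-byte record header: one byte for commit, one byte
    // for rollback, payload follows.
    public class CompactRecord {
        static final int COMMIT_OFFSET   = 0;
        static final int ROLLBACK_OFFSET = 1;
        static final int PAYLOAD_OFFSET  = 2;

        static void writeRecord(ByteBuffer buf, int recordStart, byte[] payload) {
            buf.position(recordStart + PAYLOAD_OFFSET);
            buf.put(payload);
            buf.put(recordStart + ROLLBACK_OFFSET, (byte) 0);
            buf.put(recordStart + COMMIT_OFFSET, (byte) 1);   // commit last
        }

        static boolean isCommitted(ByteBuffer buf, int recordStart) {
            return buf.get(recordStart + COMMIT_OFFSET) == 1;
        }
    }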
Just to throw it out there, I thought communication across 2 cores on different sockets is generally regarded as impossible in less than 100ns regardless of language? If you're staying in L3 on a single socket, I could believe it, maybe, but at a certain point you're tuning the benchmark rather than the application.
For low latency messaging like this you'd typically want to avoid cross socket data transfers. However, IIRC cacheline snooping across sockets is a bit quicker (Intel QPI at least) than remote dram access so if the lines stay in cache the communication may be under 100ns. Also, depending on access pattern, hardware prefetch may hide some of the latency.
> You better have serious throughput to justify that....
It is often about latency not just throughput. Although sometimes they go hand in hand. For example you can achieve pretty high throughput if you take the whole network stack outside the kernel and talk directly to the network card.
Do you have a particular reason to claim sub-microsecond IPC isn't possible? Because I've written/used systems that performed IPC measured in tens of nanoseconds, to the best of my ability to measure it.
High-speed polling and/or disabling interrupts is totally a valid solution, especially at throughputs high enough that there will always be messages waiting; in that case there is actually no reason to be interrupt-driven. No sleeping, no waiting, no waste.
Or your thread has something else it could be doing while messages are unavailable. It sounds like you haven't thought much about message passing at all and could probably be a little more humble.
I have written a number of message queues, none of which had anywhere near the number of events required to justify polling like this. All of the message queues I've written would have been using tons of extra CPU if I had used a polling solution...
I realize that there probably are use cases out there that justify this, but I have not been tasked with one...
For sure if you can avoid it. But sometimes you can't. In those cases you need inter-thread communication to be as fast as possible.
A reason inter-process communication is interesting, especially in Java, is that there is a reasonably common pattern of isolating low-latency code in one JVM and less sensitive code in another, to prevent system-wide pauses from bleeding across.
This can be super painful for development iteration, though, so one thing that is interesting about this library is that it seems to make that a configuration-time concern.
It may also be beneficial to move some processing to a different core (perhaps even a different socket) to avoid contention for execution resources (e.g. d/i-cache, execution ports, etc.).