If you work hard enough, you can use multiple DMA channels and stream data in and out of the GPU simultaneously; however, it needs a datacenter-class card (Tesla, Radeon Instinct, etc.). Also, in Nvidia's case, you can only do this with CUDA, because Nvidia doesn't want you to get full performance out of these cards without it.
Explain? I didn't think there was any way around paying the PCIe cost. If your data is hosted in memory/cache, and the computation takes less than a few microseconds, I didn't think there was any way to make GPUs competitive.
It depends on what you’re doing. I’m strongly against moving to GPU just for the sake of it.
However, in some cases, where the GPU brings substantial performance benefits to a long-running (hours, days, etc.) application, you can set up pinned memory regions that map to the GPU. You feed the input into one region, and the GPU reads it directly via DMA. If the GPU has multiple DMA engines, you can stream data in, through the kernel, and out simultaneously, and read the results from a second pinned memory region.
If the speedup is big enough, and you can eat the one-time startup cost, the PCIe cost can become negligible.
However, it’s always horses for courses, and it may not fit your case at all.
In my case, Eigen is already extremely fast for what I do on the CPU, and my task doesn't run long, so adding a GPU doubles my wall clock time.
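The pinned-memory streaming described above looks roughly like this in CUDA (a minimal sketch; the `process` kernel, buffer size, and single-stream setup are placeholders for illustration, since overlapping copies on multiple DMA engines would use several streams):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel standing in for the real workload.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *h_in, *h_out;   // pinned (page-locked) host buffers
    float *d_in, *d_out;   // device buffers

    // Pinned allocations let the GPU's DMA engines copy
    // asynchronously while the CPU keeps working.
    cudaHostAlloc(&h_in,  N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&h_out, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    // With separate copy engines, the H2D copy, the kernel, and the
    // D2H copy of different chunks can all overlap; one stream shown.
    cudaMemcpyAsync(d_in, h_in, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    process<<<(N + 255) / 256, 256, 0, stream>>>(d_in, d_out, N);
    cudaMemcpyAsync(h_out, d_out, N * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("%f\n", h_out[42]);

    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    cudaStreamDestroy(stream);
    return 0;
}
```

The point of the pinned buffers is that `cudaMemcpyAsync` can return immediately and the copy proceeds by DMA; with pageable memory the driver has to stage through an internal pinned buffer and the overlap largely disappears.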
PCIe cost is typically in the noise for realistic applications. There are ways to reduce latency and signal flags across PCIe, but you should do most of the workload on the GPU to keep latency down.
I'm guessing the guy criticizing latency sensitivity is talking about the applications where that does matter. There are plenty of applications in, say, finance where they need to run a 20-parameter dot product and get the result back in nanoseconds, and I don't think GPUs can ever handle workloads like that.
That's true, but for those types of applications you're looking at a custom ASIC in most cases. High-nanosecond or very-low-microsecond latencies are FPGA territory, and low microseconds and above are suitable for a GPU.
This video asserts that FPGAs do the 10 ns work, and C++ does the >100 ns work. I've never heard of ASICs being used, since the deployment cycle is too frequent to justify the expense of fabbing.