[0] still looks bad for complex (CGEMM, I assume). It doesn't have comparisons there, but Eigen's claim to be faster than any free BLAS is surely wrong. (I wouldn't doubt it's similar if it has the right loop structure and prefetching, and a few percent is within the normal variance of HPC runs anyway, so not worth worrying about.) Unfortunately it isn't convenient for me, since I'm not clever enough for C++, but I've nothing against any good free linear algebra implementation for a platform I'm using.
I can't comment on Eigen's performance for complex calculations, since my work lives in the real domain.
On the other hand, Eigen didn't claim to be faster than any BLAS when I was reading the informational side of their website. Nowadays, I just get the latest version, update my code, and go directly to their reference guide.
However, given that it's used by both CERN (in LHC ATLAS) and TensorFlow, I guess it's no slouch in terms of performance (it also genuinely amazed the heavy-BLAS-fan professor on my jury when the performance numbers came up). Another plus is Eigen's ability to be used in CUDA kernels (though I haven't tried that yet).
That said, it's insanely fast (almost BLAS-fast in my experience) for what I'm doing, and really well optimized (it shows when you use -O3, especially on the solver side). It also has neat capabilities like dynamically sized matrices and instant submatrix access (you can get a submatrix of a matrix, or patch a submatrix in, in a single call).
About RAM placement: from my experience, code becomes overly dependent on non-fragmented memory when it allocates big matrices and/or vectors naively (i.e. in a single contiguous block), and execution fails as the node you're running on (or your computer) gets busier with other code. Eigen counters this by intelligently dividing bigger matrices into smaller chunks, both optimizing access time and reducing (in my experience, effectively eliminating) the risk of segmentation faults from failed memory allocations caused by there not being enough contiguous free space in the address space.
> On the other hand, Eigen didn't claim to be faster than any BLAS when I was reading the informational side of their website.
It claims to be faster than any free BLAS in the FAQ (though it talks about ATLAS and GotoBLAS there). Given how close GotoBLAS, in the guise of OpenBLAS, gets to peak performance for GEMM, that claim would be dubious for anything measuring around 2/3 of OpenBLAS's DGEMM performance. I assume the speed claims apply to cases where BLAS is actually delivering its performance. (I'm too experienced to be susceptible to appeals to the authority of HEP software!)
Thanks for explaining the memory thing. I'd hope HPC code wouldn't be subject to SEGVs due to memory allocation failing (for two reasons), but you typically want arrays on the stack, and otherwise in large pages, to minimize TLB misses (per Goto).
I'm not intending to bash Eigen; I'm happy if people can use performant free software.
What's "RAM placement" in this context?