The link you’ve posted doesn’t mean you can do large dense matmul in CUDA right away; it just means you can use fixed-size vector/matrix operations inside kernels (which might be useful if you’re writing graphics code and need some vec3/mat3/quats). You still have to write your own customized kernel for a large dynamic-sized GEMM computation (with tiling and shared memory and all that jazz), and at that point it’s best to just use cuBLAS.
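For the large dynamic-sized case, a cuBLAS call is only a few lines. A hedged sketch (untested here; assumes column-major device buffers `dA`, `dB`, `dC` already allocated and a valid `cublasHandle_t` — names are mine, not from any particular codebase):

```cuda
#include <cublas_v2.h>

// C = alpha * A * B + beta * C, where A is m x k, B is k x n, C is m x n,
// all column-major as cuBLAS expects.
void gemm(cublasHandle_t handle, int m, int n, int k,
          const float* dA, const float* dB, float* dC) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha,
                dA, m,    // lda = m
                dB, k,    // ldb = k
                &beta,
                dC, m);   // ldc = m
}
```

That gets you the tiling, shared-memory staging, and tensor-core dispatch for free, which is why hand-rolling a competitive GEMM is rarely worth it.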
> By default, when Eigen's headers are included within a .cu file compiled by nvcc most Eigen's functions and methods are prefixed by the `__device__ __host__` keywords making them callable from both host and device code.
Eigen casually overloads `*` to do any kind of multiplication, so I'd guess the GEMM code paths also get compiled into the kernel alongside everything else, but this needs to be tested.
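For the fixed-size case the docs describe, usage inside a kernel looks roughly like this. A minimal sketch, untested, assuming Eigen 3.3+ compiled with nvcc (the kernel and its names are hypothetical):

```cuda
#include <Eigen/Dense>

// Rotate n packed xyz points in place by a 3x3 matrix stored row-contiguously
// in column-major order at R_data. Eigen's operator* here is the fixed-size
// mat-vec product, callable on the device per the quoted docs.
__global__ void rotate_points(const float* R_data, float* pts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Eigen::Map<const Eigen::Matrix3f> R(R_data);
        Eigen::Map<Eigen::Vector3f> p(pts + 3 * i);
        p = (R * p).eval();  // .eval() avoids aliasing p while reading it
    }
}
```

Everything is statically sized, so no device-side allocation happens; it's essentially the vec3/mat3 case, not a path to dense GEMM.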
Considering the speed I already get from running Eigen on the CPU, though, I'd still need to find larger problems to make that effort worthwhile.