> The results follow my expectations: the simplest vectorized classification routine has the best performance. However, you may observe that even a rather naive SIMD approach can be quite fast in this instance.
I've recently written my first SIMD code [1]. This matches my observation: you get a big improvement just moving from scalar code to autovectorized code (i.e. working in fixed widths and telling the compiler to use specific CPU features you've detected), another decent improvement going to basic use of vendor intrinsics, and then more and more modest improvements from extra sophistication.
[1] uyvy->i420 pixel format conversion code, a much easier application of SIMD. No branching, just a bunch of pixels transformed in identical fashion.
I think most optimisation work is like that. Early effort with each technique can yield large gains. But the marginal gain of using any specific technique decreases over time.
For example, I've gotten a lot of speedups from essentially decreasing the number of malloc calls my programs make. It's often the case that ~80% of all allocations in a program come from just a few hotspots. Rewriting those hotspots to use a better data structure can yield big speed improvements, both because malloc & free are expensive calls and because the CPU hates chasing pointers. But there's usually only so much benefit in reducing allocations. At some point, it makes sense to just accept malloc calls.
The reason is totally logical. Let's say you reduce the number of allocations your program does by 10x and that yields a 45% performance improvement (10 seconds -> 5.5 seconds). Those numbers imply roughly 5 seconds were spent on allocation and 5 seconds on everything else, with the allocation share now down to 0.5 seconds. You might think reducing allocations by 10x again would yield another 45% performance improvement - but that's just not how the math works out. We should expect that would take your 5.5 seconds down to 5.05 seconds - which is only about an 8% improvement. That might not be worth it, given the next 10x reduction in malloc calls will probably be much harder to achieve.
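Here's a rough sketch of that arithmetic, in case it helps. The 5 s / 5 s split is my assumption, backed out from the 10 -> 5.5 second figure, and it also assumes allocation time scales linearly with the number of malloc calls, which real programs only approximate. The function names are made up for illustration.

```python
def runtime_after(alloc_seconds, other_seconds, reduction_factor):
    """Total runtime if allocation time shrinks by reduction_factor
    and the rest of the program is unchanged (an idealised model)."""
    return other_seconds + alloc_seconds / reduction_factor


def improvement(before, after):
    """Fraction of the previous runtime that was saved."""
    return (before - after) / before


# Assumed split: 5 s in malloc/free, 5 s in everything else.
ALLOC, OTHER = 5.0, 5.0

baseline = runtime_after(ALLOC, OTHER, 1)    # no reduction: 10 s
first = runtime_after(ALLOC, OTHER, 10)      # 10x fewer allocations
second = runtime_after(ALLOC, OTHER, 100)    # another 10x on top

print(f"{baseline:.2f}s -> {first:.2f}s: {improvement(baseline, first):.1%} faster")
print(f"{first:.2f}s -> {second:.2f}s: {improvement(first, second):.1%} faster")
```

Under this model no amount of allocation work can get you below the 5 seconds spent elsewhere, which is why the second 10x buys so little.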
If you want another 50% perf improvement, you need to run that profiler again and look at where the new hotspots are. If the CPU is spending less time following pointers, it'll now be spending more time (proportionately) running linear code. Maybe this time the performance wins will be found using SIMD. Or by swapping to a different algorithm. Or multithreading. Or by making better use of caching. Or something else - who knows.