However, if the popcounts in your workload remain independent, and you immediately need the values from them, I don't think this approach will help.
That looks like another bitset case, where you want to accumulate the popcounts over the whole bitset. That seems quite similar to the case in this article. (I still find it surprising that the hardware popcount instruction isn't fast enough to win on its own, but the numbers certainly show that.)
The case I'm talking about is when you need a single popcount value in isolation as part of some larger computation that doesn't otherwise involve popcounts.
For example, I've worked with libraries that manipulate variable-length data structures where each bit in a flags field indicates the presence of a chunk of data. To compute the size of the data, you need the popcount of the flags field. So you get the popcount (using __builtin_popcount when available), and then immediately use that value.
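To make that concrete, here is a minimal sketch of the pattern, not taken from any particular library. It assumes a 32-bit flags field and fixed-size chunks; CHUNK_SIZE, popcount32, payload_size and chunk_offset are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: every present chunk occupies CHUNK_SIZE bytes. */
#define CHUNK_SIZE 8u

/* Use the compiler builtin when available, otherwise a portable fallback. */
static inline unsigned popcount32(uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcount(x);
#else
    unsigned n = 0;
    while (x) {        /* clear the lowest set bit until none remain */
        x &= x - 1;
        n++;
    }
    return n;
#endif
}

/* Total payload size: one chunk per set bit in the flags field. */
static inline size_t payload_size(uint32_t flags) {
    return (size_t)popcount32(flags) * CHUNK_SIZE;
}

/* Byte offset of the chunk for flag `bit` (must be < 32): count the
   chunks whose flag bits sit below it. */
static inline size_t chunk_offset(uint32_t flags, unsigned bit) {
    uint32_t below = flags & ((UINT32_C(1) << bit) - 1u);
    return (size_t)popcount32(below) * CHUNK_SIZE;
}
```

In both helpers the popcount result feeds straight into the next computation, so there is only one value to count and nothing to batch across vector lanes.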
This can be true, but it depends on whether the rest of the workload is vectorized. If you already have the data in vector registers, or can make use of it there afterward, this approach turns out to be even more beneficial than the microbenchmark numbers imply.
We've been working with Wojciech's approach for a while now, and thus can even point to real-world code: https://github.com/RoaringBitmap/CRoaring/blob/master/src/co...