*However, if the popcounts in your workload remain independent, and you immediat...

JoshTriplett · on March 13, 2016

That looks like another case of a bitset, where you want to accumulate the popcounts over the whole bitset. That seems quite similar to the case in this article. (I still find it surprising that popcount doesn't have enough internal optimization to win, but the numbers certainly prove that.)

The case I'm talking about is when you need a single popcount value in isolation as part of some larger computation that doesn't otherwise involve popcounts.

For example, I've worked with libraries that manipulate variable-length data structures where each bit in a flags field indicates the presence of a chunk of data. To compute the size of the data, you need the popcount of the flags field. So you get the popcount (using `__builtin_popcount` when available), and then immediately use that value.
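To make that concrete, here is a minimal sketch of the pattern, not code from any particular library. The names `popcount32` and `payload_size` and the fixed `CHUNK_SIZE` of 16 bytes are hypothetical, chosen only for illustration; real formats often have per-flag chunk sizes. The only real API used is the GCC/Clang builtin `__builtin_popcount`, with a portable loop as a fallback elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration: every chunk has the same size. */
#define CHUNK_SIZE 16u

/* Count the set bits in a 32-bit flags word. */
static unsigned popcount32(uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Compiler builtin; may compile to a single instruction
       when the target supports one. */
    return (unsigned)__builtin_popcount(x);
#else
    /* Portable fallback: each iteration clears the lowest set bit. */
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
#endif
}

/* Size in bytes of the variable-length data described by `flags`:
   one chunk is present for each set bit. The popcount result is
   consumed immediately, so nothing else can overlap with it. */
static size_t payload_size(uint32_t flags)
{
    return (size_t)popcount32(flags) * CHUNK_SIZE;
}
```

Here a call such as `payload_size(0x5u)` counts two set bits and returns 32. The point is that there is exactly one popcount, and its result feeds straight into the next computation, so its latency matters more than its throughput.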