This is great. Always love seeing what can be accomplished with just a shell.
Wish I could get more time to work on micro-optimisations. These days you often get funny looks when you propose digging in and optimising sluggish code; "spin up more compute and call it a day", they say.
While that's completely understandable from a business perspective, it's somewhat unsatisfying as a developer, and the expertise that comes from it surely has lasting benefits too.
Would anyone who understands mind explaining the phrase "prefetch window"? I don't quite get what's meant by that from context, and can't seem to find anything from googling.
I'm just going off the article, here is my understanding.
The sample program is doing 2^7 = 128 NOPs, 4 at a time for 32 cycles, and then it is doing a memory access.
The address of memory that is going to be accessed is known right before doing the 32 cycles of "work", so a prefetch can be issued at that time.
The "prefetch window" is the number of cycles between issuing the prefetch and issuing the instruction that accesses the prefetched address. So it depends on the structure of the program being analyzed.
This is quite interesting. Is there anything preventing the collection of the number of cycles spent in an arbitrary function? It seems this is just a matter of identifying all branches.
Yes, that might be possible. However, you will probably get multiple cycle counts for the same function, depending on which path was taken. And it only works if the number of taken branches in the function is small (fewer than 32); otherwise it won't fit into the LBR stack. For example, a loop with more than 32 iterations will overwrite the LBR stack with backward jumps. But yeah, for small functions it might work pretty well.
I would rather analyze not the whole function (all of its basic blocks) but only the Hyper Blocks (the typical hot path through the function). Here is an example of how to do it: https://lwn.net/Articles/680985/, chapter "Hot-path analysis".
Superblocks rather than Hyperblocks. Except for cmov which is partial predication, x86 doesn’t have predication. But SBs are probably what your optimizer wants anyways.
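For what it's worth, the branch-stack data this kind of hot-path analysis rests on can be collected with perf's LBR sampling. A minimal sketch, where ./myprog is a placeholder for the binary under study:

```shell
# Record branch-stack (LBR) samples: -b requests branch records,
# -g adds call graphs so branch history can be reconstructed,
# cycles:u restricts sampling to user-space cycles.
perf record -b -g -e cycles:u -- ./myprog

# Walk the recorded branch stacks to see the hot paths taken.
perf report --branch-history
```

From there you can pick out the dominant path (the superblock) rather than enumerating every basic block.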