We explain the optimization techniques used to set three world speed records. Using a combination of code generation and hardware-specific optimizations, we achieved a 20x speedup over hand-tuned assembly. These techniques depend on two things: (1) exploiting domain-specific dependencies that are too specialized for a compiler to detect and too tedious for a programmer to exploit by hand, and (2) knowing how to profile the operations your CPU actually performs. These optimizations can be applied to CPU-bound code in any compiled language, across a wide range of analytics problems.
David Richardson has been optimizing distributed systems and analytics pipelines for the last 15 years. He has set several world records in scientific computing. He currently develops automated trading systems and analytics at SIG, a proprietary trading firm located in Philadelphia.