The Difficulty of Writing Fast Programs

Item 1: Writing fast programs is hard

Writing fast programs is hard because your semantics are generally useless now. Gone are niceties like dynamically managed memory, reference counted pointers, garbage collection, and any other constructs to hold your hand. Your program is going to look ugly. Polymorphism isn’t your friend. Every feature hates you. N.B. it’s not actually this bad usually, but it can certainly feel this way sometimes.

Item 2: Designing fast programs is hard

Even harder than writing the fast program, is conceiving it in the first place. You’ve just come up with this brilliant encoding scheme for your data? What do you do when your ideal word size is actually 65 bits. Came up with the perfect GPU kernel code? Sorry, it actually requires 1 extra SIMD lane than you have, or one less.

Designing fast programs is hard because you need to think at the hardware level, and you know what? Those hardware designers definitely didn’t have your problem in mind when they made their chip.

Item 3: Testing fast programs is hard

Ugh. So you’ve spent all that time tuning this stallion of a war horse program? It only works sometimes but SEGFAULTS randomly? Well why don’t you unit test better?

Oh, right. If your code is structured to be unit-testable, it’s unlikely that it’ll be blazingly fast too. Unless you’re some sort of demigod that can write testable fast code. Your code will refuse to orient itself in any unit testable fashion, and sure enough, when you refactor the world because of some cache miss in some inner loop, all your tests will commit seppuku.

Item 4: Making programs fast is hard

Your profiling can be as useful as a boy scout compass on a steel ship. Your code runs great on your machine, but flows like molasses on your friends machine (with newer hardware). You manage to make your program fast, but the inclusion of one extra element in your data set causes some anomalous branching your code does slows everything to a halt again. Worst, your program is slow because you failed at the design and the thing you optimized to death turned out to be the least significant factor.

The future of writing fast programs

My hope lies in something similar to Haskell. Give me a smart compiler that can optimize the code for me, decide how to execute it, and do the right thing. That lets me manage my own memory in a type safe way and will complain loudly if there is a possibility of leakage. That issues a warning if I may incur a cache miss because my data isn’t aligned properly. Better yet, it aligns the data for me. Give me a compiler that is largely functional but that inlines aggressively so I can have unit testable code without needing to restructure the entire program around my tests. Best of all, give me a compiler that automagically vectorizes my code and compiles the relevant pieces to GPU kernel code, all while keeping memory bandwidth in mind and suggesting changes to allow me to push more code to the GPU if I so choose.

Until I get this magical compiler, I suppose I will get cozy with C++ in the meantime.