Falcosoft wrote on 2025-05-27, 17:19:
swaaye wrote on 2025-05-27, 17:06:
JC was making hardware obsolete with the best of them back in the day, even with supremely optimized code.
Yeah, low level (over?) optimized code is usually microarchitecture specific. Like in the case of Quake 1/2, which were optimized specifically for the Pentium's pipelined FPU and so made contemporary Cyrix CPUs obsolete...
Back in the day Carmack and even Microsoft would have multiple code paths depending on feature support.
E.g. Windows XP required an instruction from a Pentium, but 2k and NT would use it if it existed, or take the longer path if it didn't.
CMPXCHG8B, or something like that. Someone patched XP to support 486s not that long ago.
One thing that might be interesting is how AI affects compiler optimisation of code.
Compiling might be a bit of a circular process instead of one and done.
I.e. you hit compile, and there could be a dozen ways to get working code, and you basically just hope the compiler gets it right, but it lacks real understanding of which code is the most used.
The "main stream" as it were.
Presumably, someone will build an AI that profiles the code during runtime, and loops back to test recompiled variants before settling on a fixed binary. It should also make these alternate branches like AVX2 vs no AVX2 more automatic and optimal.
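Part of that loop already exists in conventional tooling: GCC and Clang can do profile-guided optimisation (compile with -fprofile-generate, run the program, recompile with -fprofile-use), and GCC on x86 Linux can clone a function per instruction set and pick the right clone when the program loads. A minimal sketch of that second part in C, with a made-up dot-product routine standing in for the real workload:

    #include <stddef.h>
    #include <stdio.h>

    /* GCC emits an AVX2 clone and a baseline clone of this one function,
       plus a small resolver that checks the CPU's feature bits once at
       load time and wires calls to whichever clone the machine can run. */
    __attribute__((target_clones("avx2", "default")))
    double dot(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    int main(void)
    {
        double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
        printf("%f\n", dot(a, b, 4));  /* runs the AVX2 clone on AVX2 machines */
        return 0;
    }

The manual alternative is to call __builtin_cpu_supports("avx2") yourself and branch to hand-written paths, which is roughly what the old Pentium-vs-486 code paths did with CPUID.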
It matters because, even though it doesn't really get much attention, processors spend a lot of, if not most of, their time on logistics, not math.
Load this memory, save this memory, send this memory.
The biggest hitters in optimisation tend to be found in the logistics, rather than in selecting the perfect algorithm (though they tend to go hand in hand).
It's a math factory with production lines. It is more efficient to load one piece of paper and do multiple calculations on it than to load a new page for every calculation.
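A quick sketch of the same idea in C (a plain 1024x1024 matrix sum, function names invented for illustration): both functions do identical math, but one walks memory in the order it is stored, so each cache line it loads gets used for several calculations before the next one is fetched, while the other jumps a full row ahead on every access and keeps the factory waiting on new pages.

    #include <stddef.h>
    #include <stdio.h>

    #define ROWS 1024
    #define COLS 1024

    static double m[ROWS][COLS];  /* static: 8 MB, too big for the stack */

    /* Column-first walk over a row-major array: nearly every access lands
       in a different cache line, so new "pages" keep getting fetched. */
    double sum_strided(double a[ROWS][COLS])
    {
        double s = 0.0;
        for (size_t j = 0; j < COLS; j++)
            for (size_t i = 0; i < ROWS; i++)
                s += a[i][j];
        return s;
    }

    /* Row-first walk: consecutive accesses hit the same cache line, so one
       load services several calculations before the next line is needed. */
    double sum_sequential(double a[ROWS][COLS])
    {
        double s = 0.0;
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                s += a[i][j];
        return s;
    }

    int main(void)
    {
        for (size_t i = 0; i < ROWS; i++)
            for (size_t j = 0; j < COLS; j++)
                m[i][j] = 1.0;
        printf("%f %f\n", sum_strided(m), sum_sequential(m));
        return 0;
    }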
If you have a good idea of which code is most used, and therefore which memory is most important, you can tweak the order of the code like a flowchart - where the chart is arranged around the main path with the lesser used branches off to the side.
Load the main path into cache, and it is super fast. Have the exact same process organised differently and it keeps having to load new parts, slowing the whole factory down.
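Compilers already expose crude knobs for exactly that; here's a small sketch using GCC/Clang extensions (the function names are invented for illustration) that marks the rarely-taken branch cold and hints the expected outcome, so the generated code keeps the main path straight-line and contiguous:

    #include <stdio.h>
    #include <stdlib.h>

    /* Mark the rarely-taken error path cold so the compiler moves it out of
       the hot code layout (GCC/Clang attribute). */
    __attribute__((cold)) static void handle_error(const char *msg)
    {
        fprintf(stderr, "error: %s\n", msg);
        exit(1);
    }

    static int process(int value)
    {
        /* Hint: the error branch is almost never taken, so keep the
           fall-through (main) path straight and cache-friendly. */
        if (__builtin_expect(value < 0, 0))
            handle_error("negative input");
        return value * 2;  /* the "main stream" work */
    }

    int main(void)
    {
        printf("%d\n", process(21));
        return 0;
    }

Profile-guided optimisation does the same thing automatically, from measured branch counts instead of the programmer's guess, which is where an AI-in-the-loop compiler would presumably slot in.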