Reply 40 of 227, by Kronuz
Moe wrote:
L1 cache (or L2 cache, for that matter) is quite straightforward:
Imagine you load a value from memory into a CPU register. 3 cases can happen:
1) The value is not in any cache. The CPU needs to access the system bus/RAM chips and wait for the data. Worth about 40(!) CPU clock cycles on current CPUs.
2) The value is in the L2 cache. No bus access necessary, but still ~15 cycles.
3) The value is in the L1 cache. 3-5 cycles.
In each case, the L1 and L2 caches then contain the loaded value. Some other (older) value will be thrown out, if necessary.
You can see that memory-intensive code is highly dependent on the cache. The CPU uses automatic prefetch to load 32 or even 64 bytes at a time, so when you access a byte, chances are high that nearby bytes are already in the L1 cache.
A clever loop arrangement tries to access bytes that are near each other, and arranges calculation and loading so that while the CPU is waiting out those 40 cycles, it has some number-crunching to do on already-loaded values. HT technology is basically an attempt at doing exactly this: while one thread is accessing memory, another one can do some math. That's why well-optimized apps do not profit from HT; they already see to that themselves.
Thanks Moe 😀
It's funny, because that's what I thought L1 and L2 cache were, but I didn't know it for sure, and I didn't know it made such a big difference (40 cycles, that's a lot)... that also goes for code, right? I.e. code is loaded into the cache as well, I guess.
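Just to check that I've understood the locality part, here is a rough sketch of what I take "access bytes that are near each other" to mean (the buffer, its size and the stride are made up purely for illustration, they're not from my patch):

```c
#include <stddef.h>

#define BUF_SIZE (1024 * 1024)   /* made-up 1 MB buffer */
static unsigned char buf[BUF_SIZE];

/* Sequential pass: each 64-byte cache line is fetched once, and the
   following accesses to that same line are L1 hits. */
unsigned sum_sequential(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < BUF_SIZE; i++)
        sum += buf[i];
    return sum;
}

/* Strided pass: the same bytes, but visited 4 KB apart, so nearly
   every access lands on a cache line that hasn't been loaded yet. */
unsigned sum_strided(void)
{
    unsigned sum = 0;
    for (size_t start = 0; start < 4096; start++)
        for (size_t i = start; i < BUF_SIZE; i += 4096)
            sum += buf[i];
    return sum;
}
```

Both loops read exactly the same bytes, but the strided one should be much slower because it throws away the prefetched neighbours every time, if I'm following you correctly.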
Also, I suppose that when the CPU caches memory (the nearby bytes, as you said), it caches the following bytes, not the previous ones. That would explain a performance degradation I saw while profiling my patch when, instead of going forward in the line looking for differences, I was going backwards; for instance I was trying 'mem--; for(int i=val; i>0; i--) mem;' because it logically should be faster than 'for(int i=0; i<val; i++) mem;', but it isn't so. I suppose this is due to the fact that when the CPU loads 'mem', it probably loads roughly 'mem' through 'mem + 64' into the cache, but it doesn't know anything about the bytes that come before that; hence on every iteration it has to access the bus, as you said, to load the byte before the current one... otherwise I can't explain the performance degradation of that code (rough sketch of the two scans below). What do you think?
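For reference, this is roughly the shape of the two scans I'm comparing; the function names and the separate old/new line buffers are invented here just to make it self-contained, they're not the actual patch code:

```c
/* Forward scan: find the first byte where the old and new line differ.
   Walks upwards, the same direction the cache line gets filled/prefetched. */
static int first_diff(const unsigned char *oldline,
                      const unsigned char *newline, int len)
{
    for (int i = 0; i < len; i++)
        if (oldline[i] != newline[i])
            return i;
    return -1;   /* lines are identical */
}

/* Backward scan: find the last byte where the lines differ.
   Walks downwards, against the direction the prefetch (presumably) runs. */
static int last_diff(const unsigned char *oldline,
                     const unsigned char *newline, int len)
{
    for (int i = len - 1; i >= 0; i--)
        if (oldline[i] != newline[i])
            return i;
    return -1;   /* lines are identical */
}
```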
Kronuz
"Time is of the essence"