About caches:
Imagine a direct-mapped (non-associative) 4KB cache: a memory location can't be put just anywhere in the cache, but only in the slot that matches its lowest 12 bits. So address 0x0000 and address 0x1000 share the same cache slot. If you load a contiguous area, like a piece of linear code, this doesn't hurt, since consecutive addresses map to distinct slots. But if 0x0000 is in the cache and 0x1000 is to be loaded, then 0x0000 must be evicted, even if the rest of the cache is unused.
Now imagine a 16KB cache, 4-way associative. You can think of it as 4 direct-mapped caches, 4KB each. If 0x0000 is loaded and 0x1000 is to be loaded, 0x1000 is put in the second "way". 0x2000 and 0x3000 can also be loaded, and only when 0x4000 is loaded will one of the 4 previous entries in that set be thrown out.
4-way is quite common. If your code+data fits into 64KB (even if split across 2-4 distinct memory blocks), practically every CPU since the Pentium II will fit it into its L1 cache. Use a Linux machine and OProfile to get a detailed profile of which locations suffer the most cache misses.
GCC will obey branch hints given via __builtin_expect (commonly wrapped in likely()/unlikely() macros, as in the Linux kernel) and put the unlikely code path out of the way, arranging code flow for minimal branching. If you use PGO (-fprofile-generate, then -fprofile-use), you don't even need the hints -- the profile data will be used to decide which code path is the hottest (that was the very first use of PGO). Use as new a GCC as you can get, as the PGO features are still quite young and new profile-guided optimization steps are added all the time.