@superfury: if you are compiling on Windows, I would recommend AMD CodeXL. It is free, available from AMD's website, and it is what I use a lot for profiling. It can show cycles spent at the instruction level in your program, including call graph information, and it works on release builds too. I have nothing against gdb profiling; I am just mentioning AMD CodeXL in case you have not heard of it.
@Scali: I was flying coast to coast yesterday and so I had time in the airplane to recode my core emulation to do something like this:
- I separated the EU and BIU emulation and I am not cheating anymore in terms of prefetching.
- for example, if the instruction takes 3 bytes but only the first is available in the prefetch buffer, the EU goes to sleep and keeps asking the prefetch: do you have my 2 extra bytes? The prefetch might not have them within the next 8 cycles because the bus might not have idle time. (I implemented this by repeatedly trying to execute the instruction; if the prefetch returns false, I bail out early.)
- once the prefetch has all the bytes, I execute the instruction, then I wait X cycles (X = execution time)
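
A rough sketch in C of the "EU keeps asking the prefetch" idea from the list above. Everything here is made up for illustration and is not taken from the actual emulator: the names `PrefetchQueue`, `EuFetch`, `pq_push`, `pq_pop` and `eu_collect`, the queue depth of 4 bytes, and the 6-byte instruction limit (prefixes ignored). It also keeps the bytes it has already pulled between attempts, which may differ from the bail-out-and-retry approach described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PQ_SIZE 4   /* assumed prefetch queue depth */
#define MAX_INSN 6  /* assumed longest instruction, prefixes ignored */

typedef struct {
    uint8_t data[PQ_SIZE];
    size_t head;   /* index of the oldest queued byte */
    size_t count;  /* number of bytes currently queued */
} PrefetchQueue;

typedef struct {
    uint8_t bytes[MAX_INSN];
    size_t need;   /* total length of the instruction being decoded */
    size_t have;   /* bytes collected so far */
} EuFetch;

/* BIU side: queue one fetched byte. Returns false if the queue is full. */
bool pq_push(PrefetchQueue *q, uint8_t b) {
    if (q->count == PQ_SIZE) return false;
    q->data[(q->head + q->count) % PQ_SIZE] = b;
    q->count++;
    return true;
}

/* EU side: take the oldest byte. Returns false if nothing is queued yet. */
bool pq_pop(PrefetchQueue *q, uint8_t *out) {
    if (q->count == 0) return false;
    *out = q->data[q->head];
    q->head = (q->head + 1) % PQ_SIZE;
    q->count--;
    return true;
}

/* Call once per EU cycle. Returns true once every instruction byte has
 * been collected; false means the EU stalls and must ask again later. */
bool eu_collect(EuFetch *eu, PrefetchQueue *q) {
    assert(eu->need <= MAX_INSN);
    while (eu->have < eu->need) {
        if (!pq_pop(q, &eu->bytes[eu->have])) return false;
        eu->have++;
    }
    return true;
}
```

The emulator loop would call `eu_collect` every cycle and only start counting execution cycles after it returns true, while the BIU calls `pq_push` whenever the bus has idle time.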
What this does not do yet, and I still need to code: for instructions that write out data, the write really happens at the end of the instruction's execution cycles, not at the beginning. So if an instruction takes 16+EA cycles (take AND memory, register), it really takes 12+EA cycles and then 4 more to write the byte out. So I could wait 12+EA cycles in the EU, then tell the BIU to write out a byte before I wait the 4 remaining cycles. This would also mean the prefetch activity would more closely match that of the real CPU.
This is also true for instructions that READ data. So "AND memory, register" in reality spends its cycles like this: EA + 4 (read byte) + 8 (execute) + 4 (write byte). So my EU has 4 stages now: new_instruction, execution, read_data and write_data. I am hoping to keep the bus busy at the correct times with this scheme. Unfortunately this also means that a lot of instruction decodes would have to be rewritten somehow, as I used to execute each instruction atomically. 🙁
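
To make the cycle split concrete, here is a minimal C sketch of a four-stage EU that walks through EA + 4 + 8 + 4 for "AND memory, register". The type and function names (`CyclePlan`, `Eu`, `eu_start`, `eu_tick`, `plan_and_mem_reg`) are hypothetical, and charging the EA cycles to the new_instruction stage is an assumption of this sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Stages in the order the cycles are spent. */
typedef enum {
    EU_NEW_INSTRUCTION, /* decode; EA cycles are charged here (assumption) */
    EU_READ_DATA,       /* BIU reads the memory operand */
    EU_EXECUTION,       /* internal ALU work, bus free for prefetch */
    EU_WRITE_DATA       /* BIU writes the result back */
} EuStage;

typedef struct {
    unsigned ea_cycles, read_cycles, exec_cycles, write_cycles;
} CyclePlan;

typedef struct {
    CyclePlan plan;
    EuStage stage;
    unsigned left;  /* cycles remaining in the current stage */
    bool busy;      /* false once the instruction has completed */
} Eu;

/* "AND memory, register": EA + 4 (read) + 8 (execute) + 4 (write). */
CyclePlan plan_and_mem_reg(unsigned ea) {
    CyclePlan p = { ea, 4, 8, 4 };
    return p;
}

static unsigned stage_len(const CyclePlan *p, EuStage s) {
    switch (s) {
    case EU_NEW_INSTRUCTION: return p->ea_cycles;
    case EU_READ_DATA:       return p->read_cycles;
    case EU_EXECUTION:       return p->exec_cycles;
    default:                 return p->write_cycles;
    }
}

/* Move past finished or zero-length stages; clear busy after the last. */
static void eu_settle(Eu *eu) {
    while (eu->busy && eu->left == 0) {
        if (eu->stage == EU_WRITE_DATA) {
            eu->busy = false;
        } else {
            eu->stage = (EuStage)(eu->stage + 1);
            eu->left = stage_len(&eu->plan, eu->stage);
        }
    }
}

void eu_start(Eu *eu, CyclePlan plan) {
    eu->plan = plan;
    eu->stage = EU_NEW_INSTRUCTION;
    eu->left = plan.ea_cycles;
    eu->busy = true;
    eu_settle(eu);
}

/* Spend one cycle and report which stage it was spent in. */
EuStage eu_tick(Eu *eu) {
    assert(eu->busy && eu->left > 0);
    EuStage spent = eu->stage;
    eu->left--;
    eu_settle(eu);
    return spent;
}
```

With this layout the BIU can look at the stage returned by `eu_tick` each cycle: the bus is claimed during `EU_READ_DATA` and `EU_WRITE_DATA`, and is available for prefetching otherwise.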