VOGONS


Reply 20 of 49, by Scali

Rank l33t

You said this:
"PCE, the emulator that I am basing my own emulator on, is what is called a cycle-accurate emulator. That is, it implements the correct clock cycles for the 8086 and 80186 CPUs."
Firstly, PCE is not what we call a cycle-accurate emulator.
Secondly, no, it does not implement the correct clock cycles for 8086 and 80186 CPUs. Not even if we disregard the rest of the system.
It takes the cycle ratings from the manual, and implements them as 'absolute truth', without paying attention to how the manual intended these cycles to be interpreted. Namely, these are 'best-case', and do not account for any additional cycles added by the BIU, the prefetch-buffer being empty, EA having to be calculated etc.
See the manual here: http://matthieu.benoit.free.fr/cross/data_she … sers_Manual.pdf
On page 2-50. I couldn't find much of this behaviour in any emulator code.
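To illustrate what 'best-case' means for the EA part alone, here is a minimal Python sketch. The clock counts are the ones I recall from the effective-address table in the 8086 Family User's Manual, so check them against the manual before relying on them. The function name and argument layout are made up for illustration and do not come from any emulator.

```python
def ea_clocks(base=None, index=None, disp=False, segment_override=False):
    """Extra clocks for effective-address calculation (8086 manual figures).

    base is "BX" or "BP", index is "SI" or "DI", disp says whether a
    displacement is present. This is a hypothetical helper, not emulator code.
    """
    if base not in (None, "BX", "BP") or index not in (None, "SI", "DI"):
        raise ValueError("unsupported addressing mode")
    if base is None and index is None:
        if not disp:
            raise ValueError("no memory operand")
        clocks = 6                      # displacement only
    elif base is None or index is None:
        clocks = 9 if disp else 5       # single base or index register
    else:
        # BP+DI and BX+SI are one clock cheaper than BP+SI and BX+DI.
        cheap_pair = (base, index) in (("BP", "DI"), ("BX", "SI"))
        clocks = (7 if cheap_pair else 8) + (4 if disp else 0)
    if segment_override:
        clocks += 2                     # segment override prefix
    return clocks


# Best case for "AND memory, register" (listed as 16+EA) with [BX+SI+disp]:
best_case = 16 + ea_clocks(base="BX", index="SI", disp=True)
```

Even this number is still a best case: it assumes the instruction bytes are already sitting in the prefetch queue and that the bus is free when the operand is read and written.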

So, your statements were and are inaccurate.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 21 of 49, by SoftPCMuseum_

Rank Newbie
Scali wrote:

You said this:
"PCE, the emulator that I am basing my own emulator on, is what is called a cycle-accurate emulator. That is, it implements the correct clock cycles for the 8086 and 80186 CPUs."
Firstly, PCE is not what we call a cycle-accurate emulator.
Secondly, no, it does not implement the correct clock cycles for 8086 and 80186 CPUs. Not even if we disregard the rest of the system.
It takes the cycle ratings from the manual, and implements them as 'absolute truth', without paying attention to how the manual intended these cycles to be interpreted. Namely, these are 'best-case', and do not account for any additional cycles added by the BIU, the prefetch-buffer being empty, EA having to be calculated etc.
See the manual here: http://matthieu.benoit.free.fr/cross/data_she … sers_Manual.pdf
On page 2-50. I couldn't find much of this behaviour in any emulator code.

So, your statements were and are inaccurate.

OK, so maybe then it isn't 100% to the letter, but that's still far better than many other emulators which don't even go so far as to emulate the clock cycles at all (let alone even the ones from the manual). I would much rather have the clock cycles taken from the manual than simply not emulated at all.

Reply 22 of 49, by Scali

Rank l33t
SoftPCMuseum_ wrote:

OK, so maybe then it isn't 100% to the letter, but that's still far better than many other emulators which don't even go so far as to emulate the clock cycles at all (let alone even the ones from the manual). I would much rather have the clock cycles taken from the manual than simply not emulated at all.

You mean it's less completely horrible than most other PC emulators out there, but still pretty damn horrible compared to emulators for other systems, such as VICE and UAE.
I think the world of PC emulation just has an attitude problem. They don't know how to build proper emulators, and they don't know that they don't know how to build proper emulators.

Reply 23 of 49, by Azarien

Rank Oldbie

I think the world of PC emulation just has an attitude problem.

I think you don't understand the attitude and the needs of PC emulation.

There is no need for cycle-exact emulation of the whole machine, because there's simply no one true PC machine, and it has been that way since day two.

Because of that, almost all of PC software is speed-independent (at least for some range of speeds) so most of the time "as fast as can be" emulation is desired. And when it is too fast, abstract "cycles" dosbox-style do the work.

Reply 24 of 49, by Scali

Rank l33t
Azarien wrote:

I think you don't understand the attitude and the needs of PC emulation.

Yes I do.

Azarien wrote:

There is no need for cycle-exact emulation of the whole machine, because there's simply no one true PC machine, and it has been that way since day two.

Yes, but we are at day one, not day two.

Azarien wrote:

Because of that, almost all of PC software is speed-independent (at least for some range of speeds) so most of the time "as fast as can be" emulation is desired. And when it is too fast, abstract "cycles" dosbox-style do the work.

The fact that it happens to work for some amount of software is no excuse.
Even so, there's a lot of stuff that can trip up DOSBox, even if not everything is directly related to cycle-exact emulation.
It is about accurate emulation in general. Too many corners are being cut.

Take UAE, for example. They have 'day one' Amigas, the group of 500/600/1000/2000, which are all the same down to the cycle level.
Then there's 'new' Amigas, where you can choose to only emulate certain parts of the hardware accurately, and run eg the CPU emulation faster, to mimic an Amiga with a turboboard.
Something like that should also exist for PCs... A proper cycle-exact emulator for early 5150/5155/5160 and 100% compatibles. And then another option for turbo XTs, ATs and beyond.

It is simply inexcusable that for the most popular personal/home computer platform out there, nobody seems to be capable of delivering a proper emulator. The C64/Amiga people apparently have much higher standards/skills.
The ideal emulator can run any software that can run on physical machines. VICE and UAE clearly strive for that, and are 99.999% there already. Why is there no such equivalent in the PC world?

Reply 25 of 49, by superfury

Rank l33t++

So if I understand all this correctly it's simply a case of missing documentation? If there was a proper documentation with exact cycles for all instructions and hardware, a fully cycle accurate PC emulator should have long existed by now? But it doesn't, simply because there is no (proper) documentation that has enough detail to fully implement it 100% accurate?

How could this even have happened, considering that the PC platform is among the most used at the moment? Was IBM simply lax in its documentation? Or is it more a matter of 'open' closed source (open in the case of the architecture, closed in the case of the individual parts, e.g. the CPU)?

Also, why hasn't anyone just cracked open the 8086/8088 and retrieved all the required formulas etc. by examining the electronics, using some computer program to extract that data (like they did with the OPL2 tables)? Simply because of copyright on a CPU nobody even uses anymore? That seems unlikely, seeing as newer chips have been reverse-engineered by now, or at least partially (I don't know of any myself, but I'm sure there are some, considering the age of the CPU itself). I mean, more is known about modern ARM CPUs than about the 30-year-old 8086, which is implemented in all x86 CPUs nowadays as a compatibility mode (real mode/virtual 8086 (V86) mode)?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 26 of 49, by Scali

Rank l33t
superfury wrote:

So if I understand all this correctly it's simply a case of missing documentation? If there was a proper documentation with exact cycles for all instructions and hardware, a fully cycle accurate PC emulator should have long existed by now? But it doesn't, simply because there is no (proper) documentation that has enough detail to fully implement it 100% accurate?

That's a fallacy. For C64 and Amiga the hardware is not documented down to every quirk by the manufacturer either.
A lot of tricks, techniques, bugs and quirks were never even known by the designers of the machines, and were discovered by people tinkering with the hardware and unlocking new features (much like what we did with 8088 MPH... which is only the beginning. We've uncovered some more features/tricks/bugs since, and people are likely to find more over time).
I think it is mainly because in the heyday of the 8088-based PCs, all the 'tinkerers' were playing with C64, Amiga and similar, and didn't care about PCs, because it was a far less powerful platform.
It's a cultural difference.

Reply 27 of 49, by vladstamate

Rank Oldbie
Scali wrote:

I think the world of PC emulation just has an attitude problem. They don't know how to build proper emulators, and they don't know that they don't know how to build proper emulators.

Scali wrote:

It is simply inexcusable that for the most popular personal/home computer platform out there, nobody seems to be capable of delivering a proper emulator. The C64/Amiga people apparently have much higher standards/skills.

Ok, Scali, that is it, challenge accepted! 😀

I am going to fork my emulator code, and have it just emulate the IBM PC. No PCjr, 186, 286, 386, AdLib, etc. Just the IBM PC. My code already emulates the bus, the prefetch queue (on bus idle cycles), memory wait states, CGA memory vs conventional RAM speed, the 8-bit bus, and instruction cycle timing (including accurate EA calculation timing). There are still quite a few things which I do not do at the moment and lack support for (like CGA per-pixel timing emulation and better instruction timing). I will work on fixing those.

YouTube channel: https://www.youtube.com/channel/UC7HbC_nq8t1S9l7qGYL0mTA
Collection: http://www.digiloguemuseum.com/index.html
Emulator: https://sites.google.com/site/capex86/
Raytracer: https://sites.google.com/site/opaqueraytracer/

Reply 28 of 49, by superfury

Rank l33t++

So essentially someone still only needs to find the exact formulas for those ranges (like mul/div)? Since the other values are given, those formulas should be able to fill the gaps in the instruction information, and thus complete accurate cycle counts (excluding bus etc.)?

Reply 29 of 49, by Scali

Rank l33t
superfury wrote:

So essentially someone still only needs to find the exact formulas for those ranges (like mul/div)? Since the other values are given, those formulas should be able to fill the gaps in the instruction information, and thus complete accurate cycle counts (excluding bus etc.)?

Well, I think for starters, the rest should be emulated properly 😀
I mean, most emulators don't even try to do pixel-accurate emulation or sync the different components properly (CRTC, PIT, DMA controller etc).
Before reenigne's NTSC-composite patch, no emulator even bothered to emulate CGA composite mode accurately.

So even if there may be some details that we don't know exactly yet, there's a lot of stuff that we DO know, but isn't emulated properly anyway.
Once you have that going, you should already be 95% there. I don't even think that 8088 MPH would need 100% cycle-accurate CPU emulation. Aside from the fact that the speed-sensitive effects only use a few instructions, you can probably get away with being a cycle off here and there. The crazy thing about CGA waitstates is that they sometimes 'auto-calibrate' your code. That is, you may hit a waitstate that you normally wouldn't, which slows down your routine, so effectively it runs at the same speed again.

Reply 30 of 49, by superfury

Rank l33t++

So I'm already pretty far with my emulator? Except for 80(1)8X cycle accuracy and bus speeds, everything is already synchronized. At the moment it's just still a bit too slow to run at full speed (in particular, the PC speaker audio emulation and the VGA CRT emulation are slow enough to make audio stutter). I'm busy optimizing the VGA, though: profiling using gdb told me that the pixel renderer of the active display (VGA_ActiveDisplay) took 1.5 times 0.01ms per call. I got this down to 0.00 (according to the gdb profiler), although the VGA doesn't seem to go any faster. It still slows down to 10-15FPS for some reason during some parts (graphics) of 8088 MPH.

Also even though it should be rendering a lot faster, it still doesn't go higher than 37FPS with automatic host CPU speed adjustment (adjusting the VGA rendering speed to let the CPU emulation run at ~100% if possible).

This might also simply be because of the low-pass filter (currently lowered to ~16kHz (1.19MHz/72) for better accuracy with 8088 MPH's 72-sample algorithm) being heavy on the CPU emulation, since it's called at a 1.19MHz frequency.
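As a rough illustration of the numbers involved, here is a minimal Python sketch of a filter run at the PIT rate with a cutoff of PIT rate / 72. The post does not say what kind of filter is used, so the one-pole (RC-style) filter and all names below are assumptions made purely for the example.

```python
import math

# PIT input clock: the 14.31818 MHz crystal divided by 12, about 1.19 MHz.
PIT_HZ = 14_318_180 / 12

# Cutoff mentioned in the post: PIT rate / 72, roughly 16.6 kHz.
cutoff_hz = PIT_HZ / 72


def one_pole_lowpass(samples, sample_rate_hz, cutoff):
    """First-order low-pass filter (assumed filter type, for illustration)."""
    rc = 1.0 / (2.0 * math.pi * cutoff)   # equivalent RC time constant
    dt = 1.0 / sample_rate_hz             # time between input samples
    alpha = dt / (rc + dt)                # smoothing factor per sample
    out = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)              # move a fraction alpha towards x
        out.append(y)
    return out


# One speaker level change, filtered at the full PIT rate.
step_response = one_pole_lowpass([1.0] * 200, PIT_HZ, cutoff_hz)
```

Even a filter this cheap runs about 1.19 million times per emulated second, which is why it shows up in profiles. Accumulating speaker samples and filtering at a lower rate would be one possible way to cut that cost, at some loss of accuracy.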

Also, it slows down a lot during profiling with -g -pg flags.

And I also notice that it still becomes slower without mounting input (mouse&keyboard, hiding the mouse, using middle mouse button or both mouse buttons to (un)mount). When mounted it runs a lot faster for some reason? Some bug in SDL that causes heavier reads from SDL Events when mouse is visible or not grabbed?

Reply 31 of 49, by vladstamate

Rank Oldbie

@superfury: if you are compiling stuff on Windows I would like to recommend AMD CodeXL. It is free and available from AMD's website, and it is what I use a lot for profiling. It can show cycles spent at instruction level in your program, including call graph information, and it works on release builds too. I have nothing against gdb profiling, I am just mentioning AMD CodeXL in case you have not heard of it.

@Scali: I was flying coast to coast yesterday and so I had time in the airplane to recode my core emulation to do something like this:

- I separated the EU and BIU emulation and I am not cheating anymore in terms of prefetching.
- for example, if the instruction takes 3 bytes but only the first is available in the prefetch buffer, the EU goes to sleep and keeps asking the prefetch: do you have my 2 extra bytes? The prefetch might not have them within the next 8 cycles, because the bus might not have idle time. (I implemented this by repeatedly trying to execute the instruction; if the prefetch returns false, I bail out early.)
- once the prefetch does have the bytes, I execute the instruction and then wait X cycles (X = execution time)
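The scheme in the list above can be sketched as a toy model in Python. This is not the emulator's code; it assumes a 4-byte queue, 4 bus cycles per fetched byte, no EU data accesses and no queue flush on jumps, and it describes each instruction as a hypothetical `(length_in_bytes, exec_cycles)` pair.

```python
QUEUE_SIZE = 4     # 8088 prefetch queue: 4 bytes
FETCH_CYCLES = 4   # one bus cycle (T1..T4) per byte on the 8-bit bus


def simulate(program, max_cycles=1000):
    """Run (length, exec_cycles) instructions; return (total_cycles, log).

    Each log entry is (cycle, bytes_in_queue_after_this_cycle, eu_state).
    Assumes a non-empty program with length >= 1 and exec_cycles >= 1.
    """
    queue = 0                 # bytes currently in the prefetch queue
    fetch_left = 0            # cycles left in the current fetch (0 = idle)
    pc = 0                    # index of the instruction the EU works on
    need = program[0][0]      # instruction bytes the EU still has to take
    exec_left = 0             # execution cycles left for this instruction
    log = []
    for cycle in range(1, max_cycles + 1):
        # BIU: start a fetch whenever the bus is idle and the queue has room.
        if fetch_left == 0 and queue < QUEUE_SIZE:
            fetch_left = FETCH_CYCLES
        if fetch_left > 0:
            fetch_left -= 1
            if fetch_left == 0:
                queue += 1    # the fetched byte lands in the queue
        # EU: execute, take one byte from the queue, or wait for the BIU.
        if exec_left > 0:
            exec_left -= 1
            state = "execute"
            if exec_left == 0:
                pc += 1
                if pc < len(program):
                    need = program[pc][0]
        elif queue > 0:
            queue -= 1
            need -= 1
            state = "take byte"
            if need == 0:
                exec_left = program[pc][1]
        else:
            state = "wait"
        log.append((cycle, queue, state))
        if pc == len(program):
            return cycle, log
    return max_cycles, log


# A 5-byte, 16-cycle instruction followed by a 3-byte, 4-cycle one.
total, log = simulate([(5, 16), (3, 4)])
```

Under these assumptions the two instructions take 43 cycles in total, against the 20 you get by simply adding up their execution times; the difference is the EU waiting on the BIU and pulling bytes out of the queue.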

What this does not do, and I still need to code: for instructions that write out data, this is really done at the end of instruction execution cycles not at the beginning. So if an instruction takes 16+EA cycles (take AND memory, register) it really takes 12+EA and 4 more to write the byte out. So I could wait 12+EA in EU then before I wait 4 more cycles I first tell the BIU to write out a byte. This would also mean the prefetch activity would more closely match that of the real CPU.

This is also true for instructions that READ data. So "AND memory, register" spends in reality cycles like this: EA + 4 (read byte) + 8 (execute) + 4 (write byte). So my EU has 4 stages now: new_instruction, execution, read_data and write_data. I am hoping to keep the bus busy at correct times with this scheme. Unfortunately this also means that a lot of instruction decodes would have to be rewritten somehow as I used to execute each instruction atomically. 🙁
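The breakdown for "AND memory, register" can be written down as a small Python sketch. The stage names follow the ones in the paragraph above; the separate "ea" entry and the function name are only there for illustration.

```python
def and_mem_reg_stages(ea_clocks):
    """Per-stage cycle budget for AND memory, register (listed as 16+EA)."""
    return [
        ("ea", ea_clocks),     # effective-address calculation
        ("read_data", 4),      # bus cycle that reads the memory operand
        ("execution", 8),      # internal work in the EU
        ("write_data", 4),     # bus cycle that writes the result back
    ]


stages = and_mem_reg_stages(ea_clocks=5)   # e.g. [BX], 5 clocks of EA
total = sum(cycles for _, cycles in stages)
```

Splitting the budget like this leaves the total at 16+EA, but it tells the BIU when the read and write bus cycles start, so prefetching can only use the cycles in between.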

Last edited by vladstamate on 2016-03-08, 16:20. Edited 2 times in total.

Reply 32 of 49, by Scali

Rank l33t

Yea, that sounds more like it!
Indeed, you'd want to emulate the read and write operations of the CPU in the exact CPU-cycle that they occur as well.
Namely, these can be affected by other devices that also use the bus, such as the DMA controller, or the CGA card generating wait states on video memory access.

On eg C64 it is of vital importance that reads and writes occur at the exact cycle, because all IO is memory-mapped. So when you want to perform special VIC-II trickery, the reads and writes of its registers need to be done at the exact cycle. Also, when the VIC-II steals cycles for its sprites or 'bad lines', it also has to steal the right cycles 😀

Reply 33 of 49, by vladstamate

Rank Oldbie

Finally got more implementation done, and this is a trace of my emulator running the first few instructions of the BIOS code after power-on.
Explanations: the first column is the cycle number, the second column is BIU activity (the numbers represent how full the prefetch buffer is) and the third column is EU activity.

Notice how we do not start executing until the prefetch buffer has at least the opcode in it. After the EU gets it, it realizes it needs 4 more bytes of data (since it is a JMP seg:off instruction). But those bytes are not yet in the prefetch buffer, so it waits until they have been fetched, then proceeds to execute. At cycle 21 the prefetch buffer is cleared (since this is a jump). At cycle 37 a new instruction is obtained from the prefetch buffer (this one is a MOV AX, imm16), and because the prefetch buffer already has the imm16, execution proceeds instantly. It takes 4 cycles to execute that (cycles 38 to 41), during which the prefetch fills back up, since there is no other bus activity. At cycle 42 a new opcode is read. And so on.

Not done: DMA refresh, I need to understand how that affects bus activity.

Cycle  BIU             EU
1
2
3
4      prefetch (1/4)  new opcode 0xea
5                      waiting for data
6                      waiting for data
7                      waiting for data
8      prefetch (1/4)  waiting for data
9                      waiting for data
10                     waiting for data
11                     waiting for data
12     prefetch (2/4)  waiting for data
13                     waiting for data
14                     waiting for data
15                     waiting for data
16     prefetch (1/4)  waiting for data
17                     waiting for data
18                     waiting for data
19                     waiting for data
20     prefetch (2/4)  waiting for data
21                     instruction time
22                     instruction time
23                     instruction time
24     prefetch (1/4)  instruction time
25                     instruction time
26                     instruction time
27                     instruction time
28     prefetch (2/4)  instruction time
29                     instruction time
30                     instruction time
31                     instruction time
32     prefetch (3/4)  instruction time
33                     instruction time
34                     instruction time
35                     instruction time
36     prefetch (4/4)  instruction time
37                     new opcode 0xb8
38                     instruction time
39                     instruction time
40     prefetch (4/4)  instruction time
41                     instruction time

Reply 34 of 49, by reenigne

Rank Oldbie

Seems I should pop in here more often - I've been missing an interesting conversation! Glad to see other people are thinking about and working on this stuff.

For comparison, here's what my bus sniffer outputs for a far jump instruction with the prefetch queue initially filled:

20FFF .p... 00F1D FF 00 FC .......
20FFF .p... 00F1D FF 00 FC .......
20FFF Ip... 00F1D FF 00 FC ....... I
20FF8 .C... 00F1D FF 00 FC ....... T1
00F1E SC... 00F1E FF 00 FC ....... T2 S
20F1E SC... 00F1E FF 00 FC ..r.... T3 S
20F00 Sp... 00F1E 00 00 FC ..r.... T4 00 <-f [ 00F1E] S
20F00 .C... 00F1E 00 00 FC ....... T1
00F1F .C... 00F1F 00 00 FC ....... T2
20F1F SC... 00F1F FD 00 FC ..r.... T3 S EA9B05A800 JMP 00A8:059B
20F98 .p... 00F1F 98 00 FC ..r.... T4 98 <-f [ 00F1F]
20F98 .p... 00F1F 98 00 FC .......
20F98 .p... 00F1F 98 00 FC .......
20F98 .p... 00F1F 98 00 FC .......
20F98 Ep... 00F1F 98 00 FC ....... E
20F98 .C... 00F1F 98 00 FC ....... T1
0101B .C... 0101B 98 00 FC ....... T2
2101B .C... 0101B FF 00 FC ..r.... T3
21098 .p... 0101B 98 00 FC ..r.... T4 98 <-f [ 0101B]
21098 .C... 0101B 98 00 FC ....... T1
0101C .C... 0101C 98 00 FC ....... T2
2101C IC... 0101C FF 00 FC ..r.... T3 I 98 CBW
21098 .p... 0101C 98 00 FC ..r.... T4 98 <-f [ 0101C]
21098 .C... 0101C 98 00 FC ....... T1
0101D .C... 0101D 98 00 FC ....... T2
2101D IC... 0101D FF 00 FC ..r.... T3 I 98 CBW

Column 62 shows the prefetch queue operation that is being indicated by the CPU's status pins in that cycle: "E" means queue is emptied, "I" means EU grabbed first byte of instruction from prefetch queue, "S" means EU grabbed subsequent byte of instruction from prefetch queue. You can also see what the bus is doing on each cycle (T1-T4 for each fetch).
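For anyone decoding these signals themselves, here is a minimal Python sketch of the mapping. It assumes the letters correspond to the 8088's QS1/QS0 queue status outputs in maximum mode, with the encoding as I recall it from the Intel datasheet; the function name and the use of an empty string for "no operation" are made up for the example.

```python
# (QS1, QS0) -> letter used in the listing above.
QUEUE_STATUS = {
    (0, 0): "",    # no queue operation in this cycle
    (0, 1): "I",   # first byte of an instruction taken from the queue
    (1, 0): "E",   # queue emptied (e.g. after a jump)
    (1, 1): "S",   # subsequent byte of an instruction taken from the queue
}


def decode_queue_status(qs1, qs0):
    """Translate the two queue status bits into the listing's letter."""
    return QUEUE_STATUS[(qs1, qs0)]


# A 5-byte instruction shows up as one "I" followed by four "S" cycles.
far_jump = [decode_queue_status(*bits) for bits in [(0, 1)] + [(1, 1)] * 4]
```

Counting from one "I" to the next "I" is then a straightforward way to measure how long an instruction really took.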

Btw, if you want to do your own experiments with this you do not need your own PC/XT to do so: just follow the instructions at http://www.vcfed.org/forum/showthread.php?319 … 5963#post385963 .

Reply 35 of 49, by vladstamate

Rank Oldbie

Thank you, reenigne. One takeaway from that is that the EU seems to grab one byte per cycle from the prefetch buffer (the S in your text). However, that seems to be baked into the instruction cycle timing. Counting the cycles it took for you to execute the JMP using a filled up prefetch is 19 cycles (the manual says 16). But I think that is because T3 is when instruction decoding happens.

Reply 36 of 49, by Scali

Rank l33t
vladstamate wrote:

Counting the cycles it took for you to execute the JMP using a filled up prefetch is 19 cycles (the manual says 16).

Yup, this is why I said the timings in the manual can't be used as-is 😀
If only the 8088 were as fast as the timing table indicates. It's the extra cycles that aren't counted in the manual that hurt.
It's nice though, when we made 8088 MPH, a lot of people didn't believe that the 4.77 MHz 8088 was as slow as a 6502 at 1 MHz. Now everyone can see for themselves (for example, a 6502 can do an absolute jmp in 3 cycles, and an indirect one in 5 cycles).
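To put numbers on that comparison, here is a small Python sketch that converts the cycle counts mentioned in this thread into microseconds. The 4.77 MHz figure is taken as the 14.31818 MHz crystal divided by 3, and the 6502 is taken at exactly 1 MHz as stated above; the names are made up for the example.

```python
CPU_8088_HZ = 14_318_180 / 3   # about 4.77 MHz
CPU_6502_HZ = 1_000_000        # 1 MHz, as in the comparison above


def microseconds(cycles, clock_hz):
    """Wall-clock time for a given number of CPU cycles."""
    return cycles / clock_hz * 1e6


far_jmp_measured = microseconds(19, CPU_8088_HZ)   # from the bus sniffer trace
jmp_6502_absolute = microseconds(3, CPU_6502_HZ)
jmp_6502_indirect = microseconds(5, CPU_6502_HZ)
```

With these assumptions the measured 19-cycle far jump takes just under 4 microseconds, which lands between the 6502's 3 microseconds for an absolute jump and 5 microseconds for an indirect one.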

Reply 37 of 49, by reenigne

Rank Oldbie

I don't think there's any sensible way to measure the timing of this instruction that doesn't give 19 cycles on 8088 - if there's an offset between the "I" and when the instruction decoding happens, it'll be the same for all instructions, so will be cancelled out when measuring "I to I". Which manual says 16? The one Scali linked to above says 15, but I think that might be an 8086 timing. The instruction will almost certainly be quicker on 8086 than 8088 because it fits entirely into the prefetch queue on 8086 but not on 8088, though I can only see two cycles in the trace above where the EU could plausibly be waiting for the BIU (the ones between the third and fourth "S").

One way to come up with 15 cycles is to take the 19 cycles and attribute 4 cycles to the fetch of the next instruction (which would already be in the prefetch queue for non-jump instructions).

Reply 38 of 49, by vladstamate

Rank Oldbie
reenigne wrote:

One way to come up with 15 cycles is to take the 19 cycles and attribute 4 cycles to the fetch of the next instruction (which would already be in the prefetch queue for non-jump instructions).

Yes, something like that. The cycles in the manual are intended to account for stuff like that, because a "JMP addr" followed by "MOV AX, imm" together should take the sum of the two instructions' cycles and not the sum + prefetch buffer filling. I think the wording in the Intel manual is quite clear on that (i.e. it is somewhat safe to add cycles for consecutive instructions to get within 5% of real cycles). See page 2-51 (top right) of the Family User's Manual.

Reply 39 of 49, by reenigne

Rank Oldbie

I'm inclined to take that whole manual with a big pinch of salt since the very first entry in the table is wrong: AAA is 9 cycles, not 4 (as is AAS).

Here are some more interesting things that I noticed today:
1) The BIU never seems to be idle for just one cycle. It's almost like it takes a cycle to shut down and then a cycle to start up again.

2) I tried starting an "IN AL,DX" instruction with the prefetch queue in various different states. The number of cycles from the "I" cycle of the instruction to the T1 cycle of the IO was 3 unless the "I" cycle occurred on the T1 of a fetch, the T3 of a fetch or two cycles after the T4 of the previous fetch. In these cases the T1 cycle of the IO was 4 cycles after the "I" cycle of the instruction.

3) Similarly with an "XLATB" instruction - it's normally 7 cycles from the "I" cycle to the T1 of the [BX+AL] memory access, except for three cases. Again these are when the "I" cycle is the T1 or T3 of a fetch or two cycles after the T4 of the last fetch. However, for this instruction these cases are a cycle shorter (6 cycles) instead of a cycle longer!

I don't know why any of these things happen - I can't see any obvious reason for it. There must be a lot of complexity to the interface between the EU and BIU that I don't yet understand.
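Until the mechanism is understood, observations 2 and 3 could be recorded in an emulator as a plain lookup. The Python sketch below does only that; the phase labels ("T1", "T3", "T4+2" for two cycles after the T4 of the previous fetch) and the function name are invented for the example, and the numbers are just the measurements listed above.

```python
# BIU phases at the "I" cycle for which the timing was seen to differ.
SPECIAL_PHASES = {"T1", "T3", "T4+2"}

# instruction -> (usual cycles from "I" to T1 of the access, change in special phases)
OBSERVED = {
    "IN AL,DX": (3, +1),   # one cycle longer in the special cases
    "XLATB": (7, -1),      # one cycle shorter in the special cases
}


def cycles_to_access(instruction, i_phase):
    """Cycles from the "I" cycle to the T1 of the instruction's bus access."""
    usual, change = OBSERVED[instruction]
    return usual + (change if i_phase in SPECIAL_PHASES else 0)
```

For example, `cycles_to_access("IN AL,DX", "T3")` gives 4 and `cycles_to_access("XLATB", "T3")` gives 6, while any other phase gives the usual 3 and 7.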