VOGONS


First post, by superfury

User metadata
Rank l33t++

I want to start implementing 80286 timings in UniPCemu (to make it work better with 80286 timing-dependent software, like the AT BIOS).

I'm currently looking at the tables at the end of this document:
http://www.dmi.unict.it/~santoro/teaching/tfa … intel-80286.pdf

The strange thing is that some entries give more than the two timings described at the start of the document. There is also a "comments" column, but the document doesn't seem to explain what that column is used for. Can anyone explain these things to me?

Author of the UniPCemu emulator.
UniPCemu Git repository
UniPCemu for Android, Windows, PSP, Vita and Switch on itch.io

Reply 3 of 15, by superfury

Looking at it again, I notice that the numbers mentioned in the "comments" column refer to the numbered comments on page 3-46 that apply to the instructions, not to cycle counts (like you said)? Then the clock counts can be used directly for implementing the timings :D Nothing seems to be said about exceptions, though. Do those just use the INT n instruction timings?

Reply 4 of 15, by reenigne

User metadata
Rank Oldbie

I think the timings in that document are meant to be used to get a rough idea of how fast your inner loop will run, and to aid in writing optimized code (or optimizing compilers) - it's not meant to have enough detail to write a cycle-exact emulator. The details of the timings of prefetch queue interactions, external interrupts and so on are probably (as with 8088/8086) undocumented. The timing of INT is probably a reasonable approximation to that of an exception, though.

Reply 5 of 15, by superfury

I've converted the entire table into a big array containing all those instructions. Its entries are applied as filters and as alternative versions (alternatives have higher priority than the normal/real mode counterparts). The filters are simply used to make a rule apply to a range of instructions, by masking bits off before comparing the opcodes. This is done for e.g. the segment prefixes (middle 2 bits masked off), as well as for byte/word variants with the same timing (low 1 or 2 bits masked off, indicating word and order, like with opcodes 00-03).
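The mask-before-compare filter described above could look roughly like this. It's only a sketch: `OpcodeFilter`, `filter_matches` and the two example rules are made-up names, not the actual UniPCemu code, and the mask values are worked out from the opcode encodings (26h/2Eh/36h/3Eh for the segment prefixes, 00h-03h for the ADD modr/m forms).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified rule; the real table entry has more fields. */
typedef struct {
    uint8_t opcode;      /* opcode value the rule is written for */
    uint8_t opcode_mask; /* bits of the opcode that take part in the comparison */
} OpcodeFilter;

/* True when the fetched opcode falls inside the range the rule covers. */
static bool filter_matches(const OpcodeFilter *f, uint8_t fetched)
{
    return (fetched & f->opcode_mask) == (f->opcode & f->opcode_mask);
}

/* Segment override prefixes 26h/2Eh/36h/3Eh differ only in bits 3-4. */
static const OpcodeFilter segment_prefix_rule = { 0x26, 0xE7 };

/* Opcodes 00h-03h differ only in bits 0-1 (word and direction). */
static const OpcodeFilter add_modrm_rule = { 0x00, 0xFC };
```

With these masks, 2Eh and 3Eh hit the segment prefix rule while 27h (DAA) does not, and 03h hits the ADD rule while 04h does not.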

Now I still need to figure out how to apply the entire table to the instructions in some sort of big instruction table, while not making it too large. ENTER has about 256 sub-possibilities, most instructions only have one possibility, and the various interrupt-related instructions have about 5 possibilities. This would currently require a huge lookup table of 256 (instructions) * 2 (0F opcode) * 9 (ModR/M variant instruction) * 256 (maximum possibilities of cycle calculations) * 2 (Real/Protected mode) * 3 (actual cycle data). Thus about 6-7 MB for storing the entire table uncompressed. The current form is the amount of instructions and variants listed (including alternatives, like INT timings), with 11 bytes for each 'compressed' (by bitmask and parameters) entry.
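To double-check the size estimate, the product can be written out. This is just the arithmetic from the paragraph above; the enum names are made up for illustration, and one byte per stored value is an assumption.

```c
#include <stddef.h>

/* Dimensions as listed in the post. One byte per stored value is assumed. */
enum {
    NUM_OPCODES       = 256, /* instruction byte */
    NUM_0F            = 2,   /* with or without the 0F prefix byte */
    NUM_MODRM         = 9,   /* ModR/M variants */
    NUM_POSSIBILITIES = 256, /* worst case number of cycle calculations */
    NUM_MODES         = 2,   /* real or protected mode */
    NUM_CYCLE_FIELDS  = 3    /* actual cycle data */
};

/* Size in bytes of the fully expanded (uncompressed) table. */
static size_t uncompressed_table_bytes(void)
{
    return (size_t)NUM_OPCODES * NUM_0F * NUM_MODRM *
           NUM_POSSIBILITIES * NUM_MODES * NUM_CYCLE_FIELDS;
}
```

That comes out at 7,077,888 bytes, or 6.75 MiB, which matches the "about 6-7 MB" figure.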

https://bitbucket.org/superfury/unipcemu/src/ … ngs.c?at=master

Also, looking at the * information explanation, does this mean memory accesses (both byte and word-sized) take 1 cycle? Would this apply to the prefetch unit as well? Also, since the * is mentioned separately, the memory cycles would still need to be added to the cycles mentioned? Probably as many memory cycles as are used by the instruction: large instructions like LGDT/LIDT would probably take 3 cycles to fetch the descriptor table register from memory (16 bits for the limit, 16 bits for base low, 8 bits for base high, or 16 bits for base high with the high 8 bits being zeroed in the case of SIDT/SGDT?). So that would mean 2 cycles are already included (overlapped?), and the final cycle is the third one?

Reply 6 of 15, by reenigne

superfury wrote:

Thus about 6-7 MB for storing the entire table uncompressed. The current form is the amount of instructions and variants listed(including alternatives, like INT timings), with 11 bytes for each 'compressed'(by bitmask and parameters) entry.

Make sure that you're not trashing your CPU caches with big data tables - on modern CPUs it's often faster to calculate something like that with code instead of looking it up in a table just because of cache effects.

superfury wrote:

Also, looking at the * information explanation, does this mean memory accesses(both byte and word-sized) take 1 cycle?

I think I recall reading somewhere that a normal memory access on 286 is 3 cycles, but I might be wrong about that. It's a 16-bit bus, so an aligned word access should be the same as a byte access unless it's an 8-bit device that you're accessing (in which case a word access would need to be split into 2 byte accesses).

superfury wrote:

Would this apply to the prefetch unit as well?

Yes, from the point of view of everything outside the CPU (including all the bus/timing logic) there is no difference between a prefetch access and a non-prefetch access.

superfury wrote:

Also, since the * is mentioned separately, the memory cycles would still need to be added to the cycles mentioned?

No, I think the 286 has an EU and a BIU which run concurrently, like the 8086. The timings in that document are best-case EU-only timings (it says "Assumptions [...] 1. The instruction has been prefetched, decoded, and is ready for execution.").

Reply 7 of 15, by superfury

I've modified the MMU timings on the 80286 according to your 3 cycles per read/write. It will add those cycles when:
- A byte access, or the first byte of a word access, is performed. This address is also stored as the origin address. These cycles are always added.
- The second byte of a word access lies in a different aligned word than the first byte. This is checked by comparing the current physical address (x+1) with the origin address (see above), masking off bit 0 in both addresses. If the resulting addresses don't match (word not on a word boundary), it adds another 3 cycles. Otherwise, nothing is added.

This ensures that:
- Accessing byte X adds 3 cycles.
- Accessing word-aligned word X (bits 1-23 of the first and second byte addresses match) adds 3 cycles.
- Accessing non-word-aligned word X (bits 1-23 of the first and second byte addresses don't match) adds 6 cycles.
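Written out as a small function, the rule above looks something like this. It's a sketch, not the emulator's MMU code: `access_clocks` and `BUS_CYCLE_CLOCKS` are made-up names, and the 3 clocks per bus cycle is the figure used at this point in the thread.

```c
#include <assert.h>
#include <stdint.h>

#define BUS_CYCLE_CLOCKS 3u /* clocks per bus cycle, as assumed in this post */

/* Clocks added for one byte (size_bytes == 1) or word (size_bytes == 2) access. */
static unsigned access_clocks(uint32_t phys_addr, unsigned size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2);

    /* The first (or only) byte always costs one bus cycle. */
    unsigned clocks = BUS_CYCLE_CLOCKS;

    if (size_bytes == 2) {
        /* Compare both byte addresses with bit 0 masked off. */
        uint32_t origin = phys_addr & ~(uint32_t)1;
        uint32_t second = (phys_addr + 1) & ~(uint32_t)1;
        if (second != origin)
            clocks += BUS_CYCLE_CLOCKS; /* word straddles a word boundary */
    }
    return clocks;
}
```

So a byte at any address and a word at an even address cost 3 clocks, while a word at an odd address costs 6.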

I've created a little lookup table that is used to index into the full lookup table:

	//80286 timing support for lookup tables!
word timing286lookup[2][0x100][9][8]; //2 0F possibilities, 256 instructions, 9 modr/m variants, no more than 8 possibilities for every instruction. About 73K memory consumed(unaligned).

So the CPU can enter the 0F bit, instruction byte and modr/m byte into the first 3 fields to select the instruction used. An instruction consists of 0F (optional), the instruction byte and the modr/m byte (optional, only used with 0F protected mode instructions (the /x instructions, like LMSW, SMSW etc.) and GRP instructions (GRP1, GRP2, GRP3a, GRP3b, GRP4, GRP5 opcodes)). That way it can loop through the final 8 entries using the collected information, and apply the correct timings according to the settings in those 8 entries. When fewer than 8 possibilities are used, the remaining possibilities point to the 'default' handler (the lowest priority handler).

This is the current structure that holds each entry in the main table:

//Essentially, each instruction is expressed as a 0F,OPcode,modr/m set. This specifies the entries that apply (based on the 0F opcode used, the opcode executed and the modr/m reg field when modrm_reg!=0).
//Next, bits 1-5 of addclock specify different kinds of filters that select the variant to use, if specified. Variants have priority over the non-variants (bits 1-5==0).

typedef struct
{
	byte CPU; //For what CPU(286 relative)? 0=286, 1=386, 2=486, 3=586(Pentium) etc.
	byte is0F; //Are we an extended instruction(0F instruction)?
	byte OPcode; //The opcode to be applied to!
	byte OPcodemask; //The mask to be applied to the original opcode to match this opcode in order to be applied!
	byte modrm_reg; //>0: Subtract 1 for the modr/m reg requirement. Else no modr/m is looked at!
	struct
	{
		struct
		{
			word basetiming; //The base cycle count of this variant.
			word n; //With RO*/SH*/SAR: the amount of bits actually shifted. With string instructions: added to the base count with a multiplier(number of repeats after the first instruction).
			byte addclock; //Flags and filters, one per bit:
			//bit 0=Add one clock if we're using 3 memory operands!
			//bit 1=n is the count to add for string instructions(every repeat). This variant is only used with string instructions.
			//bit 2=We depend on the gate used. The gate type we're for is specified in the low 4 bits of n.
			//  The upper 2 bits(bits 4-5) of n specify: 0=Ignore privilege level/parameters in the cycle calculation, 1=Same privilege level Call gate, 2=Different privilege level Call gate, no parameters, 3=Different privilege level, X parameters.
			//bit 3=This rule only fires when the jump is taken.
			//bit 4=This rule fires only when the L value of the ENTER instruction matches and fits in the lowest bit of n.
			//bit 5=This rule fires only when the L value of the ENTER instruction doesn't fit in 1 bit.
			//Setting addclock bit 2, n lower bits to call gate and n higher bits to 2 adds 4 cycles for each parameter on a 80286.
			//With addclock bit 4, n is the L value to be specified. With addclock bit 5, (L - 1) is multiplied with the n value and added to the base count cycles.
		} ismemory[2]; //First entry is register value(modr/m register-register), Second entry is memory value(modr/m register-memory)
	} CPUmode[2]; //0=Real mode, 1=Protected mode
} CPUPM_Timings;

The priorities of the 8 possibilities are as follows:
- First, the entries having any of bits 1-5 set (special filters).
- Finally, the entry having none of bits 1-5 set (default handler).

The CPU execution unit can then simply look up the ordered items and process their filters as appropriate. When any filter matches, it stops searching and uses the specified real-mode or protected-mode timings. Whether the register or memory timings are used depends on the modr/m byte when one is present; non-modr/m instructions default to the register timings. Bit 0 is used to add 1 cycle when 3 modr/m parameters are added (e.g. [BX+SI+DISP8]).
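A stripped-down version of that priority scan, for illustration only. `Variant`, `ExecState`, `variant_applies` and `pick_clocks` are made-up names, only the "jump taken" filter (bit 3) and the extra clock (bit 0) are handled, and the real/protected and register/memory selection is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ADDCLOCK_3OPERANDS 0x01u /* bit 0: one extra clock for base+index+displacement */
#define ADDCLOCK_JUMPTAKEN 0x08u /* bit 3: rule only applies when the jump is taken */

/* Hypothetical, reduced variant entry. */
typedef struct {
    uint16_t basetiming;
    uint8_t  addclock;
} Variant;

/* Hypothetical snapshot of what the execution unit knows about the instruction. */
typedef struct {
    bool jump_taken;
    bool three_part_address; /* e.g. [BX+SI+DISP8] */
} ExecState;

static bool variant_applies(const Variant *v, const ExecState *s)
{
    if ((v->addclock & ADDCLOCK_JUMPTAKEN) && !s->jump_taken)
        return false;
    return true; /* other filters (gates, ENTER, strings) omitted in this sketch */
}

/* variants[] is ordered: special filters first, the default entry last. */
static unsigned pick_clocks(const Variant *variants, size_t count, const ExecState *s)
{
    for (size_t i = 0; i < count; i++) {
        const Variant *v = &variants[i];
        if (!variant_applies(v, s))
            continue;
        unsigned clocks = v->basetiming;
        if ((v->addclock & ADDCLOCK_3OPERANDS) && s->three_part_address)
            clocks += 1;
        return clocks; /* first match wins */
    }
    return 0; /* nothing matched; the caller falls back to a default timing */
}
```

Because the special filters come first in the list, a taken jump picks up its own entry before the scan ever reaches the default one.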

Edit: I've extended the pointer lookup table a bit:

	//80286 timing support for lookup tables!
word timing286lookup[2][2][2][0x100][9][8]; //2 modes, 2 memory modes, 2 0F possibilities, 256 instructions, 9 modr/m variants, no more than 8 possibilities for every instruction. About 295K memory consumed(unaligned).

The mode is simply real or protected mode; the memory mode is register or memory; then whether 0F is used or not; the instruction opcode byte itself; the modr/m variant (0 when no modr/m is used(ignored), otherwise one of the 8 modr/m variants); and finally the 8 different variations available for every instruction (which are based on the earlier filters).

Each entry points to the entry in the table + 1. The value zero is reserved for unused entries, which are to be ignored. In that case the default timing will be applied, which depends on the CPU: the 80(1/2)86 default to 8086 timings, while the 80386+ currently uses the general timing (8 cycles for every instruction that's undefined), just to use something at all. It cannot be 0 cycles, as this can spin the CPU in an infinite loop, since the CPU times realtime using this number. If no time is added, the core loop will never finish, because it never reaches the current time (xns + 0ns < currenttimetoexecute, which will always be true).
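The "index + 1, zero means unused" encoding can be sketched like this. `TimingEntry`, `resolve_clocks` and `DEFAULT_CLOCKS` are made-up names; the value 8 is the general fallback mentioned above, used here for every CPU to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_CLOCKS 8u /* non-zero fallback, so emulated time always advances */

/* Hypothetical, reduced main-table entry. */
typedef struct {
    uint16_t clocks;
} TimingEntry;

/* stored is a value from the pointer lookup table: 0 = unused, otherwise index + 1. */
static unsigned resolve_clocks(uint16_t stored, const TimingEntry *table, size_t table_len)
{
    if (stored == 0)
        return DEFAULT_CLOCKS; /* unused slot: apply the default timing */
    assert((size_t)stored - 1 < table_len);
    return table[stored - 1].clocks;
}
```

Reserving zero means a freshly zeroed lookup table is automatically "all unused", and the fallback never returns 0 clocks, which is what avoids the stuck core loop.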

Reply 8 of 15, by superfury

I've implemented it and fixed the bugs that were left in the IBM PC AT 80286 (running at 6MHz exactly). After fixing some bugs (the E9 JMP instruction using ED timings, and the table being interpreted incorrectly (the modr/m reg filter is the reg value+1, while 0 ignores the reg value)) and implementing the remaining requirements (call and interrupt gate detection), the AT BIOS now suddenly stops at step 11h?

Looking at the source code of the BIOS once again: it's the check that verifies the speed/clock refresh rates, which fails because the speed isn't correct (enough)?

Reply 9 of 15, by superfury

Does anyone know how many clocks (in 4.77MHz clocks) the IBM PC XT/AT 8237 DMA Controller needs to transfer 1 byte of data? It used to be set to 10 cycles per byte, but it seems the IBM PC AT BIOS doesn't like that speed (with the current CPU timings reporting a 0xF409 speed with 3 cycles per byte, while it requires the result to be in the range of about F952 +/- 10%).

So there seems to be a problem with the PIT, since it's the one controlling the toggle? Is the clock speed of 6MHz correct?

Edit: Applying the memory read/write cycles (0, 3 and 6 as specified earlier) as well as the prefetch cycles (3 cycles for each byte read) seems to increase the count to F903, which is within the 10% range required by the BIOS. Does anyone know what else needs to be added? Applying the added 1 cycle mentioned in the manual (with the * comment for cycles) doesn't seem to change this value at all?

Last edited by superfury on 2016-10-20, 19:37. Edited 1 time in total.

Reply 10 of 15, by reenigne

superfury wrote:

Does anyone know how many clocks (in 4.77MHz clocks) the IBM PC XT/AT 8237 DMA Controller needs to transfer 1 byte of data?

It's normally 4 on PC/XT. There are some flags in the DMAC to make it 3 (or even 2 during a burst when the high byte of the address doesn't change) but I haven't actually tried this so I'm not sure if it works in practice.

Reply 11 of 15, by superfury

Since the latest modifications allowed the IBM AT BIOS to boot again, it seems the speed is indeed reasonably close. The strange thing, however, is that MS-DOS 5.0 still complains about a "Divide overflow" for everything that's run (dir, programs). Entering an invalid command (program) tells me it didn't find the file(???). Trying to run the AT BIOS diagnostics disk from the hard disk clears the screen (with the cursor in the top left) and hangs the CPU? Trying to boot the 360KB diagnostic disk gives me an error message that the disk is unbootable?

Reply 12 of 15, by superfury

I've changed the CPU speed to 8MHz, as seems to be the case according to the MIPS and some documentation (some say 6MHz instead?). This results in CX=F681 instead of the F8A7 minimum required by the IBM AT revision 2 BIOS? Apparently it's one of those two revision 3 motherboards I need for that, but it seems to use something called one waitstate RAM? How does this one-waitstate RAM affect memory cycles?

Does this simply mean: add one cycle for every byte/word/unaligned word half access?

Adding one cycle for each of those byte/word/unaligned word half accesses increases the CX value to F785, which is still out of range?

Edit: The new count with wait-state memory is F7B5, not F785. But it's still too low for the IBM PC AT 6MHz BIOS?

Edit: Just tried again with the IBM AT 8MHz revision 3 type 1 BIOS. That one checks out properly (within range, just a bit faster (higher CX) than what its minimum required memory refresh speed needs).

Reply 13 of 15, by Alegend45

User metadata
Rank Newbie

reenigne is partially right about the bus cycles of the 286. They're actually 2 cycles + wait states. On the AT, there is one wait state, so each bus cycle is 3 cycles. Also, some ATs were 6 MHz, and some were 8 MHz.
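As a quick illustration of "2 cycles + wait states", here is the arithmetic in code form. The function names are made up for this example; the only inputs are the figures from the post (2 base clocks, 1 wait state on the AT, 6 or 8 MHz).

```c
#define BASE_BUS_CLOCKS 2u /* clocks per 286 bus cycle without wait states */

/* CPU clocks taken by one bus cycle with the given number of wait states. */
static unsigned bus_cycle_clocks(unsigned wait_states)
{
    return BASE_BUS_CLOCKS + wait_states;
}

/* Duration of one bus cycle in nanoseconds at the given CPU clock (in Hz). */
static double bus_cycle_ns(unsigned wait_states, double cpu_hz)
{
    return (double)bus_cycle_clocks(wait_states) * 1e9 / cpu_hz;
}
```

With one wait state that gives 3 clocks per bus cycle, i.e. 500 ns at 6 MHz and 375 ns at 8 MHz.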

Reply 14 of 15, by superfury

The current emulation emulates a pretty cycle-accurate 80286 at 8MHz (tested using the first revision 3 BIOS from minuszerodegrees). It uses 3 cycles per bus cycle. The BIOS seems to work without many problems (other than hardware problems being reported due to inaccurate hardware itself, such as the untimed floppy disk and the keyboard not being accurate enough).

Reply 15 of 15, by superfury

It seems that it was using 3 cycles + 1 waitstate memory accesses. I've just modified this to actually be 2 cycles + 1 waitstate. Now the CPU runs a bit faster.

I've just tried booting Windows 3.0 again on my 80286 emulation. Now, instead of ending up at opcode 66h instructions, it ends up executing 'NULL' memory (memory that's zeroed) when I fire up the debugger after the Windows boot process stops responding.

Edit: Just tried it again. The debugger says the last opcode was 0xF3(REP/REPZ)?

Edit: It seems to be executing 00h opcodes after looking at the program executing through the Visual C++ debugger. One strange thing: CS:IP seems to be unchanging?
