Raster bar implementation on Amstrad PC1512 \ VOGONS

Reply 1 of 26, by reenigne

Posted on 2020-02-17, 11:00

Rank: Oldbie
Posts: 661
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

This is really cool! I didn't know that the display enable bit was replaced with a toggle in the PC1512. I was surprised by this at first but it actually makes sense. The purpose of the display enable bit was to avoid snow, and the PC1512's display adapter doesn't suffer from snow. So you want to be compatible with snow-avoiding software while making that software run as fast as possible. Returning a constant value would stall software that waits for the other value. Toggling the bit fixes that problem while also being much faster than implementing display enable properly.
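
To illustrate the stall argument, here is a small simulation of a typical snow-avoidance wait loop (wait for the bit to read 0, then wait for it to read 1). The names `toggling_bit`, `constant_bit` and `reads_until_edge` are hypothetical, and the assumption that the bit flips on every single read is a simplification for illustration, not a verified detail of the PC1512 hardware.

```python
from itertools import count, islice

def toggling_bit(start=0):
    """Hypothetical model: the status bit flips on every read."""
    for n in count():
        yield (start + n) & 1

def constant_bit(value=0):
    """Hypothetical model: the status bit always reads the same value."""
    while True:
        yield value

def reads_until_edge(bit_source, max_reads=1000):
    """Emulate 'wait for 0, then wait for 1'.

    Returns the number of reads used, or None if the loop would stall
    (no 0-then-1 sequence seen within max_reads reads).
    """
    reads = 0
    waiting_for = 0
    for bit in islice(bit_source, max_reads):
        reads += 1
        if bit == waiting_for:
            if waiting_for == 1:
                return reads
            waiting_for = 1
    return None

print(reads_until_edge(toggling_bit(0)))  # finishes after a couple of reads
print(reads_until_edge(toggling_bit(1)))  # costs one extra read at most
print(reads_until_edge(constant_bit(0)))  # None: the loop never sees a 1
```

With a toggling bit the loop always finishes within three reads, whereas either constant value leaves one of the two waits spinning forever.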

The hsync wait state is really interesting too - I'm not sure why they did that. I wondered if it might be to do with DRAM refresh but the CRTC's accesses to VDU RAM should take care of that the same way that it does on the CGA. Anyway, brilliant that you managed to utilise it to write self-synchronising raster-synchronised code.

However - I think there is an easier way that you are missing. I haven't tried it, but I can't think of any reason why it wouldn't work. The trick is noticing that the CGA clocks and the PIT are both driven from the same 14.318MHz crystal (in the IBM PC everything is driven from this crystal but on the PC1512 the CPU is driven from a separate 24MHz crystal, divided by 3 to get the CPU clock). If you program the PIT to generate IRQ0 (interrupt 8) every 76 PIT cycles, you will get an interrupt at the same horizontal position on each scanline. Now, the interrupt overhead is quite high so that doesn't leave a lot of room for other code to run but you don't need to do very much - just restore the stack, re-enable interrupts, acknowledge the interrupt, output the new background colour to port 0x3d9 and do a HLT so that the next interrupt occurs without jitter. You can do all that on a 4.77MHz 8088 so an 8MHz 8086 shouldn't even work up a sweat.
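
The figure of 76 PIT cycles can be checked with a rough sketch like the one below. It assumes the standard CGA line length of 912 dots at 14.318MHz and the usual PC arrangement where the PIT runs at the crystal frequency divided by 12; the constant names are made up for illustration.

```python
from fractions import Fraction

CRYSTAL_HZ = Fraction(14_318_180)  # 14.318 MHz reference shared by CGA and PIT
PIT_DIVIDER = 12                   # PIT input clock = crystal / 12 (about 1.193 MHz)
HDOTS_PER_SCANLINE = 912           # assumed standard CGA line length in 14.318 MHz dots
CPU_HZ = Fraction(8_000_000)       # PC1512 CPU clock: 24 MHz crystal / 3

pit_count_per_scanline = Fraction(HDOTS_PER_SCANLINE, PIT_DIVIDER)
scanline_seconds = HDOTS_PER_SCANLINE / CRYSTAL_HZ
cpu_cycles_per_scanline = scanline_seconds * CPU_HZ

print(pit_count_per_scanline)                     # 76
print(round(float(scanline_seconds) * 1e6, 2))    # microseconds per scanline
print(round(float(cpu_cycles_per_scanline), 1))   # CPU cycles per scanline
```

Because 912 divides exactly by 12, the PIT period is a whole number of counts per scanline, which is what keeps the interrupt at a fixed horizontal position.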

You still have to synchronise the effect with the vertical raster but you can do that by waiting for the vsync before you set up your interrupt. There is a period of time between detecting the vsync and completing the PIT setup that depends on the CPU speed (so you need to tune it for this machine if you want the palette changes to reliably occur during horizontal retrace) but other than that the code should be portable to pretty much any CGA clone (as long as it's fast enough to run your interrupt routine within those 76 PIT cycles).

For bonus points, note that you can reprogram the PIT period without resetting it (and the new count will take effect after the following interrupt). That means that with care you can get interrupts only on the scanlines where you actually want to change the palette register. Then your raster bars can be done in the background using very little CPU time, while you do other things in the foreground. Though that does mean that you will get jitter, which may mean the palette changes occur in the active area (especially when using long-running instructions like MUL and DIV in the foreground thread).
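
As a sketch of the bookkeeping this involves: because a newly written count only takes effect once the currently running period has expired, the handler for one target scanline has to write the gap that follows the next target, not its own. The helper name `reload_schedule` and the 262-line frame are assumptions for illustration, not something tested on the PC1512.

```python
PIT_COUNTS_PER_SCANLINE = 76  # one scanline is 76 PIT cycles
LINES_PER_FRAME = 262         # assumption: standard CGA frame height in scanlines

def reload_schedule(scanlines):
    """For sorted target scanlines, return (line, count_to_write) pairs.

    The count written in the handler at target i is assumed to be latched
    only when the current period ends (at target i+1), so it must equal the
    gap between targets i+1 and i+2. Targets wrap around from one frame to
    the next.
    """
    n = len(scanlines)
    # Gap in scanlines from each target to the next one, in the range 1..262.
    gaps = [
        ((scanlines[(i + 1) % n] - scanlines[i] - 1) % LINES_PER_FRAME) + 1
        for i in range(n)
    ]
    return [
        (scanlines[i], gaps[(i + 1) % n] * PIT_COUNTS_PER_SCANLINE)
        for i in range(n)
    ]

for line, count_value in reload_schedule([50, 60, 100]):
    print(f"handler at scanline {line}: write PIT count {count_value}")
```

Every count here stays below 65536, so a whole frame's gap still fits in the PIT's 16-bit counter.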

Reply 2 of 26, by wbhart

Posted on 2020-02-17, 17:34

Rank: Newbie
Posts: 80
Joined: 2019-08-11, 11:00

What you say about the toggle bit is almost certainly the case. That all occurred to me too. I was going to mention it in the video, but it started to get a bit technical.

I did actually at some point consider using the PIT. As I mentioned in the video, I actually got into this by trying to optimise the amount of work that could be done between writes to CGA memory, and it kind of evolved from there.

I later realised the PIT could work, but I think interrupts need to be on for that, right? I reasoned that this might just cause too much jitter. There's already quite a bit of jitter, which is barely hidden by the horizontal retrace time. Needless to say, I have interrupts off for this code.

Actually, if you look at the very last effect in the video, you can see I didn't get it quite right and there is actually quite a bit of tearing in the rasters on the right hand side. I only noticed after YouTube had tried to upload the video three times, so I didn't go back to fix that.

But you could be right. The PIT might work just fine.

That final trick is a good one though. That might be enough to make it work reliably, and would certainly be better if you wanted to do something else in the foreground (which you more or less can't currently).

YouTube Channel - PCRetroTech

Reply 3 of 26, by reenigne

Posted on 2020-02-17, 17:43

wbhart wrote on 2020-02-17, 17:34:

I did consider using the PIT. As I mentioned in the video, I actually got into this by trying to optimise the amount of work that could be done between writes to CGA memory. I later realised the PIT could work, but I think interrupts need to be on for that right? I reasoned that this would just cause too much jitter. There's already quite a bit of jitter, which is barely hidden by the horizontal retrace time. Needless to say, I have interrupts off for this code.

If nothing else is running (interrupt routine ends with HLT) then the jitter should be no more than a couple of pixels (i.e. one CPU cycle). Otherwise the maximum jitter is the running length of the longest running instruction (or sequence of instructions that run with the interrupts off).
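
To put rough numbers on that, the sketch below converts an interrupt latency in CPU cycles into CGA hdots. The MUL and DIV cycle counts are the commonly documented 8086 upper bounds for 16-bit register operands and are included only as illustrative assumptions.

```python
CPU_HZ = 8_000_000     # PC1512 CPU clock
HDOT_HZ = 14_318_180   # CGA dot clock

def cycles_to_hdots(cpu_cycles):
    """Convert a latency in CPU cycles to CGA hdots (high-res pixels)."""
    return cpu_cycles * HDOT_HZ / CPU_HZ

# Illustrative worst-case latencies before the interrupt can be taken.
for name, cycles in [("HLT idle", 1), ("MUL r16", 133), ("DIV r16", 162)]:
    print(f"{name}: {cycles} cycles ~ {cycles_to_hdots(cycles):.1f} hdots")
```

A scanline is 912 hdots, so a foreground DIV could shift the palette write by a sizeable fraction of a line.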

If other interrupts are a concern, you can mask off everything but IRQ0 in the PIC's IMR.
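
A minimal sketch of the mask value, assuming the usual PC layout where the master 8259A's IMR sits at port 0x21 and a set bit disables the corresponding IRQ line. The helper name `imr_allowing` is hypothetical.

```python
PIC1_DATA_PORT = 0x21  # IMR of the master 8259A on PC-compatible machines

def imr_allowing(*irqs):
    """Return the IMR byte that masks every IRQ line except those listed."""
    mask = 0xFF
    for irq in irqs:
        mask &= ~(1 << irq) & 0xFF
    return mask

print(hex(imr_allowing(0)))  # 0xfe
```

The resulting byte would be written to `PIC1_DATA_PORT` before starting the effect, and the previous IMR value restored afterwards.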

Reply 4 of 26, by wbhart

Posted on 2020-02-18, 00:31

It does sound like it is worth giving a try. Hopefully I'll find some time at some point.

First I want to investigate whether patterns that are longer than a single raster might be better than what I have. I'm not sure how to check that just yet, but I have a strategy that might work.

Some of the numbers still don't add up. In particular, the figures in the reference manual still seem a little hard to reconcile with reality. What I'm concerned about is that I might have a pattern which matches well enough with every raster, but which is far from optimal due to a longer pattern with much wider gaps that only occur every third raster, say.

7 stosws + 21 nops seems like it should be less than 509.6 cycles (even with all the conversion costs, the 8 bit bus at 4MHz and the CGA wait states), especially given that the nops should be happening during the wait states. And stosb's only seem to make things worse, which seems to prove the stosw's are not concealing long strings of wait states.

Once I get to the bottom of that, I might return to the raster bars again. Getting four raster bars seems like a worthy challenge.

I had also considered that maybe the 46 cycles of wait states might be due to DRAM refresh on the CGA card. But I also don't understand why this would be.

Reply 5 of 26, by reenigne

Posted on 2020-02-18, 20:08

Ah, I think I've figured it out. The VRAM in the PC1512 consists of two 41264-15 chips, each 64k x 4 bits. This is dual-port RAM designed for video buffers, with a serial output for the raster as well as random input/output for the CPU. The refresh cycle of the RAM is 4ms (just under 63 scanlines), during which all 256 rows need to be refreshed. Because the serial output outputs one row at a time, it doesn't naturally access the memory locations in a refresh-friendly order. So the VDU needs to access at least 4.08 rows per scanline to prevent data loss. Makes sense to do those as 5 consecutive accesses per scanline. The PC1512's bus cycle for 8-bit accesses (such as VRAM) is 1uS (according to the table in section 1.1 of https://www.seasip.info/AmstradXT/1512tech/section1.html) or 8 cycles. 5 refresh accesses of 8 cycles is 40 cycles. The other 6 might be accounted for by initialising the VRAM's serial output buffers.
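
The refresh arithmetic above can be reproduced with a short sketch. The scanline time assumes a standard 912-dot CGA line; the 46-cycle figure is the worst case from the Amstrad manual, and the variable names are just for illustration.

```python
import math

REFRESH_PERIOD_S = 4e-3         # all rows must be refreshed within 4 ms
ROWS = 256
SCANLINE_S = 912 / 14_318_180   # assumed CGA line time, about 63.7 us
BUS_CYCLE_CPU_CYCLES = 8        # 1 us 8-bit bus cycle at 8 MHz
DOCUMENTED_WAIT = 46            # worst-case wait from the Amstrad manual

scanlines_per_refresh = REFRESH_PERIOD_S / SCANLINE_S
rows_per_scanline = ROWS / scanlines_per_refresh
accesses = math.ceil(rows_per_scanline)          # whole accesses per scanline
refresh_cycles = accesses * BUS_CYCLE_CPU_CYCLES

print(round(scanlines_per_refresh, 1))   # scanlines per refresh period
print(round(rows_per_scanline, 2))       # rows that must be touched per scanline
print(accesses, refresh_cycles, DOCUMENTED_WAIT - refresh_cycles)
```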

As for 7 STOSWs per scanline: On an 8088 a STOSW is 16 cycles. However, when the PC1512 is writing to VRAM the accesses are 8-bit and there will therefore be an additional 4 cycles of wait state per byte (1uS or 8 cycle access time again). That gives a total of 24 cycles per STOSW. So there must be some additional wait states in accessing VRAM. I wonder if it might be to do with the way that data is laid out in VRAM. If a single VRAM row+column address corresponds to 1 hdot then transferring a byte between VRAM and the CPU might translate into 8 VRAM accesses.
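
A quick sketch of why 24 cycles per STOSW still leaves a gap: seven of them account for only about a third of a scanline. The 509.6-cycle scanline is the figure used earlier in the thread; the constant names are illustrative.

```python
STOSW_8088_CYCLES = 16       # measured 8088 figure for back-to-back STOSW
WAIT_PER_BYTE = 4            # extra wait per byte on the 1 us (8-cycle) bus
BYTES_PER_WORD = 2
STOSW_PER_SCANLINE = 7
SCANLINE_CPU_CYCLES = 509.6  # CPU cycles per raster at 8 MHz

stosw_cycles = STOSW_8088_CYCLES + BYTES_PER_WORD * WAIT_PER_BYTE
used = STOSW_PER_SCANLINE * stosw_cycles
unexplained = SCANLINE_CPU_CYCLES - used

print(stosw_cycles, used, round(unexplained, 1))
```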

Reply 6 of 26, by wbhart

Posted on 2020-02-18, 20:49

Oh wow, that is really helpful! I had already concluded STOSW was probably taking 24 cycles, but for different reasons to you.

I had 11 CPU cycles for STOSW without a REP prefix, not 16. There could be another 4 cycles for prefetch, presumably. But I suspect there isn't much going on on the CPU bus, so I had concluded that adding this would be double counting. Instead, I took 24 cycles because the 16 bit access has to be broken into two 8 bit accesses. I also added 2 cycles for each STOSW on average for unfilled gaps (there are no assembly instructions that execute in less than 4 cycles including prefetch I think).

The only other thing I thought could be relevant is that in writing to video RAM it writes to four bytes simultaneously (there's 64 KB in four 16 KB bitplanes in mode 2, though in graphics mode 1 only 16 KB of the memory is needed, so the other planes just get the same data written into them). On the other hand, the CPU is supposed to be able to write to all planes simultaneously, so it doesn't quite seem like an explanation.

On the other hand, someone said (not authoritatively) that they believe writing to 4 planes in mode 2 is four times slower than writing to ordinary CGA memory. I haven't checked to see if that is true though. It's odd that the manual doesn't give more detail. There's just enough detail to be useless.

Something I am quite confused about, which you might be able to help me understand is this:

* the manual mentions conversions from 16 bit to 8 bit: are these costs always incurred, or only for stosw (which is a 16 bit access), as opposed to stosb (which is 8 bit)
* when are the 8 bit memory access costs incurred? Do those get added to the conversion costs? or are they instead of the conversion costs (and only incurred for 8 bit accesses)
* do wait states on the 8 bit bus also appear on the 16 bit bus?
* the manual says the CPU is connected to both the 8 and 16 bit bus (I understand this is possible for the 8086), but how can one be clocked at 4MHz and the other at 8MHz?

There are definitely additional wait states added by the CGA subsystem that are not in that table of wait states for the buses. The manual implicitly says this, as it says wait states can total between 12 and 46 cycles (including all the wait states added automatically by the bus). I would guess the additional wait states are due to the dot clock being out of sync with the processor clock (it's neither at 4 nor 8 MHz). The manual hints that this is the reason.

Sorry if I sound like I am confused or talking nonsense. I am really not that familiar with how CPUs and buses interact.

Edit: hmm, the block diagram in the manual does not show the 8 bit bus connected directly to the CPU. It shows the 16 bit bus connected to a bus gate array which is in turn connected to an 8 bit bus. Now I wonder if the 16-8 bit conversion is always incurred, even for stosb's.

Last edited by wbhart on 2020-02-18, 21:18. Edited 1 time in total.

Reply 7 of 26, by wbhart

Posted on 2020-02-18, 21:14

Oh that's annoying! My 8086 book says the timing for a stosb and stosw is the same, at 11 cycles. But other references give 15 cycles for stosw and 11 for stosb.

Reply 8 of 26, by wbhart

Posted on 2020-02-18, 21:45

Another thing that bothers me is this statement in the manual:

"The VDU display timing and system CPU/DMA timing are derived from different, unrelated reference frequencies. For this reason CPU accesses to the display RAM must be synchronised to the display timing by the VDU controller, and this is done by inserting CPU wait states as appropriate."

If the RAM is dual ported, doesn't this imply the two things are largely independent?

The dual porting would certainly explain the lack of snow in text mode.

Reply 9 of 26, by reenigne

Posted on 2020-02-18, 22:14

The reason why some references say 11 cycles for STOSW and some say 15 is that the former are for 8086 and the latter for 8088 - there is an additional 4-cycle penalty on 8088 due to the narrower bus.

STOSW without a REP prefix is documented as being 15 cycles on an 8088, but I measured it as 16 when running lots of them in a row (I think it's actually pretty difficult to get the 15 cycle case - the bus and queue state has to be lined up just right). When accessing VRAM on the PC1512 the 16-bit access has to be broken up into two 8-bit accesses (because the VRAM bus is 8 bits) so the situation is more similar to the STOSW-on-an-8088 case. Then each of those 8-bit accesses takes 8 cycles (because the 8-bit bus is 4MHz I guess - perhaps for compatibility with slow 8-bit ISA cards designed for a 4.77MHz ISA bus speed).

Because an 8086 (on the 16-bit bus) can fetch 2 bytes in 4 cycles you can actually get a 2 cycle per instruction throughput when executing a sequence of instructions that take 2 cycles each. On an 8088 the bus throughput limits the CPU to an instruction every 4 cycles minimum average.
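
A very simplified model of that limit, as a sketch: in steady state an instruction costs whichever is larger, its execution time or the time needed to prefetch its bytes. It assumes 4 clocks per bus cycle, no wait states and no data accesses competing for the bus; `cycles_per_instruction` is a made-up helper.

```python
def cycles_per_instruction(exec_cycles, length_bytes, bus_bytes):
    """Steady-state cost: the slower of execution and prefetch.

    Simplified model (an assumption): each bus cycle takes 4 CPU clocks and
    fetches `bus_bytes` bytes, with no wait states and no data accesses.
    """
    fetch_cycles = 4 * length_bytes / bus_bytes
    return max(exec_cycles, fetch_cycles)

print(cycles_per_instruction(2, 1, bus_bytes=2))  # 8086, 2-cycle 1-byte instruction
print(cycles_per_instruction(2, 1, bus_bytes=1))  # 8088, same instruction
print(cycles_per_instruction(3, 1, bus_bytes=2))  # 8086, NOP (3 cycles, 1 byte)
```

In this model a run of NOPs on the 8086 is execution-bound at 3 cycles each, while the same run on an 8088 is fetch-bound at 4.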

That is really interesting about writing more bitplanes taking longer - I was wondering that myself. It makes some sense when thinking about how the data might be organised in VRAM. Definitely worth doing some experiments. If the timing does depend on the number of bitplanes written then it would be interesting to see how many bitplanes are used in the other modes. I'd guess (based on memory bandwidth considerations) 1-4 for 640x200, 1 for 320x200, 1 for 40x25 text and 2 for 80x25 text. From the point of view of the CPU it would look like all planes are written simultaneously (just with a longer wait state) but the VDU would be breaking up that write into 4.

The conversions from 16-bit to 8-bit will be incurred for any access which looks to the CPU like a 16-bit access. So any access that is word sized and word aligned.

The 8-bit memory access costs will be incurred when accessing 8 bit devices (i.e. anything except system RAM and 16-bit aware ISA cards). The costs will be added to the conversion costs.

It is probably possible to get wait states on the 16-bit bus but probably not with the stock hardware. The system RAM is the only stock thing on the 16-bit bus, and that doesn't have any wait states. But an ISA card could put wait states on the 16-bit bus.

The 8-bit bus can run at a different speed to the 16-bit bus because there is hardware in the PC1512 that mediates between the CPU and the two buses. To the CPU (running at 8MHz) it looks like the devices on the 8-bit bus just have more wait states. To the devices, it looks like the CPU is an 8088 running at 4MHz.

Yes, the mediation between the 14.318MHz dot clock (actually possibly the 1.79MHz/895kHz character clock) and the CPU clock will almost certainly introduce some delay as well.

Reply 10 of 26, by reenigne

Posted on 2020-02-18, 22:21

The RAM is dual ported but that doesn't mean that the two ports are completely independent - the VDU controller still needs to tell the RAM which addresses to send to the raster when. That is easier to do if all the VRAM control logic is derived from the same clock, i.e. the one that comes from the 28.6MHz crystal, since the CPU already has a mechanism for inserting wait states to coordinate things, but the interface between the VDU and the VRAM doesn't (extra circuitry would have to be added for that).

Reply 11 of 26, by wbhart

Posted on 2020-02-19, 00:08

reenigne wrote on 2020-02-18, 22:14:

The reason why some references say 11 cycles for STOSW and some say 15 is that the former are for 8086 and the latter for 8088 - there is an additional 4-cycle penalty on 8088 due to the narrower bus.

Of course there is, and I feel silly now for not noticing that. I actually originally wrote that my 8086 book gave the 8088 timings, then got confused and thought to myself, "no the timings are the same on both CPUs". It's particularly silly of me because about 5 minutes later I read exactly why the timings are different, and promptly forgot what I read. Sigh.

reenigne wrote on 2020-02-18, 22:14:

Because an 8086 (on the 16-bit bus) can fetch 2 bytes in 4 cycles you can actually get a 2 cycle per instruction throughput when executing a sequence of instructions that take 2 cycles each. On an 8088 the bus throughput limits the CPU to an instruction every 4 cycles minimum average.

And again, this is so obvious now that you've said it. I'd convinced myself that fetching a NOP took 4 cycles. So NOP is 3 cycles and if there's a few of them, that is the total cost.

reenigne wrote on 2020-02-18, 22:14:

That is really interesting about writing more bitplanes taking longer - I was wondering that myself. It makes some sense when thinking about how the data might be organised in VRAM. Definitely worth doing some experiments. If the timing does depend on the number of bitplanes written then it would be interesting to see how many bitplanes are used in the other modes. I'd guess (based on memory bandwidth considerations) 1-4 for 640x200, 1 for 320x200, 1 for 40x25 text and 2 for 80x25 text. From the point of view of the CPU it would look like all planes are written simultaneously (just with a longer wait state) but the VDU would be breaking up that write into 4.

I'm totally not convinced the 4 planes take longer. The reason is that in graphics mode 1, the memory for all four planes is written to, but only one of the planes is read. However, reading and writing take the same time. Replacing stosw's with lodsw's makes no difference to timings. But I will test it given that someone suggested it is slower.

reenigne wrote on 2020-02-18, 22:14:

The conversions from 16-bit to 8-bit will be incurred for any access which looks to the CPU like a 16-bit access. So any access that is word sized and word aligned.

I'm totally confused about how the CPU indicates that it wants to do an 8 bit write. There aren't two separate buses physically attached to it. I read that IO and memory access is multiplexed in the 8086. But there's 20 address lines and 16 data lines.

But just to confirm: you don't think that a stosb to CGA memory incurs any 16-8 bit conversion wait states? Then I really don't understand the timings.

I am quite sure two stosb's to CGA memory take significantly longer than one (aligned) stosw (but gosh, I totally forgot about alignment). I confirmed this by noting that some of the stosw's in the seven stosw per raster pattern cannot be replaced with 2 stosbs without pushing things onto a second frame instead of everything fitting in a single frame.

But the costs don't add up:

2 stosbs = 2x11 cycles + 2x8 cycles of 8 bit memory accesses (including what the manual calls bus cycles) = 38 cycles

1 stosw = 15 cycles + 16 cycles of 16->8 bit conversion costs + 2x8 cycles of 8 bit memory accesses = 47 cycles

I suppose that in theory it is possible that two stosb's separates the 8 bit requests by enough that additional CGA wait states always intervene....

reenigne wrote on 2020-02-18, 22:14:

The 8-bit memory access costs will be incurred when accessing 8 bit devices (i.e. anything except system RAM and 16-bit aware ISA cards). The costs will be added to the conversion costs.

I vaguely thought there were only 8 bit ISA slots in the PC1512.

I have thought of some things I can actually try:

* Try removing each stosw in turn in the pattern I've found so far and replace them with nops, to get approximate timings for each stosw
* Make a macro for 240 rasters that only does stosw's and nops and then follow it by another 24 or so rasters that actually change the background colour. Then I can break the 240 rasters up into sets of 2, 3, 4, 5, 6 and see if I can insert more nops than the "optimal" pattern I've found so far, every 2, 3, 4, 5 or 6 rasters. Given that I have a pretty stable pattern, it seems likely that there is some fixed pattern repeated over and over. But it's not necessarily the same every raster. I might have just found the lowest common denominator. This is especially true as two of the three-nop patterns seem awfully like they should be four nops. Only every few frames run over if I put four in those positions.
* Try changing the palette every second raster to get more colours (interrupts probably have to be off, e.g. the keyboard would probably stuff this up; might be ok for a demo effect though)
* Try to figure out why I never see anything like 46 cycle gaps (with 3 cycles per nop, the most I ever see is actually 15 cycles)
* Time writing to multiple planes (this has to be in graphics mode 2, not the CGA mode, which is Amstrad's mode 1; the memory is not treated as bitplanes in mode 1)
* Figure out why stosb's aren't faster than stosw's in this CGA implementation
* Do timings of exactly the same sequences of stosw's and nop's, but into main memory instead of CGA memory, to get an exact figure for the difference
* Try instructions of different numbers of cycles instead of nop's to try to get even more precise information
* Do all the same analysis for movsw's into CGA memory from main memory

Reply 12 of 26, by wbhart

Posted on 2020-02-19, 00:20

reenigne wrote on 2020-02-18, 22:21:

The RAM is dual ported but that doesn't mean that the two ports are completely independent - the VDU controller still needs to tell the RAM which addresses to send to the raster when. That is easier to do if all the VRAM control logic is derived from the same clock, i.e. the one that comes from the 28.6MHz crystal, since the CPU already has a mechanism for inserting wait states to coordinate things, but the interface between the VDU and the VRAM doesn't (extra circuitry would have to be added for that).

According to the block diagram, there is a VDU gate array (a phrase Amstrad seem to have made up) between what looks like the 8 bit bus, which the VDU RAM seems to be connected to, and the VDU. So I wonder if that extra circuitry is actually there.

This would also explain why I find a nice regular pattern (I think), instead of things just essentially never completely syncing up. Could it not be that there is some kind of very small buffer in the VDU gate array and it inserts wait states on the 8 bit bus every time it needs to access it, but that this is done in a very regular pattern?

I doubt there are data sheets for these chips, otherwise we could look up what those chips actually do.

Reply 13 of 26, by reenigne

Posted on 2020-02-19, 12:13

wbhart wrote on 2020-02-19, 00:08:

I'm totally not convinced the 4 planes take longer. The reason is that in graphics mode 1, the memory for all four planes is written to, but only one of the planes is read. However, reading and writing take the same time. Replacing stosw's with lodsw's makes no difference to timings. But I will test it given that someone suggested it is slower.

It'd probably make for a simpler hardware design if the accesses take the same amount of time no matter the mode or how many bitplanes are active or whether it is a read or a write. But it'd be a nice optimisation for the machine to not do more memory cycles than it needs to. Anyway, it's an easy thing to check.

wbhart wrote on 2020-02-19, 00:08:

I'm totally confused about how the CPU indicates that it wants to do an 8 bit write. There aren't two separate buses physically attached to it. I read that IO and memory access is multiplexed in the 8086. But there's 20 address lines and 16 data lines.

That is what the BHE (Bus High Enable) pin on the 8086 is for.

wbhart wrote on 2020-02-19, 00:08:
But just to confirm: you don't think that a stosb to CGA memory incurs any 16-8 bit conversion wait states? Then I really don't understand the timings.

A stosb to CGA memory is one access on the 8-bit bus, so 8 cycles (4 cycles of wait state). Plus whatever the CGA-specific wait state is.

wbhart wrote on 2020-02-19, 00:08:
I am quite sure two stosb's to CGA memory takes significantly longer than one (aligned) stosw (but gosh, I totally forgot about alignment). I confirmed this by noting that some of the stosw's in the seven stosw per raster pattern cannot be replaced with 2 stosbs without pushing things onto a second frame instead of everything fitting in a single frame.

Yes, that's expected. Two stosbs will always take significantly longer than even an unaligned stosw, because there are 8 (or 7) cycles of per-instruction overhead to these instructions.

wbhart wrote on 2020-02-19, 00:08:
But the costs don't add up:

2 stosbs = 2x11 cycles + 2x8 cycles of 8 bit memory accesses (including what the manual calls bus cycles) = 38 cycles

1 stosw = 15 cycles + 16 cycles of 16->8 bit conversion costs + 2x8 cycles of 8 bit memory accesses = 47 cycles

stosw is 11 cycles on 8086. I think the 2x8 cycles of 8 bit memory accesses are included in the conversion costs. The total time for a 16-bit access on the 8-bit bus is 2uS or 16 cycles, 12 more than the 4 it would be on the 16-bit bus. So the stosw time would be 11+12 = 23 cycles, not including CGA-specific wait states.
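
To compare the estimates side by side, here is a small sketch. The first two functions just restate the quoted sums; the last two follow the accounting in this reply (8086 base time plus only the extra bus time), with the STOSB figure derived from the 4 cycles of wait state mentioned above. The function names are made up, and none of these include CGA-specific wait states.

```python
def stosb_pair_quoted():
    # Quoted estimate: two 11-cycle STOSBs plus two 8-cycle 8-bit accesses.
    return 2 * 11 + 2 * 8

def stosw_quoted():
    # Quoted estimate: 15 cycles + 16 cycles conversion + two 8-cycle accesses.
    return 15 + 16 + 2 * 8

def stosw_revised():
    # 8086 base time plus the extra bus time: 16 cycles on the 8-bit bus
    # instead of the 4 a 16-bit access would normally take.
    return 11 + (16 - 4)

def stosb_pair_revised():
    # Same reasoning per byte: 11 cycles plus 4 cycles of wait state, twice.
    return 2 * (11 + 4)

print(stosb_pair_quoted(), stosw_quoted())    # 38 47
print(stosb_pair_revised(), stosw_revised())  # 30 23
```

Under the revised accounting a pair of STOSBs comes out slower than one STOSW, which matches the observation that swapping them in breaks the pattern.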

wbhart wrote on 2020-02-19, 00:08:
I suppose that in theory it is possible that two stosb's separates the 8 bit requests by enough that additional CGA wait states always intervene....

That is quite possible.

wbhart wrote on 2020-02-19, 00:08:
I vaguely thought there were only 8 bit ISA slots in the PC1512.

Ah yes, you're right - that's a brain fart on my part. In that case there is probably no way to get wait states on the 16 bit bus.

wbhart wrote on 2020-02-19, 00:08:
* Try changing the palette every second raster to get more colours (interrupts probably have to be off, e.g. the keyboard would probably stuff this up; might be ok for a demo effect though)

The keyboard will only give you an interrupt if you press or release a key. But some ISA cards have their own interrupts, so it's always best to work with interrupts off when doing stuff that requires very precise timings.

wbhart wrote on 2020-02-19, 00:08:
* Try to figure out why I never see anything like 46 cycle gaps (with 3 cycles per nop, the most I ever see is actually 15 cycles)

Ah, so the 46 cycle gap was documented rather than observed? It's possible that they were conservative with the documentation and they changed the design so that the refresh wait states were spread out along the scanline, meaning that the 46 cycle wait never actually occurs.

wbhart wrote on 2020-02-19, 00:20:
This would also explain why I find a nice regular pattern (I think), instead of things just essentially never completely syncing up. Could it not be that there is some kind of very small buffer in the VDU gate array and it inserts wait states on the 8 bit bus every time it needs to access it, but that this is done in a very regular pattern?

Yes, I think that sounds very likely.

wbhart wrote on 2020-02-19, 00:20:
I doubt there are data sheets for these chips, otherwise we could look up what those chips actually do.

Definitely a disadvantage to the use of gate arrays! With the PC/XT, the answers are pretty much all there in the schematics and the datasheets for the various chips.

Reply 14 of 26, by wbhart

Posted on 2020-02-19, 13:26

reenigne wrote on 2020-02-19, 12:13:

wbhart wrote on 2020-02-19, 00:08:
2 stosbs = 2x11 cycles + 2x8 cycles of 8 bit memory accesses (including what the manual calls bus cycles) = 38 cycles

1 stosw = 15 cycles + 16 cycles of 16->8 bit conversion costs + 2x8 cycles of 8 bit memory accesses = 47 cycles

stosw is 11 cycles on 8086. I think the 2x8 cycles of 8 bit memory accesses are included in the conversion costs. The total time for a 16-bit access on the 8-bit bus is 2uS or 16 cycles, 12 more than the 4 it would be on the 16-bit bus. So the stosw time would be 11+12 = 23 cycles, not including CGA-specific wait states.

That would mean the numbers really don't add up. A raster is 509.6 cpu cycles. If we take 11 cycles for stosw (granted, I used the wrong number here) and just 12 additional cycles for the part of the access not already included in the stosw time, then 7 stosw's and 21 nops should take 161 + 63 = 224 cycles. That is one hell of a long way short of 509.6.

Given that the nops are surely executing while the CGA wait states are happening between stosw's and given that two (or three; I forget) of the stosws cannot be replaced with a pair of stosb's and the rest can only be replaced with pairs of stosb's separated by 2 nops at most, there can't be that many CGA wait states happening during stosw's.

And now you see precisely the source of my confusion regarding the timings! It's almost twice as slow as one would expect based on the info in the manual.
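
Spelling the discrepancy out as a quick sketch, using only the figures already mentioned in this thread (23 cycles per stosw, 3 per nop, 509.6 per raster):

```python
SCANLINE_CPU_CYCLES = 509.6
STOSW_CYCLES = 23   # 11 + 12, per the estimate quoted above
NOP_CYCLES = 3

expected = 7 * STOSW_CYCLES + 21 * NOP_CYCLES
shortfall = SCANLINE_CPU_CYCLES - expected
ratio = SCANLINE_CPU_CYCLES / expected

print(expected)             # 224
print(round(shortfall, 1))  # cycles per raster not accounted for
print(round(ratio, 2))      # observed time relative to the estimate
```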

reenigne wrote on 2020-02-19, 12:13:

wbhart wrote on 2020-02-19, 00:08:
* Try changing the palette every second raster to get more colours (interrupts probably have to be off, e.g. the keyboard would probably stuff this up; might be ok for a demo effect though)

The keyboard will only give you an interrupt if you press or release a key. But some ISA cards have their own interrupts, so it's always best to work with interrupts off when doing stuff that requires very precise timings.

Sure. I was thinking of whether this could be used for games. Probably not, as for example California Games already gets a significant amount of colour bleed at the points where it changes palettes when the keys are mashed in the game (on an IBM PC) in its more-colour CGA mode.

reenigne wrote on 2020-02-19, 12:13:

wbhart wrote on 2020-02-19, 00:08:
* Try to figure out why I never see anything like 46 cycle gaps (with 3 cycles per nop, the most I ever see is actually 15 cycles)

Ah, so the 46 cycle gap was documented rather than observed? It's possible that they were conservative with the documentation and they changed the design so that the refresh wait states were spread out along the scanline, meaning that the 46 cycle wait never actually occurs.

Yes, I am taking the figure of 46 cycles every scanline from the Amstrad manual. I do not observe it in practice.

Going from memory right now, but I think the most optimal pattern per raster for stosws and nops is:

stosw, nop x5, stosw, nop x2, stosw, nop x3, stosw, nop x2, stosw, nop x3, stosw, nop x4, stosw, nop x2.
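
For reference, a small sketch that tallies this pattern. It only parses the text above and assumes 3 cycles per nop as elsewhere in the thread; `parse` is a throwaway helper.

```python
import re

PATTERN = ("stosw, nop x5, stosw, nop x2, stosw, nop x3, stosw, nop x2, "
           "stosw, nop x3, stosw, nop x4, stosw, nop x2")
NOP_CYCLES = 3  # 8086 NOP, as used elsewhere in the thread

def parse(pattern):
    """Return the number of nops that follow each stosw, in order."""
    runs = []
    for token in (t.strip() for t in pattern.split(",")):
        if token == "stosw":
            runs.append(0)
        else:
            match = re.fullmatch(r"nop x(\d+)", token)
            runs[-1] += int(match.group(1))
    return runs

runs = parse(PATTERN)
print(len(runs), sum(runs))              # 7 stosws, 21 nops
print([n * NOP_CYCLES for n in runs])    # gap in cycles after each stosw
print(max(runs) * NOP_CYCLES)            # longest gap: 15 cycles
```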

I don't remember off the top of my head which of the stosws can and which can't be replaced with a pair of stosbs.

I tried to find a two raster pattern last night, and whilst the behaviour was very slightly different, there seems to be no variation in the maximum number of nops that do not cause additional glitches.

Again going from memory, but I recall that the nop x5 and the final nop x4 and nop x2 absolutely can't be increased. The others can nearly be increased by a single nop each, but every few frames there will be glitches as a stosw gets delayed until the next gap. This could be due to other overheads in my code, but I did some work to cut down the variation in detection of vertical retrace and unfortunately that just didn't make any difference. If I add additional nops, it occasionally misses the best place to insert stosws, and this causes very regular and noticeable glitches, though clearly on most rasters the extra nop is tolerated without causing a stosw to miss its schedule. I also tried adding additional rasters at the beginning, with fewer or no nops in between, to help things settle down before inserting extra nops in the pattern later. Again, no difference.

At one point I had the code set up in such a way that addition of a single extra nop per frame (not per raster) would cause totally different behaviour. That was very surprising, and I don't have an explanation just yet, though my vertical retrace code is designed to be very fragile so it will quit immediately if it misses detection in a small window. The extra nop was presumably pushing it off one end of the short window I set up. But this is still very surprising as the additional nop was after the v-retrace detection, which means that everything else in the frame was so tightly packed that it pushed a subsequent v-retrace detection off the ends of the window for detection.

I will try three raster patterns tonight. I intend to go all the way up to 6 raster patterns if necessary, as the numbers as I currently have them just don't add up.

Last edited by wbhart on 2020-02-19, 14:20. Edited 1 time in total.

YouTube Channel - PCRetroTech

Reply 15 of 26, by reenigne

Posted on 2020-02-19, 14:03

wbhart wrote on 2020-02-19, 13:26:

That would mean the numbers really don't add up. A raster is 509.6 cpu cycles. If we take 11 cycles for stosw (granted, I used the wrong number here) and just 12 additional cycles for the part of the access not already included in the stosw time, then 7 stosw's and 21 nops should take 161 + 63 = 224 cycles. That is one hell of a long way short of 509.6.

Yeah, most of the difference will be the CGA wait states, I expect. There may also be cycles stolen by system DRAM refresh, but that's probably a smaller effect. I'm not sure how DRAM refresh works on the PC1512.
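
For reference, the arithmetic in the quoted estimate can be reproduced with a rough sketch. The 912 crystal ticks per scanline follow from the 76 PIT cycles per scanline mentioned earlier in the thread (76 x 12), and the 8 MHz CPU clock from the 24 MHz crystal divided by 3. The 11, 12 and 3 cycle costs are the estimates from the quoted post, not measured values, and all names are made up for illustration.

```python
# Rough cycle budget for one raster, using figures quoted in this thread.
CRYSTAL_HZ = 14_318_180          # 14.318 MHz crystal shared by CGA and PIT
CPU_HZ = 24_000_000 / 3          # 24 MHz crystal divided by 3 (PC1512)
TICKS_PER_SCANLINE = 76 * 12     # 76 PIT cycles, 12 crystal ticks each

# Per-raster pattern from the earlier post: nops following each stosw.
NOPS_AFTER_EACH_STOSW = [5, 2, 3, 2, 3, 4, 2]

STOSW_CYCLES = 11                # estimate from the quoted post
EXTRA_ACCESS_CYCLES = 12         # estimate from the quoted post
NOP_CYCLES = 3


def raster_cycles():
    """CPU cycles available in one scanline."""
    return TICKS_PER_SCANLINE / CRYSTAL_HZ * CPU_HZ


def accounted_cycles():
    """Cycles explained by the stosw and nop estimates alone."""
    stosw = len(NOPS_AFTER_EACH_STOSW) * (STOSW_CYCLES + EXTRA_ACCESS_CYCLES)
    nops = sum(NOPS_AFTER_EACH_STOSW) * NOP_CYCLES
    return stosw + nops


if __name__ == "__main__":
    total = raster_cycles()
    used = accounted_cycles()
    print(f"raster: {total:.1f}, accounted: {used}, unexplained: {total - used:.1f}")
```

The unexplained remainder is the part being attributed to wait states and refresh in the reply above.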

wbhart wrote on 2020-02-19, 13:26:
Sure. I was thinking of whether this could be used for games.

I'm not sure if it works on clones, but on the PC/XT you can turn off the keyboard interrupt and poll the keyboard instead so that you don't spend cycles handling the keyboard in timing-critical code. I guess even on clones you can just turn the keyboard interrupt on for a part of each frame.

wbhart wrote on 2020-02-19, 13:26:
Probably not, as for example California Games already gets a significant amount of colour bleed at the points where it changes palettes when the keys are mashed in the games (on an IBM PC) in their more-colour CGA mode.

The California Games more-colour CGA mode is kind of buggy, though. It's impressive that they managed to make it work at all, but with what we know today they could have done a much better job.

Reply 16 of 26, by wbhart

Posted on 2020-02-19, 14:27

Rank: Newbie
Posts: 80
Joined: 2019-08-11, 11:00

reenigne wrote on 2020-02-19, 14:03:

wbhart wrote on 2020-02-19, 13:26:

That would mean the numbers really don't add up. A raster is 509.6 cpu cycles. If we take 11 cycles for stosw (granted, I used the wrong number here) and just 12 additional cycles for the part of the access not already included in the stosw time, then 7 stosw's and 21 nops should take 161 + 63 = 224 cycles. That is one hell of a long way short of 509.6.

Yeah, most of the difference will be the CGA wait states, I expect. There may also be cycles stolen by system DRAM refresh, but that's probably a smaller effect. I'm not sure how DRAM refresh works on the PC1512.

But that's my point. Where are the CGA wait states? Why can't I run nops during those wait states? Surely wait states only mean the CPU can't access the bus, not that it stops altogether (granted it has to do instruction fetch).

I totally get that I can't initiate a stosb and then do a nop while the CGA card sits there throwing wait states so that the stosb is delayed. The nop can't start until the stosb is finished. But if I put nops *before* the stosb so that it tries to write a byte at a more optimal time, so that it doesn't actually have to wait for CGA wait states, then things should be much more optimal.

If I use stosb's instead of stosw's I get about 2 nops between each stosb. So the stosw's in my sequence above can't possibly be concealing many CGA wait states.

Reply 17 of 26, by wbhart

Posted on 2020-02-19, 14:32

I know what I can do. I can use shl reg, cl instead of nops. Then the CPU won't need to access the bus and I can get variable length instruction timings.

Perhaps the CGA wait states are really stopping the CPU from using the bus to prefetch the nops.

I should have thought of this earlier.

Reply 18 of 26, by reenigne

Posted on 2020-02-19, 14:35

wbhart wrote on 2020-02-19, 14:27:

But that's my point. Where are the CGA wait states? Why can't I run nops during those wait states? Surely wait states only mean the CPU can't access the bus, not that it stops altogether (granted it has to do instruction fetch).

I totally get that I can't initiate a stosb and then do a nop while the CGA card sits there throwing wait states so that the stosb is delayed. The nop can't start until the stosb is finished. But if I put nops *before* the stosb so that it tries to write a byte at a more optimal time, so that it doesn't actually have to wait for CGA wait states, then things should be much more optimal.

If I use stosb's instead of stosw's I get about 2 nops between each stosb. So the stosw's in my sequence above can't possibly be concealing many CGA wait states.

So there are two different possible types of wait states. One is "wait until a particular clock is at a particular phase" and the other is "wait a certain number of cycles". The first type can be replaced by an earlier nop but the second can't. It's likely that there are some of each type introduced by the VDU controller: the first type for synchronisation with the character clock and DRAM refresh, and the second type for accessing (up to) four memory bitplanes.
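
The distinction can be illustrated with a toy model. This is only a sketch: the 16-cycle period, the start time and the function names are hypothetical and say nothing about the real PC1512 hardware. It shows that padding before an access is absorbed by a phase-type wait but adds in full to a fixed-length wait.

```python
def phase_wait(issued, period=16, phase=0):
    """Type 1: stall until the clock reaches `phase` modulo `period`."""
    return (phase - issued) % period


def fixed_wait(issued, cycles=11):
    """Type 2: stall a fixed number of cycles, whenever the access starts."""
    return cycles


def finish_time(start, padding, wait):
    """Cycle at which an access completes when `padding` cycles of nops precede it."""
    issued = start + padding
    return issued + wait(issued)


if __name__ == "__main__":
    for name, wait in (("phase", phase_wait), ("fixed", fixed_wait)):
        for padding in (0, 9, 12):   # e.g. zero, three or four 3-cycle nops
            print(name, padding, finish_time(5, padding, wait))
```

In this model nine cycles of padding before a phase-type wait cost nothing (the access still completes at cycle 16), while twelve cycles overshoot the slot and push the access to the next one at cycle 32, which resembles the "delayed until the next gap" glitches described earlier. With the fixed wait, every padding cycle is added to the finish time.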

Reply 19 of 26, by reenigne

Posted on 2020-02-19, 14:41

reenigne Offline

wbhart wrote on 2020-02-19, 14:32:

I know what I can do. I can use shl reg, cl instead of nops. Then the CPU won't need to access the bus and I can get variable length instruction timings.

Perhaps the CGA wait states are really stopping the CPU from using the bus to prefetch the nops.

I should have thought of this earlier.

I have done exactly this in the past, but with MUL instructions instead of shifts. The MUL instruction can be tuned to an accuracy of 1 cycle (its timing depends on the number of set bits in the accumulator), whereas each increment of CL lengthens a shift instruction by 4 cycles.
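
The difference in granularity can be sketched as follows. The shift model uses the documented 8086 timing for a register shift by CL (8 cycles plus 4 per bit). The MUL model is a deliberate simplification of the description above: one extra cycle per set bit on top of a base cost, where the base of 118 is an assumption taken from the lower end of the documented 16-bit register MUL range. The real instruction has other data-dependent terms, so treat the absolute numbers as placeholders.

```python
SHIFT_BASE = 8          # 8086: shift reg, CL costs 8 + 4 per bit
SHIFT_PER_COUNT = 4
MUL_BASE = 118          # assumed base cost for a 16-bit register MUL


def shift_cycles(cl):
    """Cycles for `shl reg, cl` with the given shift count."""
    return SHIFT_BASE + SHIFT_PER_COUNT * cl


def mul_cycles(accumulator, base=MUL_BASE):
    """Simplified MUL cost: base plus one cycle per set bit in the accumulator."""
    return base + bin(accumulator & 0xFFFF).count("1")


if __name__ == "__main__":
    print([shift_cycles(n) for n in range(4)])               # steps of 4 cycles
    print([mul_cycles((1 << n) - 1) for n in range(4)])      # steps of 1 cycle
```

Under these assumptions a shift can only hit every fourth cycle count, while loading the accumulator with a different number of set bits moves the MUL delay one cycle at a time.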
