VOGONS



Reply 20 of 26, by wbhart

Rank Newbie
reenigne wrote on 2020-02-19, 14:41:
wbhart wrote on 2020-02-19, 14:32:

I know what I can do. I can use shl reg, cl instead of nops. Then the CPU won't need to access the bus and I can get variable length instruction timings.

Perhaps the CGA wait states are really stopping the CPU from using the bus to prefetch the nops.

I should have thought of this earlier.

I have done exactly this in the past but with MUL instructions instead of shifts. The MUL instruction can be tuned with an accuracy of 1 cycle (number of set bits in the accumulator), whereas each increment of CL adds 4 cycles to a shift instruction.

I think the minimum time for mul is probably too high here, though.

YouTube Channel - PCRetroTech

Reply 21 of 26, by wbhart

There is no three raster pattern that is better than the one raster pattern I mentioned above.

Both of the sequences of 3 nops can be replaced with 4 (especially the second sequence of 3), and one almost gets away with it. There's just a tiny bit of jitter as the stosws occasionally miss their schedule.

But it's easy to see that just inserting one extra nop per three rasters isn't going to have as great an effect as doing it every raster. So there's nothing special about the fact that I'm using a pattern that repeats every 3 rasters.

Technically there is a slight difference in behaviour. But it's not enough to conclude that something is repeating every three rasters.

So I guess I go on to 4, 5 and 6 raster patterns. I'll maybe do one a day, as it gets boring. I could automate it, but writing code for that is also error-prone and boring, so I prefer to do it by hand and actually inspect the effect visually.

Reply 22 of 26, by wbhart

Ah, the reason why adding a single nop per frame made such a difference is because of instruction alignment. It probably matters whether stosw nop or nop stosw is prefetched, since if the nop has to be prefetched at the wrong time, it pushes everything else out.

Reply 23 of 26, by wbhart

I tried the shl ax, cl trick. The runs of 5, 2, 3, 2, 3, 4 and 2 nops can be replaced with shl ax, cl with cl = 2, 1, 1, 1, 2, 2, 1 respectively. Since shl reg, cl takes 8 + 4*cl cycles, I make that 16, 12, 12, 12, 16, 16, 12 cycles, so the gaps between the stosws account for just 96 cycles in total. The nops were just 63 cycles in total (21 nops at 3 cycles each), so at least we've found another 33 cycles that had been missing.

Now (509.6 - 96)/7 ≈ 59, so each stosw is taking about that long on average (unless there is a wildly more optimal pattern out there somewhere).
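The gap arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the nominal 8088 register-operand timings (nop = 3 cycles, shl reg, cl = 8 + 4*cl cycles) and takes 509.6 cycles per raster as given; actual timings on the machine can differ because of prefetch-queue and wait-state effects.

```python
# Cycle budget for one raster of the 7-stosw pattern.
# Assumed nominal 8088 timings (register operands, so no data bus access):
#   nop          -> 3 cycles
#   shl reg, cl  -> 8 + 4*cl cycles
NOP_CYCLES = 3
RASTER_CYCLES = 509.6      # cycles per raster, figure taken from the post
STOSW_PER_RASTER = 7


def shl_reg_cl_cycles(cl: int) -> int:
    """Nominal execution time of shl reg, cl for a given shift count."""
    return 8 + 4 * cl


# Runs of nops between consecutive stosws, and the CL values that replace them.
nop_runs = [5, 2, 3, 2, 3, 4, 2]
cl_values = [2, 1, 1, 1, 2, 2, 1]

nop_total = sum(NOP_CYCLES * n for n in nop_runs)            # 21 nops
shl_total = sum(shl_reg_cl_cycles(cl) for cl in cl_values)
extra = shl_total - nop_total                                # cycles "found"

# Whatever is left of the raster is attributed to the stosws themselves.
avg_stosw = (RASTER_CYCLES - shl_total) / STOSW_PER_RASTER

print(f"nops: {nop_total}, shl: {shl_total}, extra: {extra}, "
      f"average stosw: {avg_stosw:.1f}")
# prints: nops: 63, shl: 96, extra: 33, average stosw: 59.1
```

The average is only as good as the 509.6 figure and the assumption that the shifts really run at their nominal speed while the bus is busy.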

The next thing I will do is try to get timings for the individual stosws (or at least their average times at each position in the pattern I'm using).

Reply 24 of 26, by wbhart

reenigne wrote on 2020-02-19, 14:35:

So there are two different possible types of wait states. One is "wait until a particular clock is at a particular phase" and the other is "wait a certain number of cycles". The first type can be replaced by an earlier nop but the second can't. It's likely that the VDU controller introduces some of each type: the first type for synchronisation with the character clock and DRAM refresh, and the second type for accessing (up to) four memory bitplanes.

Yes, I certainly don't expect to get around the memory access wait states. But just to confirm: neither type halts the CPU, right? They only tie up the bus, so if the CPU doesn't need to use the bus, all is OK?
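The distinction between the two kinds of wait state can be illustrated with a toy model. Everything here is hypothetical: PERIOD and FIXED_WAIT are made-up values chosen only to show the effect, not measurements of the VDU controller. With a phase-aligned wait, a nop placed before the access can be absorbed for free; with a fixed wait, the nop always pushes the access back by its full cost.

```python
# Toy model of the two kinds of wait state (all values hypothetical).
# Times are in CPU cycles; `ready` is the cycle at which the CPU is ready
# to start a bus access.
PERIOD = 8        # hypothetical period of the clock the access must align to
FIXED_WAIT = 4    # hypothetical fixed number of wait cycles per access
NOP_CYCLES = 3    # an 8088 nop takes 3 cycles


def phase_aligned_start(ready: int) -> int:
    """Type 1: the access can only start on the next multiple of PERIOD."""
    return -(-ready // PERIOD) * PERIOD   # integer ceiling to a PERIOD boundary


def fixed_wait_start(ready: int) -> int:
    """Type 2: the access always starts FIXED_WAIT cycles after it is ready."""
    return ready + FIXED_WAIT


def without_and_with_nop(wait, ready: int) -> tuple[int, int]:
    """Start cycle of the access with no nop, and with one nop inserted before it."""
    return wait(ready), wait(ready + NOP_CYCLES)


if __name__ == "__main__":
    ready = 1
    # Phase-aligned: the nop fits inside time the CPU would have waited anyway.
    print("phase-aligned:", without_and_with_nop(phase_aligned_start, ready))  # (8, 8)
    # Fixed: the nop delays the access by its full 3 cycles.
    print("fixed wait:   ", without_and_with_nop(fixed_wait_start, ready))     # (5, 8)
```

The model only says when the access starts; it does not try to capture whether the execution unit is stalled while waiting.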

Reply 25 of 26, by reenigne

Rank Oldbie
wbhart wrote on 2020-02-20, 00:59:

Yes, I certainly don't expect to get around the memory access wait states. But just to confirm: neither type halts the CPU, right? They only tie up the bus, so if the CPU doesn't need to use the bus, all is OK?

They tie up the bus, but when the CPU is doing an I/O for the purposes of executing an instruction (like the stores of a stosw), the execution unit will stop and wait for the bus to finish before doing anything else. So the wait states incurred by the 4MHz 8-bit bus and the VDU controller will halt the CPU. The only type of bus access that the CPU can do while doing other things at the same time is instruction prefetching.

Reply 26 of 26, by wbhart

Last night I thought I would try adding an extra stosb every few rasters to see if it would fit. You can't add an extra stosb every raster or every two rasters, but it seems to be possible to add one every three rasters, maybe.

The problem seems to be that jitter, caused by detecting v-retrace at random times and by other things going on (e.g. DRAM refresh), means that some of the frames don't finish in time. The way I was handling that previously was by replacing the pattern for the first raster with a few stosws (fewer than the seven that should fit in a raster) so that there is some slack to be taken up.

This strategy doesn't seem to work with the extra stosb every three rasters. I haven't found a way to stabilise the pattern, and about 1-2% of the frames don't complete in time. Moreover, jitter in raster bars towards the end of the frame is greater than an entire raster, and certainly doesn't come anywhere near fitting into the h-retrace period. Inserting nops between the stosws also seems to just increase the number of frames that don't complete in time and doesn't improve stability.

I'm not sure that's a solvable problem, for two reasons: DRAM refresh is not going to go away and probably can't be synced with v-retrace in any way, and the dot clock is probably not a small rational multiple of the CPU clock, so some slack is always needed to get stability.
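The point about the dot clock can be made concrete by tracking where each raster starts relative to the CPU clock edges. This sketch treats the raster length in CPU cycles as an exact rational number, which is an assumption: 509.6 is just the figure used earlier in the thread, and the true ratio is unknown. If the ratio is p/q in lowest terms, raster starts fall at q different sub-cycle phases before the pattern repeats, so the same instruction sequence sees a different alignment on different rasters; if q is large, the pattern effectively never repeats.

```python
from fractions import Fraction


def raster_start_phases(cycles_per_raster: Fraction, rasters: int) -> list[Fraction]:
    """Fractional CPU-cycle offset of each raster start, relative to raster 0."""
    return [(cycles_per_raster * k) % 1 for k in range(rasters)]


def phase_repeat_length(cycles_per_raster: Fraction) -> int:
    """Rasters until a raster again starts on the same CPU clock phase."""
    return cycles_per_raster.denominator   # Fraction is kept in lowest terms


if __name__ == "__main__":
    # Taking 509.6 cycles/raster as exact gives 2548/5, i.e. 5 distinct phases.
    ratio = Fraction("509.6")
    print(phase_repeat_length(ratio))                        # 5
    print([str(p) for p in raster_start_phases(ratio, 6)])   # ['0', '3/5', '1/5', '4/5', '2/5', '0']
    # A slightly different (hypothetical) ratio repeats only every 10000 rasters.
    print(phase_repeat_length(Fraction("509.6127")))         # 10000
```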

I might try adding one stosb per four rasters tonight and see if I can fit that in and still have enough slack for stability.

Having said all that, just putting 7 stosws per raster with zero nops between also did not result in a stable pattern. I actually needed to insert the nops to get the thing stable enough that jitter only occurs during the h-retrace period.

So perhaps the three raster pattern with the extra stosb can be stabilised with some pattern of nops. It doesn't seem all that easy to find if it does exist, though.

I think I may have been incredibly lucky with the 7 stosw per raster pattern that I found. It seems to be hard to find anything similar.

I think this all does add evidence for my contention that there is *something* repeating every raster, though it does show that there is some variability from raster to raster. It's not exactly the same pattern of wait states every time, I don't think.
