ATI Graphics Solution

Reply 20 of 127, by Deunan

Posted on 2019-05-30, 22:48

Rank: Oldbie
Posts: 1575
Joined: 2018-05-29, 12:32

Scali wrote:
bitplanes allow you to cleverly arrange your palette so that setting a single bit can act as a transparency effect for example.

OK, so there is _one_ scenario where having individual bit per every pixel is useful. Except not on any PC cards from that era. But other than that, what other uses are there? I'm serious, give me some examples where setting just one bit out of N that a pixel has would be such a huge win that you'd want them on separate planes. Screen clearing or redrawing requires, on average, full flush of the pixel color anyway. It gets even worse if you have to draw something that is not aligned to 8 pixels, then you still end up with read-modify-write. The whole VGA pixel engine was basically meant to overcome this stupid idea with an even more complicated HW solution.

Everyone usually comes up with "Amiga did it" - well, sure, and it flopped once SVGAs finally moved to banked 256+ color modes. FM Towns was basically a gaming machine made in '89 and it had graphics modes that more-or-less matched the VGA, except with 256 (640x480) and 32k (320x240) color modes. It also had 16-color modes where every byte was 2 pixels. And a sprite engine. Nothing about this is weird or difficult to use - clearly, someone sat down and thought "If I wanted to make a fast DMA blit to VRAM, how should the pixels be packed?".

Scali wrote:
I actually need to pull the display port cable loose, then power-cycle the PC, and then plug it back in once the PC is booting, else the video simply won't come on anymore.

I don't think that's a driver issue. Possibly a HW one. I've noticed that the 2200G APU that my parents have now sometimes will go to sleep and then will not wake up with a picture. I have to disconnect and reconnect the video cable (I can hear Windows sounds for device lost/device found as I do that) to get the monitor to show something again.
But, it's an old LCD that is now using a DP to VGA adapter. I've never had this issue with a modern monitor that has native DP input. So I think there's something wrong with how the iGPU detects the active output and I suspect it's not the software that's the problem. AMD has been re-using the same output blocks in all their GCN series (or even earlier than that) and possibly there is a glitch there somewhere.

Scali wrote:
It had a combined AMD and Intel graphics setup, and it rarely worked when I plugged in a VGA or HDMI cable for a beamer. Never had that problem with combined NV and Intel laptops.

Lucky you. There's no shortage of NV Optimus issues on laptops but I don't have one and can't bring up any first-hand experience. Though I have to say, I've never seen a laptop with Intel+AMD combo that I'd pick over Intel-only or AMD-only. Maybe that recent thing that Intel did, with AMD GPU and VRAM on interposer, would actually work and not be gimped somehow by thermals or stupid design decision on the laptop manufacturer's part. But I haven't used that either so can't comment.
What I can say is I've seen a lot of very happy AMD Zen APU users. It's like the perfect gaming machine for people who can't afford a modern console but need a PC for the kids anyway. Again, it "just works".

Scali wrote:
Which made me wonder how you would ever get both these cards working in the same system.

You don't - and that's not just an AMD thing. You can't really use 2 (or more) cards with different GPU family chips. But the way I see it you've learned an important lesson - do not use OpenGL and there's no problem.

Scali wrote:
AMD also sucks at DirectX 11. They simply didn't even bother to implement the multithreading model at all.

That's a huge misconception. DX11 was never meant to be threaded - you can tell just by looking at the API calls. They have to come from the same thread, period. It's a bit complicated, but long story short, it boils down to having an API that is either thread-safe or fast. DX11 was meant to be fast, and not do any unnecessary sanity checks at runtime. The cost of that is there are cases where you can't tell what thread submitted what - because of how the Windows driver model, and process isolation in general, was designed.
There is one particular case where you can build a list of commands and submit that, and the building can be done in a separate thread (submission must still be tied to the thread that owns all DX11 objects). But there are some serious limits to what this method can do. It works really well for NV who does a lot of pre-rendering things driver-side (tile binning, early out-of-bound checks, etc). AMD however relies on HW doing all that stuff and for them the requirement is you have to submit all the stuff as early as you can so that the GPU can start working on it right away and not choke. That's great for computations but doesn't fit the DX11 rendering pipeline well. They sort-of brute-force this in other words. There's a reason why consoles use mostly compute for actual rendering, and why such game engines on PC (new Doom for example) work so well on AMD.

TL;DR: It's not really a driver limitation on AMD's side but the result of having a compute-oriented architecture doing rendering via an API that's not designed for it. As for why AMD has just one arch for everything rather than 2 (compute and gaming) like NV does, it's a huge topic that's mostly finance and economies-of-scale related.

Reply 21 of 127, by Scali

Posted on 2019-05-31, 09:22

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Deunan wrote:
OK, so there is _one_ scenario where having individual bit per every pixel is useful.

No, I gave one example. There are more, but it's more about understanding how to think in bitplanes.
Apparently you are a graphics n00b, but are somehow convinced you're an expert, so you're not open to the possibility that you need to broaden your horizon and look at this bitplane-thing with an open mind, accepting that you do not understand it yet, and it requires effort to do so.
It's pointless for me to iterate examples, when just a single one is already met with so much resistance and cognitive dissonance from your side.
Fact is that many systems (including many arcade machines) used a bitplane approach back in the day, so apparently consensus was that this was a good idea. They can't all be stupid, and have made this huge mistake that the great Deunan could have prevented, right?

Deunan wrote:
Except not on any PC cards from that era.

EGA and VGA do.
Here's an example:
https://youtu.be/Qhym3zCa7Os?t=169

This runs in a 16-colour mode, so 4 bitplanes. The shadow and the translucent effect are done by rendering these polygons to specific bitplanes, and cleverly choosing the palette.
It's quite possibly the fastest polygon routine ever made on PC, and runs very smoothly even on a simple 286. Because of the clever use of bitplanes, the translucency effect is basically 'free'. Try doing that in VGA's chunky mode 13h, and you need a much faster CPU and video card to achieve the same.
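
To make the palette trick concrete, here is a minimal Python sketch that simulates a 16-colour planar screen in software; it does not touch real EGA/VGA hardware. The palette layout (entries 8-15 as half-brightness copies of entries 0-7) and the helper names `fill_rect` and `pixel_rgb` are assumptions for illustration, not how the demo in the video is actually coded.

```python
# Simulated 16-colour planar screen: each pixel is a 4-bit palette index,
# and bit n of that index lives in bitplane n.
WIDTH, HEIGHT = 16, 4

# Hypothetical palette: entries 0-7 are base colours, entries 8-15 are the
# same colours at half brightness. Bit 3 of the index then means "in shadow".
BASE = [(0, 0, 0), (0, 0, 255), (0, 255, 0), (0, 255, 255),
        (255, 0, 0), (255, 0, 255), (255, 255, 0), (255, 255, 255)]
PALETTE = BASE + [(r // 2, g // 2, b // 2) for (r, g, b) in BASE]

def make_planes():
    # One bit per pixel per plane, stored as rows of 0/1 for readability.
    return [[[0] * WIDTH for _ in range(HEIGHT)] for _ in range(4)]

def fill_rect(planes, plane_mask, colour, x0, y0, x1, y1):
    """Write `colour` into only the planes selected by `plane_mask`."""
    for p in range(4):
        if plane_mask & (1 << p):
            bit = (colour >> p) & 1
            for y in range(y0, y1):
                for x in range(x0, x1):
                    planes[p][y][x] = bit

def pixel_rgb(planes, x, y):
    # Recombine the four plane bits into a palette index, then look it up.
    index = sum(planes[p][y][x] << p for p in range(4))
    return PALETTE[index]

planes = make_planes()
fill_rect(planes, 0b0111, 6, 0, 0, 16, 4)   # yellow background in planes 0-2
fill_rect(planes, 0b1000, 8, 4, 1, 12, 3)   # shadow: touches plane 3 only

print(pixel_rgb(planes, 0, 0))   # pixel outside the shadow
print(pixel_rgb(planes, 5, 1))   # pixel inside the shadow
```

The shadow rectangle never reads or rewrites the background colour bits; the darkening comes entirely from the palette lookup.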

Deunan wrote:
But other than that, what other uses are there? I'm serious, give me some examples where setting just one bit out of N that a pixel has would be such a huge win that you'd want them on separate planes. Screen clearing or redrawing requires, on average, full flush of the pixel color anyway.

You have to think in layers. You can draw different parts of the screen in different layers, and in some cases only certain layers need to be cleared/redrawn.
Here's another example of that:
https://youtu.be/sxW6CW0RWlg?t=214

The scrolling is done in some bitplanes, and the polygons are drawn in others. This means the polygons don't actually overwrite the background data, and therefore you don't need to update the scroll-area every frame. Only the polygon area.
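
A rough Python sketch of that layering, again as a pure software simulation rather than real hardware access. The split of planes 0-1 for the background and planes 2-3 for the polygons is an assumption made up for this example, as is the helper name `fill_planes`; the demo's actual plane assignment is not known from the video alone.

```python
# Simulated 320x200 16-colour planar screen: 4 planes, 1 bit per pixel each.
PLANE_BYTES = 320 * 200 // 8          # 8000 bytes per plane
planes = [bytearray(PLANE_BYTES) for _ in range(4)]

# Hypothetical split, chosen only for this example.
BACKGROUND_PLANES = (0, 1)
POLYGON_PLANES = (2, 3)

def fill_planes(plane_ids, byte_value):
    """Fill the given planes with one byte value; return bytes written."""
    for p in plane_ids:
        planes[p][:] = bytes([byte_value]) * PLANE_BYTES
    return len(plane_ids) * PLANE_BYTES

fill_planes(BACKGROUND_PLANES, 0b10101010)   # stand-in for the scroller art
fill_planes(POLYGON_PLANES, 0b11110000)      # stand-in for this frame's polygons

background_before = [bytes(planes[p]) for p in BACKGROUND_PLANES]

# Next frame: wipe only the polygon layer so it can be redrawn.
bytes_cleared = fill_planes(POLYGON_PLANES, 0)

background_after = [bytes(planes[p]) for p in BACKGROUND_PLANES]
print(bytes_cleared, background_before == background_after)
```

Clearing the polygon layer costs half of what clearing all four planes would, and the background never needs to be restored.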

And another example of the same technique:
https://youtu.be/QLVi2zKCgCo

The purple background is scrolled in one bitplane.
The blue scroller is in another bitplane (it's just one colour, but the palette is changed per scanline, similar to copper tricks on Amiga).
The remaining two bitplanes are used for the scaling logo and vertical scroller.
This would give you huge headaches in VGA mode.

This is also easy to see in games from that era. The Amiga version will have multi-layered/parallax scrolling, and special gradient colour backdrops and such, all cleverly making use of bitplanes and palette updates.
The PC ports generally lacked this.
Look at Xenon 2 for example (even the Atari ST port has the extra background layer, PC does not), or Blues Brothers.

I think the point you need to take away from this is that not all things that you draw, need to use all bitplanes.
Agony on Amiga is probably the most extreme example of this all: https://youtu.be/O4RnxYPpMKI?t=310
Nobody ever even bothered to port it to other platforms because it would be futile.

By the way, even on the C64, clever coders could create similar multi-layer scrolling:
https://youtu.be/zAtW-3sx_1s?t=368
https://youtu.be/zAtW-3sx_1s?t=702

And all that action at the full 50 Hz speed. Hardly any CGA/EGA/VGA game ran at that speed, even on the most powerful hardware, back in the 80s.
Something about working smarter, not harder.

Deunan wrote:
It gets even worse if you have to draw something that is not aligned to 8 pixels, then you still end up with read-modify-write.

I don't think you understand bitplanes then, because you always need to do a read-modify-write operation. Memory just is addressed per-byte, not per-bit. Alignment has nothing to do with that. The EGA ALU helps you with that though.
Of course the Amiga is even more powerful, because you can scroll and combine individual bitplanes, and you have the blitter to make bitwise operations more efficient.

Deunan wrote:
The whole VGA pixel engine was basically meant to overcome this stupid idea with an even more complicated HW solution.

Not really.
Mode 13h was originally meant as a photorealistic mode (similar to HAM mode on Amiga), and gaming/animation probably was not even a consideration at that time. VGA dates from 1987, when games were still primarily based on 2D, scrolling, sprites, transparency/multilayer etc. All things that VGA isn't particularly good at.
It wasn't until the early 90s that texturemapped software rendering in 3D became a thing, which VGA just happened to be reasonably well-suited to. Even so, many early games, such as Wolf3D and DOOM, still opted to use the undocumented mode X, which is still a sort of bitplaned-mode, over the standard chunky mode 13h.

Deunan wrote:
Everyone usually comes up with "Amiga did it" - well, sure, and it flopped once SVGAs finally moved to banked 256+ color modes

But Amiga dates from 1985, and SVGA didn't take hold until 1992 or so.
SVGA is just far less efficient. A game like Jazz Jackrabbit wasn't possible before 486/VLB machines (Commander Keen specifically stuck to EGA, for better performance, well into the VGA-era, resulting in a game that was playable on contemporary hardware, but far less fancy than similar games on C64, Amiga, Atari, NES etc), while these games are easy to do on stock Amiga hardware from 1985. Scrolling at 50 Hz was never an issue, nor was using large sprites/bobs. Something the PC always struggled with, because it required a lot of read-modify-write over the slow ISA bus, hogging the CPU.
Chunky VGA mode may be easy to use, but 8 bits per pixel requires a lot of bandwidth.

Deunan wrote:
I don't think that's a driver issue.

Pretty sure it is. It always happens when the video card goes into power save mode, and won't wake up anymore. The card never spontaneously stops working in the middle of something. So the driver either doesn't put it into power save mode properly, or it doesn't wake it up properly.
Probably has to do with the fact that I use two display adapters (also the onboard Intel, because I use two 4k monitors, and the R7 360 can only drive one of them). The other adapter never has an issue. And when I used two lower-res monitors both on the AMD, and not using the Intel, I don't think it ever happened either. So 99% sure it's a driver issue.

Deunan wrote:
You don't - and that's not just an AMD thing. You can't really use 2 (or more) cards with different GPU family chips. But the way I see it you've learned an important lesson - do not use OpenGL and there's no problem.

Actually, you can. This was actually a requirement for the Windows 7 driver model, and is the reason why NV's Optimus is possible. Multiple display drivers have to be able to run side-by-side, and have to be able to share resources. This is also required for multi-display setups. For example, you can use one GPU to render video or 3D, and then have the output copied to a window that is hosted on the output of another GPU.
I've used systems with combined NV and Intel, combined AMD and Intel, combined AMD and NV, and also multiple NV chips. The only thing that has failed so far is ATi + ATi. It would have worked if their drivers were designed properly.

Deunan wrote:
That's a huge misconception. DX11 was never meant to be threaded - you can tell just by looking at the API calls.

Lolwat? You talk like you are an expert on graphics, but then you make all these weird, misguided statements.
DX11 specifically introduced the concept of 'deferred context', which allows you to create command lists on separate worker threads.
Here's a presentation from NV on the subject: https://developer.nvidia.com/sites/default/fi … redContexts.pdf
And here is one from Intel: https://software.intel.com/en-us/articles/per … eaded-rendering
Intel actually goes into how performance scales on NV and AMD hardware by pointing out where the workload is shifted to.

But as I said, AMD never implemented this, so any deferred context would just buffer commands and process them on the main thread, when the 'command list' is executed. There are various DX11 API test tools available that show this. NV and Intel can get scaling from using multiple threads, on AMD the performance is always 1x.

Deunan wrote:
TL;DR: It's not really a driver limitation on AMD's side but the result of having a compute-oriented architecture doing rendering via an API that's not designed for it. As for why AMD has just one arch for everything rather than 2 (compute and gaming) like NV does, it's a huge topic that's mostly finance and economies-of-scale related.

TL;DR: you're talking BS.
This is what Microsoft says:
https://docs.microsoft.com/en-us/windows-hard … nd-3-d-pipeline

All drivers should eventually fully support all types of threading operations (that is, all drivers should eventually support all the threading capabilities of the D3D11DDI_THREADING_CAPS structure). However, the driver can require that the API emulate command lists or enforce a single-threaded mode of operation for the driver.

AMD just never got it worked out, and started this nonsense talk that you just reiterated, because AMD fanboys are clueless and buy that sort of crap (compute and gaming two archs? Really? NV just artificially limits the double-precision float performance on consumer cards. That has absolutely nothing to do with the ability to multithread, because the architecture is the same. As you say, it's finance and economies-of-scale related, not a technical issue).
I mean, any sensible person would understand that there is no technical limitation to multithreading. If they can do it with Mantle/DX12/Vulkan, apparently it is possible, and should also be possible within the DX11 API model. After all, the DX11 API is just an abstraction level higher than DX12, and can be implemented on top of DX12. In fact, Microsoft has already done so, with the D3D11on12 API: https://docs.microsoft.com/en-us/windows/desk … rect3d-11-on-12

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 22 of 127, by Deunan

Posted on 2019-05-31, 12:57

Scali wrote:
Apparently you are a graphics n00b, but are somehow convinced you're an expert

I'm sorry, did I strike a nerve there? It was the Amiga comment, wasn't it?

Early computing tried all sorts of things (VAX instruction set says hello), only to finally settle with solutions that actually work. I'm aware bitplanes were used in other systems - but perhaps because the chips were already there? How many arcade machines had fully custom graphics processing vs something that was already on the market? And how many of those were using bitplanes?

Scali wrote:
Here's an example

A demo. Seriously? It's nice to look at, and possibly breaks the limits of what was thought possible, but how is it useful again?
I expected this answer though. That, or you bringing up the Doom engine that used mode X rather than mode 13h. And while Doom is actually a proper example, it also suffers from a very interesting issue - it doesn't scale well beyond a certain point with CPU speed. Why? Exactly because it uses a technique that is dependent on the VGA HW speed rather than just framebuffer memory.

I'm going to skip the rest of the demos - this is all modern stuff. The way I see it, it only took some clever people 20+ years to finally hack some software workarounds for the stupid HW of the '80s. And does it work on every VGA, or just some 100% compatible models? Probably the latter.
As for Amiga - there's no reason to debate a dead system that had a moment of glory exactly because IBM dropped the ball so bad with CGA, EGA and then made VGA complicated and expensive. That was my very point, in case you missed it.

Scali wrote:
I don't think you understand bitplanes then, because you always need to do a read-modify-write operation.

I don't think you understand how DMA works. No, you don't always R-M-W, not in case where pixels align to bytes. Granted, with 2-per-byte packing you still end up with a problematic case of odd columns, but that's one special case, not 8 plus plane switching.

Scali wrote:
Lolwat? You talk like you are an expert on graphics, but then you make all these weird, misguided statements.

No, I'm not an expert - and I have no idea how much DX11 code you have written in your life. But I have this nagging feeling I wrote considerably more. DX7, DX8, DX9 and DX11 code.

Scali wrote:
DX11 specifically introduced the concept of 'deferred context', which allows you to create command lists on separate worker threads.
Here's a presentation from NV on the subject

That's the official line, yes. And I've bolded the important part of your own sentence for you, read it again. API-wise, DX11 is the same thing as DX10 except MS un-broke the reference counters in D3D objects - no doubt because the game devs working with 9c looked at the specs and said "Yeah, no." But what do I know.

Scali wrote:
TL;DR: you're talking BS.

Yes, I too think we are done here.

Reply 23 of 127, by Scali

Posted on 2019-05-31, 13:37

Deunan wrote:
I'm sorry, did I strike a nerve there? It was the Amiga comment, wasn't it?

No, it's your typical fanboy thought-pattern. Apparently you're convinced that VGA is the One True Way(tm) for graphics, and anything else has to be inferior. You merely try to collect 'proof' to support your notion, rather than performing any actual critical thinking.
Likewise, you're convinced that AMD is totally awesome, and any shortcomings can't be because of AMD's drivers or hardware, but have to be the fault of OpenGL, Microsoft or whatever.

Deunan wrote:
Early computing tried all sorts of things (VAX instruction set says hello), only to finally settle with solutions that actually work.

This exact statement shows the shortcomings in your thinking.
You assume there is only One True Way(tm), and all older systems must have been designed by retards, and, in your words "didn't actually work".
Reality says hello: hardware and software are not designed in a vacuum. They are designed within the limits of the technology of that era.
How CPUs and GPUs work today would fail completely in the 70s and 80s, because various assumptions made by current hardware designs simply didn't hold back in those days.

Which is exactly the point I'm trying to make with bitplanes: they are a very elegant and efficient solution under the right circumstances. The Amiga games I've shown, prove this beyond a shadow of a doubt: you can get very sophisticated and colourful games with lots of things moving on the screen and scrolling at the full 50 Hz. Something that VGA couldn't do until the early 90s, with much more advanced hardware and bruteforce.

Sure, once you have the brute force, it starts to make sense to do things the VGA-way. Problem is, VGA was designed in 1987, when this bruteforce was not yet available. In fact, bitplanes were the state of the art.
So if you want to argue that VGA is the One True Way(tm), you are completely ignoring a huge catalog of games on Amiga and similar systems, that simply weren't possible on VGA at that time (and what you call 'moment of glory' is a timespan from 1985 to about 1992-1993, and VGA was introduced slap-bang in the middle of that period, and failed to shine at any point).

Deunan wrote:
I'm aware bitplanes were used in other systems - but perhaps because the chips were already there?

Again, making assumptions, trying to find 'proof' that your ideas are the One True Way(tm).
This is pointless. You can only think in black and white, and do not see the shades of gray.

Deunan wrote:
I'm going to skip the rest of the demos - this is all modern stuff. The way I see it, it only took some clever people 20+ years to finally hack some software workarounds for the stupid HW of the '80s.

More assumptions. These were all demos from the early 90s, written for 286 or low-end 386 with early ISA VGA clones.

Deunan wrote:
As for Amiga - there's no reason to debate a dead system that had a moment of glory exactly because IBM dropped the ball so bad with CGA, EGA and then made VGA complicated and expensive. That was my very point, in case you missed it.

I thought your point was that VGA was awesome and the One True Way(tm) to do graphics.
Which it isn't.

Deunan wrote:
I don't think you understand how DMA works.

What does DMA have to do with anything?

Deunan wrote:
No, you don't always R-M-W, not in case where pixels align to bytes.

We were talking about bitplanes, weren't we? As in bitplanes the way EGA and VGA use them?
In which case, you get 8 pixels packed into a byte. You always need to do a read-modify-write, unless you want to overwrite all 8 pixels at the same time.
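
The difference between the two cases can be sketched in a few lines of Python operating on one row of a single simulated bitplane. The function names `set_pixel` and `set_eight_pixels` are made up for this illustration; on real EGA/VGA the read fills the card's latches and the merge is done by the hardware, which this sketch does not model.

```python
def set_pixel(plane_row, x, value):
    """Change one pixel in a 1-bit-per-pixel plane: read, modify, write."""
    offset, bit = divmod(x, 8)
    mask = 0x80 >> bit                  # leftmost pixel = most significant bit
    old = plane_row[offset]             # read the byte holding 8 pixels
    if value:
        new = old | mask                # modify: set just our bit
    else:
        new = old & ~mask & 0xFF        # modify: clear just our bit
    plane_row[offset] = new             # write the byte back

def set_eight_pixels(plane_row, x, pattern):
    """Overwrite 8 byte-aligned pixels at once: a plain write, no read needed."""
    assert x % 8 == 0
    plane_row[x // 8] = pattern

row = bytearray([0b11111111, 0b00000000])
set_pixel(row, 3, 0)                    # clears one pixel, neighbours survive
set_eight_pixels(row, 8, 0b10100101)    # replaces a whole byte of pixels
print(f"{row[0]:08b} {row[1]:08b}")
```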

Deunan wrote:
No, I'm not an expert - and I have no idea how much DX11 code you have written in your life. But I have this nagging feeling I wrote considerably more. DX7, DX8, DX9 and DX11 code.

Well you aren't short on arrogance, that's for sure.
In my case, there's actually somewhat of a track record on my blog and my demoscene releases. Doesn't cover my professional career of course, but at least it's something.
And what exactly have you done?

Deunan wrote:
That's the official line, yes. And I've bolded the important part of your own sentence for you, read it again.

More fanboy knee-jerk.
I've also added an Intel presentation, says basically the same, and does interesting analysis on both NV and AMD implementations.

Deunan wrote:
API-wise, DX11 is the same thing as DX10 except MS un-broke the reference counters in D3D objects

Say what now?
Deferred contexts didn't exist in DX10. They are the primary way to distribute workload over multiple threads. That's a pretty substantial difference if you ask me.
Again, look at the Intel presentation above, it goes into some detail on how that works under the hood. It also shows how AMD's driver basically blocks all worker threads while the main driver thread is running. NV's implementation does not.

Deunan wrote:
But what do I know.

Not a lot, apparently.

Last edited by Scali on 2019-05-31, 13:55. Edited 2 times in total.

Reply 24 of 127, by rasz_pl

Posted on 2019-05-31, 13:41

Rank: l33t
Posts: 3508
Joined: 2017-06-04, 00:57

well, actually! you are both wrong ;-]
Deunan: Planar makes perfect sense on slow cpu where every byte written matters. Drawing 1-bit image(sprite) into one bitplane can be up to 8 times faster than 13h.
Scali: unchained 13h hurt Doom performance (unless selecting low detail, afaik doom draws 2 bitplanes at a time in low), its only advantage was V-sync with no tearing (double buffering).
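
The "up to 8 times" figure for a 1-bit sprite is simple arithmetic, sketched here in Python. The 32x32 sprite size and the function name `bytes_written` are arbitrary choices for the example, and the count assumes an opaque, byte-aligned sprite so that no read-modify-write or masking is needed.

```python
def bytes_written(width, height, bits_per_pixel_touched):
    """Bytes the CPU must write for an opaque, byte-aligned rectangle."""
    return width * height * bits_per_pixel_touched // 8

sprite_w, sprite_h = 32, 32              # hypothetical sprite size

one_plane = bytes_written(sprite_w, sprite_h, 1)    # 1 bit per pixel, one bitplane
mode_13h = bytes_written(sprite_w, sprite_h, 8)     # 1 byte per pixel, chunky

print(one_plane, mode_13h, mode_13h // one_plane)
```

A sprite that needs all four planes of a 16-colour mode would cut the advantage to 2x, which is why the saving depends on how many planes a given object really has to touch.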

Open Source AT&T Globalyst/NCR/FIC 486-GAC-2 proprietary Cache Module reproduction

Reply 25 of 127, by Scali

Posted on 2019-05-31, 13:48

rasz_pl wrote:
Scali: unchained 13h hurt Doom performance (unless selecting low detail, afaik doom draws 2 bitplanes at a time in low), its only advantage was V-sync with no tearing (double buffering).

Funny how people try to explain this sort of stuff to me.
Yes, obviously I know that there is some overhead to switching bitplanes. How much overhead this is, depends on the exact implementation.
Have you thought your second sentence through though? I don't think so.
You seem to think of tearing in the modern sense. Problem is, modern tearing is already double-buffered. It is 'acceptable' because you only see a horizontal divide when not vsynced.
In mode 13h however, you do not HAVE a second buffer. This means that you would draw directly on the frontbuffer. Since you cannot draw the 3d scenes in a nice left-to-right, top-to-bottom order, you would get very nasty random flickering.
With unchained mode, you can draw in an offscreen buffer and then perform a flip (basically the same as modern 3D hardware). That flip is v-synced, because that is just how VGA was designed: its registers are latched, and changing the screen offset will not take effect until the next frame starts. It wasn't a specific goal for the ID developers.
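
As a toy model of that flip (not real register programming), the Python sketch below treats the start address as a value that is written at any time but only picked up at the start of the next frame. The class and method names are invented for this example; the page size follows from 320x200 pixels with 4 pixels per address in unchained mode.

```python
PLANE_SIZE = 65536                 # addresses per plane (64 KB x 4 planes = 256 KB)
PAGE_SIZE = 320 * 200 // 4         # 16000 addresses: each address holds 4 pixels

class UnchainedDisplay:
    """Toy model: the start address is latched and only applied at frame start."""

    def __init__(self):
        self.displayed_start = 0   # page currently being scanned out
        self.pending_start = 0     # value last written to the start address

    def set_start_address(self, address):
        self.pending_start = address        # no visible change yet

    def start_of_frame(self):
        self.displayed_start = self.pending_start

display = UnchainedDisplay()
pages = [n * PAGE_SIZE for n in range(PLANE_SIZE // PAGE_SIZE)]
front, back = pages[0], pages[1]

# ... draw the next frame into `back`, in any order, while `front` is shown ...
display.set_start_address(back)
shown_before = display.displayed_start      # still the old page mid-frame
display.start_of_frame()
shown_after = display.displayed_start       # the flip happened at frame start
front, back = back, front

print(len(pages), shown_before, shown_after)
```

The same arithmetic shows why mode 13h has no room for this: one 320x200 page at a byte per pixel is 64000 bytes, which nearly fills the single 64 KB window that mode exposes.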

However, the alternative in mode 13h would be to draw the whole frame in a system-memory buffer, and then copy it to vram (so that the copy is left-to-right, top-to-bottom, and you can race the beam). How this balances out, depends on the speed of your system memory, vram, and bus.
On your average 386/486, it generally was faster to do it the way DOOM did (so ID and I weren't wrong). On later machines, system memory and buses were faster, so the balance changed. Which brings us back to the observation that there is not One True Way(tm) to do something, and hardware and software do not operate in a vacuum. The Wolf3D/DOOM approach made the most sense at the time. For Quake, the rules had changed, so they moved to mode 13h and a sysmem backbuffer (and fancy SVGA with linear framebuffers).

Last edited by Scali on 2019-05-31, 14:06. Edited 1 time in total.

Reply 26 of 127, by rasz_pl

Posted on 2019-05-31, 14:06

Scali wrote:
rasz_pl wrote:
Scali: unchained 13h hurt Doom performance (unless selecting low detail, afaik doom draws 2 bitplanes at a time in low), its only advantage was V-sync with no tearing (double buffering).

Funny how people try to explain this sort of stuff to me.
Yes, obviously I know that there is some overhead to switching bitplanes.

the point is Doom didn't use planar for the awesome bitplanes, it used planar because Carmack didn't like tearing. Hardly a pro planar addressing argument.

Reply 27 of 127, by Scali

Posted on 2019-05-31, 14:08

rasz_pl wrote:
the point is Doom didn't use planar for the awesome bitplanes, it used planar because Carmack didn't like tearing. Hardly a pro planar addressing argument.

It wasn't a pro-planar argument.
It was an argument that apparently VGA had various shortcomings, requiring workarounds that gave up the One True Way(tm) of the chunky framebuffer, even in early textured 3D games, which VGA was supposed to be well-suited to.

Also, I still think you misunderstand the whole tearing-thing. Please re-read my above post. Not having a double-buffered solution would basically make the game unplayable. There were two ways to make it double-buffered, and we all know which mode he chose. We also know this wasn't by accident, because it required a lot of complex and clever coding, where the alternative is trivial to implement.

Reply 28 of 127, by reenigne

Posted on 2019-05-31, 18:50

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

There are reasons why Wolfenstein 3D used unchained mode other than allowing the use of all 256kB of VRAM in order to have multiple pages and do triple buffering. The game also used the trick of copying 4 pixels at once with a single "movsb" instruction by using the VGA's internal latches (for the 2D status area in the main game). Also, for the 3D drawing, when the same pixels are drawn in 2 or more adjacent columns, this is optimised by setting the mask register to draw multiple planes at once, as in this mode the planes correspond to the low two bits of the x position rather than colour bits. The game's designers understood their target hardware very well and pushed it to its limits.
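
The multi-plane write can be illustrated with a small Python simulation of unchained 320x200 addressing. This is a sketch under stated assumptions, not code from the game: the function names are invented, and the plane mask that real hardware takes from the Sequencer's Map Mask register is modelled here as a plain function argument.

```python
PLANES, PLANE_SIZE, ROW_BYTES = 4, 65536, 80   # 320 pixels / 4 planes = 80 addresses per row

vram = [bytearray(PLANE_SIZE) for _ in range(PLANES)]

def address_of(x, y):
    """Unchained layout: the low two bits of x pick the plane."""
    return y * ROW_BYTES + (x >> 2), x & 3

def masked_write(offset, map_mask, colour):
    # One CPU write; the mask decides which planes accept the byte.
    for p in range(PLANES):
        if map_mask & (1 << p):
            vram[p][offset] = colour

def read_pixel(x, y):
    offset, plane = address_of(x, y)
    return vram[plane][offset]

# Columns 8, 9 and 10 on row 0 share one address and sit in planes 0, 1 and 2,
# so a single write with mask 0b0111 colours all three of them.
offset, _ = address_of(8, 0)
masked_write(offset, 0b0111, 42)

print([read_pixel(x, 0) for x in range(8, 12)])
```

For a wall column that is stretched across several adjacent screen columns, this turns up to four byte writes per row into one, at the cost of the I/O needed to change the mask.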

I haven't looked at the code for DOOM, but I expect that uses the same tricks.

Reply 29 of 127, by Scali

Posted on 2019-05-31, 18:55

reenigne wrote:
I haven't looked at the code for DOOM, but I expect that uses the same tricks.

Yes, I would expect it uses basically the same code for the walls and sprites (just higher resolution, better accuracy, and added lightmaps).
The biggest difference would be in the floors and ceilings, which would theoretically be a poor fit for unchained mode. But as I recall, they used some clever tricks with precalculated mapping tables, so they probably do not render the pixels in scanline order, but in some kind of order that minimizes the amount of plane switches.

Reply 30 of 127, by Deunan

Posted on 2019-05-31, 19:04

rasz_pl wrote:
Deunan: Planar makes perfect sense on slow cpu where every byte written matters. Drawing 1-bit image(sprite) into one bitplane can be up to 8 times faster than 13h.

True, but then again if you are drawing sprites to VRAM you are doing it wrong. First, this only applies to simple 1-bit stuff. Second, a proper HW sprite engine would work on a scanline basis and mix the colors in the ALUs, while fetching pixels to draw. No need for bitplanes. In fact, if you consider that the pixel data could first go through a LUT to provide a palette color, then this whole "touch just 1 plane" system doesn't have any merit.

My point is, using CPU to do the drawing is a waste of precious clock cycles. Demos can because there isn't anything else to do (other than maybe play some tune). PC had DMA engine and that should've been the preferred method of writing to VRAM if you need it to be fast. You could always fall back to CPU if you didn't care for speed and wanted to keep things simple.

It's like IBM figured that solutions from 8-bit computers, like C64 or ZX, are just fine on a "pro" machine that has separate VRAM and slow bus between it and the CPU.

reenigne wrote:
The game also used the trick of copying 4 pixels at once with a single "movsb" instruction

That is a nice trick but it only works if the pixels are the same color. And, frankly, it still wasn't fast enough for a typical 12MHz 286, especially considering that some VGA cards were just slow. Plus what you save on VRAM transfers gets diluted a bit by all the I/O you have to do to set the HW write masks. But OK, point taken, it is a valid argument - especially considering that ISA is kinda slow so this might still be faster than DMA writing all the bytes. But I do wonder if that'd be the case had the video cards been actually optimised for DMA from the start.

Reply 31 of 127, by Scali

Posted on 2019-05-31, 19:14

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Deunan wrote:
True, but then again if you are drawing sprites to VRAM you are doing it wrong.

Well, how would you propose to do it on EGA/VGA?

Also, ironically enough, the Amiga did have hardware sprites, but most games used so-called Blitter Objects (bobs) instead.
Basically using the blitter to save and restore the background, and drawing sprites on top.
Because the blitter was so fast, you could use more colours/larger sprites than with the hardware solution.

Deunan wrote:
My point is, using CPU to do the drawing is a waste of precious clock cycles. Demos can because there isn't anything else to do (other than maybe play some tune).

What does that have to do with anything? Both demos and games try to go for the most efficient way possible.
Depending on the hardware and the goal, that may or may not be the CPU.

Deunan wrote:
PC had DMA engine

Lolwat?
Well, that pretty much proves you have no idea what you're talking about.
The DMA controller on the PC dates from the 70s, and is very limited. It is also very slow, because it was never upgraded from the official spec.
In fact, in the original PC and XT, it ran at a 4.77 MHz clock. When the AT came around with a 6 MHz 286, they had to run it at half clock, so 3 MHz, because the DMA controller could not run at 6 MHz. Even with the later 8 MHz AT, it still ran at only 4 MHz, so still slower than the original PC/XT.
The CPU was way faster.

Aside from that, the DMA controller was designed for a 16-bit address space, and has to use some hacky paging mode to be able to access the full 1 MB address space.
Because of these limitations it is virtually impossible to use the DMA controller efficiently for any kind of VRAM access.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 32 of 127, by reenigne

Posted on 2019-05-31, 20:36

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Deunan wrote:
reenigne wrote:
The game also used the trick of copying 4 pixels at once with a single "movsb" instruction

That is a nice trick but it only works if the pixels are the same color.

That's not true. The VGA card has 4 latches, one byte's worth for each plane. With the proper read and write modes set, the read cycle of the "movsb" reads 4 bytes into the latches and the write cycle writes them back out of the latches to the new locations.
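
To make that concrete, here is a toy Python model of the latch behaviour. It is only a sketch: the class `ToyVGA` and its method names are invented for illustration, and it assumes write mode 1 with all four planes enabled, which is the setup usually used for this trick.

```python
# Toy model of VGA latch copying (a sketch, not real hardware access).
# Assumption: write mode 1, all four planes enabled in the map mask.

PLANES = 4


class ToyVGA:
    def __init__(self, size):
        # One byte array per plane; a CPU address selects the same offset in all four.
        self.planes = [bytearray(size) for _ in range(PLANES)]
        self.latches = [0] * PLANES

    def cpu_read(self, addr):
        # A single CPU read cycle loads all four latches at once.
        self.latches = [p[addr] for p in self.planes]

    def cpu_write_mode1(self, addr):
        # In write mode 1 the CPU data byte is ignored; the latches are written out.
        for p, latch in zip(self.planes, self.latches):
            p[addr] = latch

    def movsb(self, src, dst):
        # One "movsb" is one read cycle followed by one write cycle.
        self.cpu_read(src)
        self.cpu_write_mode1(dst)


vga = ToyVGA(16)
# In unchained 256-colour mode the four planes at one address hold four adjacent pixels.
for plane, colour in enumerate([10, 20, 30, 40]):
    vga.planes[plane][0] = colour

vga.movsb(src=0, dst=5)
copied = [p[5] for p in vga.planes]
print(copied)
```

Four different pixel values arrive at the destination after a single byte-sized move, so the pixels do not have to share a colour.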

Deunan wrote:
Plus what you save on VRAM transfers gets diluted a bit by all the I/O you have to do to set the HW write masks.

If you transfer a significant amount of data between mask sets (e.g. a column's worth, or 1/4 of a sprite for a 2D blit from system RAM) the mask sets end up not taking a significant amount of time. It does make the code more complicated than a simple linear framebuffer would be, but not unfeasibly so.
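
A rough way to see the amortization is to count mask changes for two drawing orders. The sketch below assumes unchained mode with pixel x living in plane x % 4, and that a mask set is needed whenever the plane changes; `mask_sets` is a made-up helper, not anything from real code.

```python
def mask_sets(width, height, plane_major):
    """Count map-mask changes needed to draw a width x height block.

    Assumption: pixel x lives in plane x % 4, and the mask must be
    reprogrammed whenever the plane differs from the current one.
    """
    if plane_major:
        # Draw everything that belongs to plane 0, then plane 1, and so on.
        order = [(x, y)
                 for plane in range(4)
                 for y in range(height)
                 for x in range(plane, width, 4)]
    else:
        # Naive scanline order: left to right, top to bottom.
        order = [(x, y) for y in range(height) for x in range(width)]

    sets = 0
    current = None
    for x, _y in order:
        plane = x % 4
        if plane != current:
            sets += 1
            current = plane
    return sets


scanline = mask_sets(32, 32, plane_major=False)
by_plane = mask_sets(32, 32, plane_major=True)
print(scanline, by_plane)
```

For a 32x32 block the scanline order changes the mask on every pixel, while the plane-by-plane order needs only four changes in total.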

Deunan wrote:
But OK, point taken, it is a valid argument - especially considering that ISA is kinda slow so this might still be faster than DMA writing all the bytes. But I do wonder if that'd be the case had the video cards been actually optimised for DMA from the start.

Sure, and VGA implementing its own VRAM-to-VRAM DMA could potentially have been quicker still. But I think that chipset was already pushing the limits of complexity that were possible with technology of the time.

Reply 33 of 127, by Deunan

Posted on 2019-05-31, 21:26

Deunan Offline

Rank Oldbie

Rank: Oldbie
Posts: 1575
Joined: 2018-05-29, 12:32

reenigne wrote:
That's not true. The VGA card has 4 latches, one byte's worth for each plane. With the proper read and write modes set, the read cycle of the "movsb" reads 4 bytes into the latches and the write cycle writes them back out of the latches to the new locations.

Sorry, seems I misunderstood. Not "same color" then but an "identical 4-byte pattern". However, the original question was how is bitplane mapping superior to anything else. And this method might be easier to pull off with bitplanes but there's nothing preventing a video card from having internal data latches and linear memory model, right? Same technique - you populate the latches with data, then just trigger a write with one byte transfer. Because the performance win here is less data being transferred to the card, not how the memory is mapped.

reenigne wrote:
If you transfer a significant amount of data between mask sets (e.g. a column's worth, or 1/4 of a sprite for a 2D blit from system RAM) the mask sets end up not taking a significant amount of time. It does make the code more complicated than a simple linear framebuffer would be, but not unfeasibly so.

OK, but this is what we ended up with because the hardware was designed that way. And it might be the fastest method for the crap HW we got, sure, but my claim was that it could've been better.

reenigne wrote:
Sure, and VGA implementing its own VRAM-to-VRAM DMA could potentially have been quicker still. But I think that chipset was already pushing the limits of complexity that were possible with technology of the time.

My beef with VGA starts with CGA. I'm not asking for CGA to become SVGA on the first try but perhaps it could've been less stupid. And then the cards following it wouldn't have to emulate the stupid behaviour. By the time we had VGA pretty much all PCs were AT and ISA slots were 16-bit - some bus mastering was already possible. Sure it was a complicated chip but perhaps if we had the ability to tell the card to pull data from RAM, directly, and maybe even convert the format on the fly (or we could have simple RLE data packing) - it would be faster than having CPU work with bitplanes, latches, etc. So it would trade the complexity from one function to another, not add.

I mean, that's my whole point here. Not how fast what we have is, and how to extract that, but how fast it could've been had IBM actually thought about it some more. They got the text mode and attributes right, just one more step was required for the graphics modes. I know it was a limitation of the chip used, but then why not make a different chip. All the 8-bit computers did.

Reply 34 of 127, by Scali

Posted on 2019-05-31, 22:00

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Deunan wrote:
I mean, that's my whole point here. Not how fast what we have is, and how to extract that, but how fast it could've been had IBM actually thought about it some more. They got the text mode and attributes right, just one more step was required for the graphics modes. I know it was a limitation of the chip used, but then why not make a different chip. All the 8-bit computers did.

I think the simple answer is that IBM simply didn't know how.
They didn't have the expertise of an Atari, Sega, Nintendo or Commodore on building clever video circuits with hardware sprites, raster interrupts and all that jazz.

Especially the original PC ('Project Chess') was developed by a very small team in a very short timespan. So that would explain why MDA and CGA are as crap as they are. Probably designed by engineers who didn't know a whole lot about graphics, but could get something on screen with off-the-shelf parts on short notice.

EGA was already a huge step forward (VGA is actually not that far from EGA, all the clever stuff was already developed for the EGA ALU). More time, more budget, more expertise. But still not exactly state-of-the-art, because IBM simply wasn't into graphics, games and such. You can clearly tell that the EGA designers didn't really know how to handle scrolling, and they hadn't made any attempt at sprites, blitting, or linedrawing either (you could say the same about Apple: big name, big budget, but their graphics chips were nothing to write home about).

I guess the question is: what would the PC platform have looked like if you had IBM's budget and the expertise of Atari, Sega, Nintendo, Commodore etc?

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 35 of 127, by reenigne

Posted on 2019-05-31, 22:11

reenigne Offline

Rank Oldbie

Rank: Oldbie
Posts: 610
Joined: 2006-11-30, 05:13
Location: Cornwall, UK

Deunan wrote:
reenigne wrote:
That's not true. The VGA card has 4 latches, one byte's worth for each plane. With the proper read and write modes set, the read cycle of the "movsb" reads 4 bytes into the latches and the write cycle writes them back out of the latches to the new locations.

Sorry, seems I misunderstood. Not "same color" then but an "identical 4-byte pattern".

Well, you'd get a repeating 4-byte pattern with "rep stosb" or "rep stosw". The latches combined with "rep movsb" gives you a speedup approaching a factor of 4 in copying blocks of data from one area of VRAM to another.
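
As a back-of-the-envelope illustration of why the factor only approaches 4, the sketch below counts bus accesses. It assumes every VRAM read or write costs the same, and `setup_cycles` is a hypothetical figure standing in for the I/O needed to program the write mode beforehand.

```python
def latch_copy_speedup(pixels, setup_cycles):
    # Plain copy: one read and one write per pixel.
    plain = 2 * pixels
    # Latch copy: one read and one write per group of 4 pixels, plus fixed setup.
    latched = 2 * (pixels // 4) + setup_cycles
    return plain / latched


for n in (16, 320, 64000):
    print(n, round(latch_copy_speedup(n, setup_cycles=8), 2))
```

The larger the block, the less the fixed setup cost matters and the closer the ratio gets to 4.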

Deunan wrote:
However, the original question was how is bitplane mapping superior to anything else. And this method might be easier to pull off with bitplanes but there's nothing preventing a video card from having internal data latches and linear memory model, right? Same technique - you populate the latches with data, then just trigger a write with one byte transfer. Because the performance win here is less data being transferred to the card, not how the memory is mapped.

I guess we were coming at it from different angles: how to best utilize the VGA hardware vs how could the VGA hardware be improved if it was redesigned.

Deunan wrote:
OK, but this is what we ended up with because the hardware was designed that way. And it might be the fastest method for the crap HW we got, sure, but my claim was that it could've been better.

Possibly. But I've always been very impressed with the VGA - it was extremely powerful and flexible for the time. And had the constraints of being software compatible with earlier video standards (which it did in a very general way). I'm curious about what you would change if you were leading the group that designed the VGA at IBM (and about whether that would have been feasible at the time).

Deunan wrote:
My beef with VGA starts with CGA. I'm not asking for CGA to become SVGA on the first try but perhaps it could've been less stupid. And then the cards following it wouldn't have to emulate the stupid behaviour. By the time we had VGA pretty much all PCs were AT and ISA slots were 16-bit - some bus mastering was already possible. Sure it was a complicated chip but perhaps if we had the ability to tell the card to pull data from RAM, directly, and maybe even convert the format on the fly (or we could have simple RLE data packing) - it would be faster than having CPU work with bitplanes, latches, etc. So it would trade the complexity from one function to another, not add.

Ability to use it on machines with an 8-bit bus was one of the VGA's design goals (I have an original IBM VGA card that I use with my XT). Another problem with giving the VGA card DMA capabilities might be that it could saturate the bus, starving the CPU of RAM access and stalling it, making it not significantly faster than CPU-controlled transfer (especially once you take into account setting up the DMA transfer and initializing the DMA controller). That is not so much of a problem with other DMA devices like floppy drives and soundcards.

Deunan wrote:
I mean, that's my whole point here. Not how fast what we have is, and how to extract that, but how fast it could've been had IBM actually thought about it some more. They got the text mode and attributes right, just one more step was required for the graphics modes. I know it was a limitation of the chip used, but then why not make a different chip. All the 8-bit computers did.

CGA had its own different set of constraints, and was a very competitive card despite them! The main one being that it was done on a very short timescale and limited up-front budget, meaning that they had to use off the shelf ICs (CRTC + RAM + character ROM + discrete logic) rather than creating a custom chip as was done for most of the contemporary micros. But the card they produced was sufficiently rich that we're still discovering new tricks 38 years later. Not to mention that the distinctiveness of its palette can trigger powerful nostalgia in people of a certain age!

Reply 36 of 127, by Scali

Posted on 2019-05-31, 22:22

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

reenigne wrote:
Another problem with giving the VGA card DMA capabilities might be that it could saturate the bus, starving the CPU of RAM access and stalling it, making it not significantly faster than CPU-controlled transfer (especially once you take into account setting up the DMA transfer and initializing the DMA controller).

This is an actual problem on the Amiga by the way. The blitter can fully saturate the bus, thereby pushing the CPU off.
The designers solved it in two ways:
1) The default mode of operation of the blitter is to release the bus every Nth cycle (I believe every 8th, off the top of my head), to prevent the blitter from completely starving the bus. You can set a bit known as 'blitter nasty', which doesn't release the bus, for better blitter performance, at the cost of starving the CPU completely.
Other chips have their own 'DMA channels' allocated into the architecture of the Amiga. Everything is synchronized to a scanline, so each device has a certain DMA 'slot' at a certain part of the scanline. They cannot be pushed off by the blitter (the C64's VIC-II chip has a similar concept, where each of the 8 hardware sprites has its own unique slot where a fetch of the sprite data can be done, if the sprite is enabled, which pushes the CPU off the bus).
Here is the picture from the Amiga Hardware Reference Manual that documents this scheme: http://amigadev.elowar.com/read/ADCD_2.1/Hard … e/node02D4.html
(and as you can also see: the more bitplanes you use, the slower your CPU gets).

Edit: Here is the description of the DMA allocation, and also the blitter-nasty bit: http://amigadev.elowar.com/read/ADCD_2.1/Hard … e/node012B.html
Apparently the blitter releases the bus when the CPU has tried to access the bus for 3 consecutive cycles.

2) The default pool of memory on an Amiga is known as 'chipmem'. This is memory that can be used by all custom chips, as well as the CPU. You can add memory expansions that are exclusive to the CPU, known as 'fastmem'. As long as you keep your CPU in fastmem, it won't get stalled by the blitter or any of the other custom chips. It's basically the inverse concept of the PC, where system memory is for the CPU only, and video has its own dedicated VRAM.

I can't help but see the Amiga as a thing of beauty. Everything is so well-organized, so well-thought out, so systematic.
And it is super-flexible. There aren't really any 'video modes'. You can freely program any resolution you like, use any amount of bitplanes you want, etc. The really freaky part is that video modes 'do not stick'. That will throw you off at first.
At the basis of a CRT controller is usually a set of counter registers, which get re-initialized every frame. On Amiga, they do not. You have to do this yourself. Good thing it has the copper, which automatically runs a copperlist program at the start of each frame 😀 You set some of the screenmode constants in there, and then the copper will make sure they get sent to the chipset at every frame. If you don't, then your start addresses will just 'scroll through memory', and you get garbled junk on your screen 😀
The upside is that you can reprogram your 'videomode' any time you like. You can turn bitplanes on or off whenever you want, even switch between interlaced and regular display, or hires and lores, whenever you feel like it.
You get full freedom to use the hardware however you like it. As far as I know, there is nothing quite like the Amiga.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 37 of 127, by Deunan

Posted on 2019-06-01, 10:21

Deunan Offline

Rank Oldbie

Rank: Oldbie
Posts: 1575
Joined: 2018-05-29, 12:32

reenigne wrote:
I'm curious about what you would change if you were leading the group that designed the VGA at IBM (and about whether that would have been feasible at the time).

Well, I haven't thought about it that much. It was more of a what-if question, but I have 2 things in my head:

For "VGA": the "smart DMA" that would make the card pull and convert or interpret the data on-the-fly rather than just dumb RAM to VRAM copy would be nice. That would allow you to prepare a pixel group in system RAM, without having to worry about packing - just a pixel value per byte or word, and then let the card handle it.
I mean, I can't really take credit for this idea - that's exactly what we ended up with on some later GPUs for, say, YUV to RGB conversion. And DMA, or bus mastering in general, has been with us from the PCI era until today, so clearly that was the way to go. The way I see it, we didn't need that clunky bitplane stuff to make the early CPU go fast, it could've been done the right way from the start. And if that took any changes to the IBM PC to make it happen? So be it. We did move from XT to AT, increased the bus size, IRQ/DMA channel count, then we had PCI - it was all about incremental changes for the better.
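
To show what "convert on the fly" could mean, here is a small Python sketch of the conversion such a card would have to do internally: taking 8 pixels stored one per byte and packing them into four bitplane bytes. This is purely hypothetical (no such card existed as far as this thread goes); the function name and the bit order (leftmost pixel in bit 7, as in the EGA/VGA planar modes) are assumptions for the example.

```python
def chunky_to_planar(pixels):
    """Convert 8 pixels (one 4-bit colour per byte) into 4 bitplane bytes.

    Assumption: the leftmost pixel ends up in bit 7 of each plane byte.
    """
    if len(pixels) != 8:
        raise ValueError("expected exactly 8 pixels")
    planes = [0, 0, 0, 0]
    for i, colour in enumerate(pixels):
        for plane in range(4):
            bit = (colour >> plane) & 1      # bit 'plane' of this pixel's colour
            planes[plane] |= bit << (7 - i)  # place it at the pixel's position
    return planes


row = [15, 0, 1, 2, 4, 8, 5, 10]
print([f"{b:08b}" for b in chunky_to_planar(row)])
```

The program would only ever touch `row`; the packing into planes is the part the card would do while pulling the data.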

Now, a DMA can saturate the bus, true, but if that is happening then you are trying to push too much data through it. How is that different from CPU not being fast enough to push all that by itself in the same amount of time? If it turned out that CPU writes are faster than PC DMA, then this DMA could've simply been upgraded with another controller that is faster and could do microbursts, and pace itself rather than transfer everything in one go. Then you just need your code to work on a pixel group the size of the burst, while DMA is transferring the previous group. That would steal some cycles from you but not halt the CPU completely, so this works faster than having the CPU do everything.

For "CGA": I'd give it 32k RAM so that it could do 320x200 in 16 colors, and 640x200 in 4 colors. Simple 2 or 4 pixels per byte. Ideally it would use EGA-like 2-bit per pixel output but I don't want to sound like I'm just replacing CGA with EGA-lite. So let's stick to original RGBI. Then I would add palette LUT, or 2 in fact. Even and odd pixel columns would use LUT0 and LUT1, and those would be independent. This would also be swapped every row, like this (row (y), column (x)):

0,0: L0; 0,1: L1; 0,2: L0; 0,3: L1; ...
1,0: L1; 1,1: L0; 1,2: L1; 1,3: L0; ...

That's easy to do, a couple of XOR gates on the counters' lowest bits to drive the enable signal from the correct LUT to the output amps. Those LUTs would be small enough to use SRAM cells inside the GPU chip itself, but external SRAM also works. This way, not only are you not forced to use "blue or not blue" colors, but you could very cheaply do dithering - if nothing else, to be used while showing static images. But it should be fast enough to use in games and having palettes and HSYNC interrupt allows for all kinds of cool color increasing tricks - if you like demos.
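
In software terms the selection logic is a single XOR. The Python sketch below only models the idea; `lut_select` and `resolve` are illustrative names, and the two 16-entry tables are made-up values.

```python
def lut_select(x, y):
    # XOR of the lowest bits of the column and row counters picks the LUT.
    return (x ^ y) & 1


def resolve(pixel, x, y, lut0, lut1):
    # Look the pixel value up in whichever LUT is active at this position.
    return (lut1 if lut_select(x, y) else lut0)[pixel]


# Print the LUT pattern for the first two rows and four columns.
for y in range(2):
    print(" ".join(f"L{lut_select(x, y)}" for x in range(4)))

# Cheap dithering: the same pixel value maps to two different RGBI colours,
# so a solid fill comes out as a checkerboard of the two.
lut0 = list(range(16))   # hypothetical: identity mapping
lut1 = [15 - c for c in range(16)]  # hypothetical: inverted mapping
dithered = [[resolve(1, x, y, lut0, lut1) for x in range(4)] for y in range(2)]
print(dithered)
```

The first two printed lines reproduce the L0/L1 pattern from the table above.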

As silly as they might be, I like both of these ideas way more than bitplanes. Let's not forget that PC was meant to be a serious business machine. It's not like these would drive the costs to unacceptable levels.

reenigne wrote:
CGA had its own different set of constraints, and was a very competitive card despite them! The main one being that it was done on a very short timescale and limited up-front budget, meaning that they had to use off the shelf ICs (CRTC + RAM + character ROM + discrete logic) rather than creating a custom chip as was done for most of the contemporary micros. But the card they produced was sufficiently rich that we're still discovering new tricks 38 years later. Not to mention that the distinctiveness of its palette can trigger powerful nostalgia in people of a certain age!

Ha, I suppose I just don't see that "38 years later" as a positive thing. It should've been so easy to use that people could utilize 99% of its performance a year after it was released. I get the nostalgia, I truly do. I very vividly remember CGA colors. I just wouldn't use the word "fondly" is what I'm trying to say.

Last edited by Deunan on 2019-06-01, 10:31. Edited 1 time in total.

Reply 38 of 127, by Scali

Posted on 2019-06-01, 10:30

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Not really sure what your point is. Could the PC have been designed better? I think everyone agrees on that. Of all the machines from the 70s and 80s, the PC is by far the least spectacular (which is why it was so easy to clone). But it is what it is.
Sure, you could add clever DMA etc... make sort of an Amiga out of it (that is what you're suggesting basically, even though you're probably not aware). But what's the point? That's not what a PC is. And it's not what a PC will ever be.
You can look at various other systems to find their clever (or in some cases not-so-clever) solutions to the various problems. But then what? All these systems are what they are, and they will never change.

Some high-end harddisk controllers did add their own high-speed DMA controllers on the card by the way. With video cards, I don't think DMA on the card was done before the PCI-bus and bus-mastering were common. Best example of that is probably the PowerVR 3D accelerator: it renders its 3D scene into the VRAM of the host videocard.

I also think that the lowest-common-denominator is what killed most fancy video card options.
Various SVGA cards had their own bitblitter. But software only supported standard VGA, and never made use of these fancy features. Which meant that they were effectively only "Windows accelerators": the features were used by their custom Windows drivers. But not in games.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 39 of 127, by Scali

Posted on 2019-06-01, 13:20

Scali Offline

Rank l33t

Rank: l33t
Posts: 4873
Joined: 2014-12-13, 14:24

Scali wrote:
Sure, you could add clever DMA etc... make sort of an Amiga out of it (that is what you're suggesting basically, even though you're probably not aware). But what's the point? That's not what a PC is. And it's not what a PC will ever be.

In fact, here's an interesting computer for you:
https://en.wikipedia.org/wiki/Mindset_(computer)

It's x86-based and runs DOS, but it was designed by ex-Atari people (like the Amiga), and uses a clever graphics chip to relieve the CPU of drawing duties.

(Also, you are negative about programmers trying new tricks 30+ years after the hardware was designed, yet you are offering design ideas for this hardware 30+ years after. The irony is not lost on me)

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/
