VOGONS


First post, by 386SX

User metadata
Rank l33t

Hi,

I was reading some magazines from 1995, and everywhere you looked, just as I remember, the greatness of the original Pentium architecture was underlined. It makes me think of that time when the 486 was being pushed to the already incredible 100 MHz target (ok, more...), but the Pentium architecture, when released, seemed to win at everything. And with the release of Win95 and the first software-rendered "2D/simple 3D" SVGA games, could we consider this processor (let's say the Pentium 100) the biggest tech jump in CPU history, at least until the original Athlon? Or at least the greatest desktop hw/sw moment?
Bye

Reply 1 of 34, by Scali

User metadata
Rank l33t

Well, 'the greatest', I'm not sure of (386 was also a huge leap, and 286 was not bad either), but the Pentium was indeed a huge leap forward.
It mainly had three rather revolutionary changes compared to the 486:
1) It had a 64-bit data bus instead of 32-bit, so it could access memory twice as fast.
2) A superscalar architecture, meaning that it actually has two execution pipelines that can work in parallel, where earlier CPUs only had one pipeline. Best-case your code could literally run twice as fast as on a 486, because it pretty much was as if you had two 486 CPUs running side-by-side.
3) A much much much much much faster FPU.

This means that on well-optimized code, a Pentium 60 can indeed outperform a 486-100 with ease.

For me personally it is the nicest x86 to write assembly code for. It is the last in-order execution architecture, so the last x86 to execute code exactly the way you write it. There is a certain elegance to ordering your code to maximize the use of the U and V pipelines.

The Pentium Pro was also a big leap, but not in terms of performance... at least, at first.
The architecture allowed much higher clockspeed scaling though, which means that we went from the ~200 MHz limit of the Pentium to 1+ GHz in a record time (Pentium II and III used the same basic architecture as the PPro, with MMX and SSE added, and some advances in manufacturing).
The original Athlon isn't a big tech jump, it's pretty much a clone of the Pentium Pro architecture.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 2 of 34, by RacoonRider

User metadata
Rank Oldbie
Scali wrote:

The original Athlon isn't a big tech jump, it's pretty much a clone of the Pentium Pro architecture.

Could you provide more details proving the statement? Because they don't look similar to me:
k7-architecture.jpg
p3-architecture.jpg

Even if AMD made a clone, why then would intel struggle to outperform it?

Reply 4 of 34, by Scali

User metadata
Rank l33t
RacoonRider wrote:

Could you provide more details proving the statement?
...
Even if AMD made a clone, why then would intel struggle to outperform it?

I can't see the images, but basically the P3 and Athlon perform about the same at the same clock speed, with AMD having a slight advantage in FPU-heavy code, but nothing significant (5-10% best-case). Nowhere near the difference between a Pentium and a 486, at least.
Intel didn't exactly 'struggle'; it's just that the Athlon was released not long before Intel introduced the Pentium 4 (although they still showed what the PIII could really do with the Tualatin. That's the one I'm talking about, pretty much on par with the Athlon... I would think the benchmark is somewhat skewed because that Athlon already had the double-pumped FSB, while the P3 they tested was still stuck at 100 MHz). The Pentium Pro/II/III architecture was already a few years old by then, and AMD had only just caught up.

Just look at the micro-architectures of the Athlon and Pentium Pro, and you'll see they're very similar.
Both are out-of-order architectures, with a number of execution ports, and the capability of decoding up to 3 x86 instructions per clock. They both have a pipeline of about 10 stages.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 5 of 34, by alexanrs

User metadata
Rank l33t

I believe the Athlon's main advantage over the P3 was its double-transfer FSB; the Pentium III was already being bottlenecked by the lack of bandwidth on its own FSB. Case in point: the Pentium M, which is basically an updated P3 running on a P4-style bus, outperformed nearly everything per clock when put on a desktop motherboard.

Reply 6 of 34, by sunaiac

User metadata
Rank Oldbie

The pentium was just a normal jump at that time.

Of course from the year 2015, where an awesome jump has been 5% per generation since Sandy Bridge, it looks awesome.
But it was "just" an "up to twice as fast" 486.
Which was an "up to twice as fast" 386.
Which was an "up to twice as fast" 286 (running 32-bit code that would arrive when the 486 was mainstream 😁)
Which was an "up to twice as fast" ...

None of the technologies inside were revolutionary; they were either brought to the x86 world from outside, or were more of the same.
I'm not trying to diminish the technical achievement of course, which was great; I just mean it was not that different from the previous generational jumps.

The K6 --> K7 jump is also a great performance gain. Less on the ALU side, more on the FPU side.

Comparing the Athlon to the P6 is a bit tricky. Neither is "superior".
Basically, the Athlon is superior in compute power, but has often been limited by memory subsystem performance.
The 500-600 MHz Athlons were quite a bit faster (as in ~20%) than the Katmai (also limited by memory) clock for clock.
The Coppermine had a much better memory system than... well, anything else, and 550-700 MHz Coppermines and Athlons are about equal (thank you, 128 KB L1 cache).
The 800-1000 MHz Athlons were slower than Coppermines (but you could actually buy them).
The Thunderbird came back to Coppermine levels on SDR100.
And a DDR133 Thunderbird was the fastest. Coppermine/Tualatin probably wouldn't have gained much from DDR133 (see i820/i840 results...)

R9 3900X/X470 Taichi/32GB 3600CL15/5700XT AE/Marantz PM7005
i7 980X/R9 290X/X-Fi titanium | FX-57/X1950XTX/Audigy 2ZS
Athlon 1000T Slot A/GeForce 3/AWE64G | K5 PR 200/ET6000/AWE32
Ppro 200 1M/Voodoo 3 2000/AWE 32 | iDX4 100/S3 864 VLB/SB16

Reply 7 of 34, by Scali

User metadata
Rank l33t
sunaiac wrote:

The pentium was just a normal jump at that time.

We're talking 100+% performance at the same clock speed, not exactly a 'normal jump' (don't forget that 100+ MHz 486s did exist, but were released *after* the Pentium. When the Pentium 60/66 were released, the fastest 486 was still the DX2-66).
Possibly the largest jump in the history of the x86.

sunaiac wrote:

Which was a "up to twice as fast" 386.

Not at all, the 486 is only marginally faster than the 386 at the same clockspeed. Which is why the 386DX40 was such a popular option for gamers on a budget.

sunaiac wrote:

Which was a "up to twice as fast" 286 (running 32 bit code that would arrive when the 486 was mainstream 😁)

Nope, in fact, a 386SX is no faster at all than a 286, and in some cases even somewhat slower. A 386DX is somewhat faster than a 286 at the same clock speed, but certainly not 'up to twice as fast' (remember these systems were quite bottlenecked by ISA buses and slow memory).

sunaiac wrote:

The K6 --> K7 jump is also a great performance gain. Less on the ALU side, more on the FPU side.

Mostly because the K6 was an outdated underperforming architecture. Still Pentium-class technology, in the Pentium II/III-age.
The jump was large for AMD perhaps, but K6 was in no way competitive with Intel's offerings. So it wasn't a huge jump in the x86-world. AMD just closed the gap with Intel.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 8 of 34, by Trevize

User metadata
Rank Newbie
Scali wrote:

It mainly had three rather revolutionary changes compared to the 486:

It's rather four. The fourth is branch prediction, which means that most of the time the CPU can guess where a loop will jump. On a correct prediction, the instruction pipeline doesn't have to be emptied, which saves a lot of cycles. To avoid emptying the instruction pipeline on the 386 and 486, some loops can be unrolled, meaning their body is copied a couple of times after one another.
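As a sketch of what unrolling looks like (my own example, using a hypothetical sum loop, not anything from the thread):

```c
#include <assert.h>
#include <stddef.h>

/* Rolled version: one backward conditional branch per element --
 * on a CPU without branch prediction, every taken branch flushes
 * the prefetch queue. */
int sum_rolled(const int *p, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Unrolled by 4: the loop body is copied four times, so the
 * backward branch executes only a quarter as often. */
int sum_unrolled(const int *p, size_t n) {
    int s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += p[i];
        s += p[i + 1];
        s += p[i + 2];
        s += p[i + 3];
    }
    for (; i < n; i++)     /* remainder elements */
        s += p[i];
    return s;
}
```

The trade-off is code size: on a 486 the unrolled body has to stay small enough to fit the 8 KB cache, or the gain evaporates.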

Edit: typo fix.

Last edited by Trevize on 2015-09-10, 16:08. Edited 1 time in total.

Reply 9 of 34, by sunaiac

User metadata
Rank Oldbie
Scali wrote:

A demonstration of what double standard means

Yes, with well chosen comparisons you can prove anything, that's true.
Thing is, I didn't think that was the point here.

But, I can do it too :
A 386DX doing only memory copy operations will be twice as fast as a 286.
A Pentium running non-cache-aware, non-pairable, non-FPU code will be only marginally faster than a 486.
And hey, I can even find tests in which an 800 MHz Athlon on KX133 is nearly twice as fast as an 800 MHz PIII on BX100: http://www.anandtech.com/show/498/16

But I really wonder what my interest would be in doing that kind of thing.
So I'd rather tell the truth:
the P54 is up to twice as fast as the 486, which is also up to twice as fast as the 386, and each can be just slightly faster in well-chosen situations.

R9 3900X/X470 Taichi/32GB 3600CL15/5700XT AE/Marantz PM7005
i7 980X/R9 290X/X-Fi titanium | FX-57/X1950XTX/Audigy 2ZS
Athlon 1000T Slot A/GeForce 3/AWE64G | K5 PR 200/ET6000/AWE32
Ppro 200 1M/Voodoo 3 2000/AWE 32 | iDX4 100/S3 864 VLB/SB16

Reply 10 of 34, by Trevize

User metadata
Rank Newbie
Scali wrote:
sunaiac wrote:

Which was a "up to twice as fast" 386.

Not at all, the 486 is only marginally faster than the 386 at the same clockspeed. Which is why the 386DX40 was such a popular option for gamers on a budget.

Sunaiac is right here: the 486 is, clock for clock, twice as fast as a 386, for various reasons:
* It has 8 KB of on-die L1 cache: the models up to the DX2 have a write-through cache, while the DX4 models have 16 KB of write-back cache, which caches memory writes too
* It was the first step in the direction of RISC. There are lots of ASM operations that execute twice as fast as on a 386

For more information on the topic I'd recommend you check the Graphics Programming Black Book by Michael Abrash, which the author made freely available in 2001. I've read you like optimizing for the P54 architecture; you will *love* this book. 😀

Last edited by Trevize on 2015-09-10, 16:08. Edited 1 time in total.

Reply 11 of 34, by Trevize

User metadata
Rank Newbie
sunaiac wrote:

A 386 DX doing only memory copy operations will be twice as fast as a 286.

As long as the data you are copying is LONG-aligned (to a 4-byte boundary) and you are copying it in 32-bit pieces. If either of these fails, the speed drops, although I can't tell you exact numbers.
The same applies to the 286 vs. the 8086: the 286 can be slower (at the same clock speed) when the data you are accessing is not WORD-aligned (2-byte boundary).
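To illustrate the alignment point, here is a sketch of mine (the function `copy32` is my own invention): copy 32 bits at a time when both pointers are dword-aligned, and fall back to a byte copy otherwise, roughly the way misaligned accesses fall back to extra bus cycles on the real hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

void copy32(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    /* Fast path: both pointers on a 4-byte boundary, move dwords. */
    if (((uintptr_t)d % 4 == 0) && ((uintptr_t)s % 4 == 0)) {
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);      /* strict-aliasing-safe 32-bit load */
            memcpy(d, &w, 4);      /* ...and 32-bit store */
            s += 4; d += 4; n -= 4;
        }
    }
    /* Slow path (misaligned input) and the tail bytes: byte copy. */
    while (n--)
        *d++ = *s++;
}
```

Both paths produce the same result; only the number of memory accesses differs, which is exactly where the 386's advantage over the 286 comes from (and goes away).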

Reply 12 of 34, by shamino

User metadata
Rank l33t

The Pentium was a significant update to x86, for sure. I'm not sure if it's any more significant than the 386 though, and I think the Pentium Pro may be more significant technically, though it wasn't a mainstream chip. I don't think the 486 gets many originality points but I think it was a great CPU.

Scali wrote:

For me personally it is the nicest x86 to write assembly code for. It is the last in-order execution architecture, so the last x86 to execute code exactly the way you write it. There is a certain elegance to ordering your code to maximize the use of the U and V pipelines.

I remember being intrigued by how programming would have worked on the Itanium, but I can't remember the details anymore. The feature I do remember is that it would have allowed an assembly programmer (or the compiler) to finally have direct control of the branch prediction behavior, instead of just hoping that the dumb CPU would guess correctly.
I don't remember if it addressed the issue of out of order execution, or if it allowed more direct control of multiple pipelines.
It surely would have done away with the CISC->RISC translation barrier, since there would no longer be any need for that. The programmer would have been talking to the real CPU instead of the imitated CISC layer that other CPUs present.

It seemed like the Itanium was designed to relocate optimization decisions back to the code instead of imposing them rigidly in the silicon. From a programming point of view, the Itanium looked really cool.
I think Itanium was an impressive, completely new architecture, but of course it failed due to the real world reliance on x86 which it was making a break from.

Meanwhile AMD was working on the K8, which I find to be one of the most significant updates to x86. It introduced x86-64 and NUMA. I think it may have also been the first mainstream non-laptop CPU to ship with multi-VID P-states that really work, which made power management much more useful and effective than on older CPUs.
When I first read about the K8's "HyperTransport" setup, I felt some doubt as to whether AMD would get it to work reliably without huge delays. Thankfully they were able to make it work and ship it while Intel was still flat-footed.
I think of the K8 as the first "modern" x86 by the standards of today. The P4 and K7 seem a lot more outdated in comparison.

Reply 13 of 34, by Scali

User metadata
Rank l33t
sunaiac wrote:

Yes, with well chosen comparisons you can prove anything, that's true.
Thing is, I didn't think that was the point here.

It wasn't.
I was talking about practical situations.
Back in those days, I was specializing in assembly optimizations, so I was up close and personal with the microarchitectures.
What I am talking about are practical, real-world examples (best possible example for Pentium vs 486 is Quake, everyone knows about that one).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 14 of 34, by Scali

User metadata
Rank l33t
Trevize wrote:
sunaiac wrote:

A 386 DX doing only memory copy operations will be twice as fast as a 286.

As long as the data you are copying is LONG-aligned (to a 4-byte boundary) and you are copying it in 32-bit pieces. If either of these fails, the speed drops, although I can't tell you exact numbers.
The same applies to the 286 vs. the 8086: the 286 can be slower (at the same clock speed) when the data you are accessing is not WORD-aligned (2-byte boundary).

Problem is, it's just a single operation. I was talking about practical situations for entire applications. Even though a 386 may in theory do memcpy() twice as fast as a 286, it won't make your application twice as fast.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 15 of 34, by Stiletto

User metadata
Rank l33t++
shamino wrote:

I remember being intrigued by how programming would have worked on the Itanium, but I can't remember the details anymore.

Lucky for you, last month Raymond Chen wrote a ten-part article on the Itanium processor architecture, as employed by Win32. 😁
Part 1: http://blogs.msdn.com/b/oldnewthing/archive/2 … 7/10630772.aspx
Part 2: http://blogs.msdn.com/b/oldnewthing/archive/2 … 8/10631038.aspx
Part 3: http://blogs.msdn.com/b/oldnewthing/archive/2 … 9/10631311.aspx
Part 3b: http://blogs.msdn.com/b/oldnewthing/archive/2 … 0/10631650.aspx
Part 4: http://blogs.msdn.com/b/oldnewthing/archive/2 … 0/10631649.aspx
Part 5: http://blogs.msdn.com/b/oldnewthing/archive/2 … 1/10631975.aspx
Part 6: http://blogs.msdn.com/b/oldnewthing/archive/2 … 3/10632545.aspx
Part 7: http://blogs.msdn.com/b/oldnewthing/archive/2 … 4/10632777.aspx
Part 8: http://blogs.msdn.com/b/oldnewthing/archive/2 … 5/10633044.aspx
Part 9: http://blogs.msdn.com/b/oldnewthing/archive/2 … 6/10633287.aspx
Part 10: http://blogs.msdn.com/b/oldnewthing/archive/2 … 7/10633553.aspx

😁

"I see a little silhouette-o of a man, Scaramouche, Scaramouche, will you
do the Fandango!" - Queen

Stiletto

Reply 16 of 34, by Scali

User metadata
Rank l33t
shamino wrote:

the Pentium Pro may be more significant technically

Does that matter? The Pentium II was basically a Pentium Pro with MMX added, with some small tweaks (mainly to fix the performance issues in 16-bit code). And the PIII was a PII with SSE added and some more small tweaks.
They are all known as the P6 microarchitecture.

shamino wrote:

I remember being intrigued by how programming would have worked on the Itanium, but I can't remember the details anymore. The feature I do remember is that it would have allowed an assembly programmer (or the compiler) to finally have direct control of the branch prediction behavior, instead of just hoping that the dumb CPU would guess correctly.
I don't remember if it addressed the issue of out of order execution, or if it allowed more direct control of multiple pipelines.

The Itanium worked with 'code bundles', where you could put 3 instructions into a single 'instruction word'. It could execute two code bundles at a time.
When a branch occurs, you could split things up so that it would execute bundles from both sides of the branch, and by the time the outcome of the branch was known, it knew which of the two was the correct result.
IIRC you could also put in a 'hint' for the branch predictor: you can encode a number of bits that predict the behaviour. This is basically the same as what a branch predictor does internally, but the predictor needs to 'learn' the behaviour first while the code runs; in this case the compiler 'learns' the prediction and encodes it in the instruction.
Various RISC CPUs do this anyway.
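A C-level caricature of that 'execute both sides' idea (my own sketch, not how IA-64 code actually looks): compute both arms unconditionally, then keep whichever one the condition selects, so there is no branch to mispredict.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless select: both 'arms' are evaluated, then a mask derived
 * from the condition keeps the correct one -- roughly the effect
 * Itanium's predicated instructions let the compiler express directly. */
uint32_t select_branchless(int cond, uint32_t if_true, uint32_t if_false) {
    uint32_t mask = (uint32_t)-(cond != 0);   /* all ones or all zeros */
    return (if_true & mask) | (if_false & ~mask);
}
```

The cost, of course, is that you always pay for both sides, which only wins when the branch is hard to predict and both arms are cheap.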

For x86 there's a simpler variation: if a conditional jump has a negative offset, the CPU assumes the code is a loop, so it predicts that the jump will be taken; otherwise it predicts it won't be.
So when writing code, you can take advantage of that.
There are also prefix bytes (I believe the cs: and ds: segment-override prefixes) that you can use to force the predictor to take the jump or not.
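The same kind of hint survives today at the compiler level; for example, GCC and Clang's `__builtin_expect` (my example, not something from the thread) tells the code generator which way a branch usually goes, so it can lay out the likely path as straight-line fall-through code:

```c
#include <assert.h>

/* GCC/Clang extension: annotate the expected branch direction.
 * The compiler places the hot path as fall-through code, which is
 * the direction static predictors favour. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse_digit(int c) {
    if (likely(c >= '0' && c <= '9'))   /* hot path: valid digit */
        return c - '0';
    return -1;                          /* cold path: error */
}
```

The hint changes code layout only, never correctness, so a wrong guess just costs a little speed.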

shamino wrote:

The P4 and K7 seem a lot more outdated in comparison.

The P4 was modern in other ways, though. Its trace cache was very innovative, and Intel later brought the idea back as the µop cache in Sandy Bridge.
The double-pumped ALUs were also an interesting concept. I wonder if that will ever make a reappearance.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 17 of 34, by 386SX

User metadata
Rank l33t

I don't understand why the K6-2 was released and lasted so long without internal L2 cache. Considering that the Pentium II and Celeron clearly taught how important cache was, why couldn't the K6-III be released as the next K6 after the 266 MHz version?

Reply 18 of 34, by Scali

User metadata
Rank l33t
386SX wrote:

I don't understand why the K6-2 was released and lasted so long without internal L2 cache. Considering that the Pentium II and Celeron clearly taught how important cache was, why couldn't the K6-III be released as the next K6 after the 266 MHz version?

My guess is that it had to do with two things:
1) AMD started out with CPUs that were drop-in replacements for Pentiums on Socket 7 boards. There was no infrastructure for AMD-specific boards/chipsets at first.
2) On-chip L2 cache makes the chips a lot larger and more difficult/expensive to build. AMD's manufacturing may not have been ready for it yet (the Pentium II didn't have internal L2 cache either; it had separate cache chips on the PCB... it wasn't integrated until the P3. So a PII could only work with a slot approach, not a socket, because it would be too large).

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 19 of 34, by alexanrs

User metadata
Rank l33t
386SX wrote:

I don't understand why the K6-2 was released and lasted so long without internal L2 cache. Considering that the Pentium II and Celeron clearly taught how important cache was, why couldn't the K6-III be released as the next K6 after the 266 MHz version?

Because it performed well enough, and the K6-III was more expensive/harder to manufacture. In non-FPU-intensive software, despite being older technology than the P6 processors, it still offered competitive performance.