The K5 was an in-house design by the same team that designed AMD's earlier RISC processors, the 29K line, which was quite popular in embedded applications back in the '90s. If I remember correctly, the K5 core logic was meant to be their next RISC processor, but AMD also had to come up with a new x86 processor, since the market for the 486 was dwindling. The early K5s were a disappointment: barely faster clock for clock than a Pentium, and built on an old process. The redesign arrived only in the second half of 1996 and was a fast and efficient processor, although its short pipelines and complex design didn't allow it to scale well. Oh, and its FPU is supposed to be very fast: even though it isn't pipelined, it has very low latency. If I remember correctly, in the 133 MHz test someone mentioned above, clock for clock it placed right behind the original P5 in DOS Quake. Since Quake's DOS graphics routines were hand-coded and optimized for the P5 architecture, it wouldn't be far-fetched to think that if somebody had optimized the code for the K5, it would probably place first.
Much of the K6 design came from AMD's acquisition of NexGen and its team of engineers; it's a different design altogether. Much of its speed derives from its 2x 32KB caches, an additional 20KB 'pre-decode' cache (not a true decoded-instruction cache like the Pentium 4's trace cache, but likewise a way to speed up the fetch, align and decode operations) and, last, an 8KB BTB. It has longer pipelines than the K5, which allowed it to scale up to 500 MHz at 0.25 µm and around 750 MHz at 0.18 µm, although the fastest commercial version never went beyond 570 MHz (95 MHz x 6).
The K6-2 introduced the 3DNow! instructions, in which a pair of 32-bit floating point values are processed in parallel, packed in a single 64-bit floating point register. Sort of like MMX, but for floating point numbers. Possibly the best additions in 3DNow! are the approximate reciprocal and approximate reciprocal square root of floating point numbers. Some RISC architectures already had these instructions, which were very useful in 3D graphics before T&L became a thing. Unfortunately for AMD, I don't think any DOS software ever made use of 3DNow!. A 3DNow!-optimized version of DOS Quake would have blown the P5 and even the P6 out of the water.
With 3DNow!, not only could 3D computations be done on two numbers at the same time, but since the x87 architecture has only 8 registers arranged as a 'shallow' stack, halving the register requirements would also have made quite a difference.
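Just to make the idea concrete, here's a minimal sketch using GCC's 3DNow! intrinsics from <mm3dnow.h> (assuming a compiler and 32-bit x86 target that still accept -m3dnow; the v2f union and the madd2 helper are my own names, not anything from a real codebase). Two 32-bit floats ride in one 64-bit MMX register, so a scale-and-offset on both lanes takes one register and one instruction per operation instead of two:

```c
#include <mmintrin.h>   /* __m64 and the MMX intrinsics            */
#include <mm3dnow.h>    /* _m_pfmul, _m_pfadd, _m_femms            */
#include <stdio.h>

/* Two packed 32-bit floats sharing one 64-bit MMX/3DNow! register. */
typedef union {
    __m64 mm;
    float f[2];
} v2f;

/* r = a * scale + offset, both lanes at once (hypothetical helper,
   just to show the packed style).                                  */
static v2f madd2(v2f a, v2f scale, v2f offset)
{
    v2f r;
    r.mm = _m_pfadd(_m_pfmul(a.mm, scale.mm), offset.mm);
    return r;
}

int main(void)
{
    v2f a      = { .f = { 1.0f, 2.0f } };
    v2f scale  = { .f = { 3.0f, 3.0f } };
    v2f offset = { .f = { 0.5f, 0.5f } };
    v2f r = madd2(a, scale, offset);
    _m_femms();                            /* 3DNow!'s faster EMMS  */
    printf("%f %f\n", r.f[0], r.f[1]);     /* expect 3.5 and 6.5    */
    return 0;
}
```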
In addition to that, texturing done right (no wobbly pixels like on the PlayStation!) requires perspective correction, which uses a divide. With 3DNow! you could do two of those in parallel, in much less time, using the approximations. If I remember correctly, the first Voodoo card also needed the dX/dW and dY/dW perspective-correction parameters computed by the CPU for each vertex.
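For the perspective divide, the interesting bit is the reciprocal sequence. If I read the manuals right, PFRCP on its own only approximates 1/x of the low lane (to roughly 14 bits) and broadcasts it, but the two Newton-Raphson refinement steps, PFRCPIT1 and PFRCPIT2, are packed operations, so with a little shuffling you can refine two reciprocals at once. A hedged sketch, again with GCC's <mm3dnow.h> intrinsics and my own function names (recip2, persp_correct), computing 1/W for two vertices in one go:

```c
#include <mmintrin.h>
#include <mm3dnow.h>

typedef union { __m64 mm; float f[2]; } v2f;

/* Approximate 1/w for two vertices at once.  PFRCP only looks at
   the low lane, so take one rough estimate per lane, re-pack them,
   then run the packed refinement steps on both lanes together.
   (The caller still needs an FEMMS before going back to x87 code.) */
v2f recip2(v2f w)
{
    v2f r;
    __m64 lo = _m_pfrcp(w.mm);                      /* ~1/w[0] in both lanes */
    __m64 hi = _m_pfrcp(_m_punpckhdq(w.mm, w.mm));  /* ~1/w[1] in both lanes */
    __m64 x0 = _m_punpckldq(lo, hi);                /* { ~1/w[0], ~1/w[1] }  */
    __m64 x1 = _m_pfrcpit1(w.mm, x0);               /* 1st Newton-Raphson    */
    r.mm     = _m_pfrcpit2(x1, x0);                 /* 2nd: ~24-bit accurate */
    return r;
}

/* Usage: multiply s and t (or two vertices' texture coordinates) by
   1/w instead of paying for two FDIVs, which is the whole point.    */
v2f persp_correct(v2f st, v2f w)
{
    v2f r;
    r.mm = _m_pfmul(st.mm, recip2(w).mm);
    return r;
}
```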
The reciprocal square root is used in lighting calculations, and it's so important that in Quake III John Carmack wrote a software routine that computes the approximation much faster than the FPU could. With 3DNow! you could have had a single, dedicated instruction for that. How cool is that?
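For comparison, the widely circulated Quake III routine approximates 1/sqrt(x) with an integer bit trick plus one Newton-Raphson step, while on a K6-2 a single PFRSQRT gives you a ~15-bit approximation directly. Something along these lines (the C version is paraphrased from the famous routine, using memcpy instead of pointer punning; rsqrt_soft and rsqrt_3dnow are my own names):

```c
#include <stdint.h>     /* int32_t                                */
#include <string.h>     /* memcpy                                 */
#include <mmintrin.h>   /* __m64                                  */
#include <mm3dnow.h>    /* _m_pfrsqrt, _m_femms                   */

/* The famous bit-trick approximation of 1/sqrt(x) (magic constant
   0x5f3759df), refined with one Newton-Raphson step.              */
float rsqrt_soft(float x)
{
    float   half = 0.5f * x;
    int32_t i;
    memcpy(&i, &x, sizeof i);           /* reinterpret float bits  */
    i = 0x5f3759df - (i >> 1);          /* magic initial guess     */
    memcpy(&x, &i, sizeof x);
    return x * (1.5f - half * x * x);   /* one refinement step     */
}

/* The 3DNow! way: one instruction, ~15-bit accurate, with the
   result broadcast to both lanes of the MMX register.             */
float rsqrt_3dnow(float x)
{
    union { __m64 mm; float f[2]; } v = { .f = { x, x } };
    v.mm = _m_pfrsqrt(v.mm);
    _m_femms();                         /* back to x87-safe state  */
    return v.f[0];
}
```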
But, like I said before, none of that is useful under DOS. To my knowledge, nobody ever optimized Quake or any other early 3D game running under DOS for these extensions. Even AMD abandoned 3DNow! in favor of SSE with the second generation of Athlons.
Last, the K6-2 core moved the FXCH instruction from the FP unit to one of the integer units, and this is possibly the only modification that benefits DOS Quake, because that instruction is used a lot in its hand-optimized FP code.
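To show why FXCH is everywhere in Pentium-tuned FP code: most x87 instructions can only use st(0) as one of their operands, so interleaved calculations keep swapping the top of the stack, and on the P5 a paired FXCH is essentially free. A made-up illustration in GCC extended asm for 32-bit x86 (the function and the particular expression are mine, not from Quake):

```c
/* Compute a*b+e and c*d+f with the two chains interleaved, the way
   Pentium-optimized x87 code does it: FXCH rotates whichever
   partial result we need back to st(0).                            */
void two_sums(double a, double b, double c, double d,
              double e, double f,
              double *ab_e, double *cd_f)
{
    __asm__ volatile (
        "fldl   %[a]     \n\t"   /* st0 = a                         */
        "fmull  %[b]     \n\t"   /* st0 = a*b                       */
        "fldl   %[c]     \n\t"   /* st0 = c,      st1 = a*b         */
        "fmull  %[d]     \n\t"   /* st0 = c*d,    st1 = a*b         */
        "fxch   %%st(1)  \n\t"   /* bring a*b back to the top       */
        "faddl  %[e]     \n\t"   /* st0 = a*b+e,  st1 = c*d         */
        "fxch   %%st(1)  \n\t"   /* swap again for the other chain  */
        "faddl  %[f]     \n\t"   /* st0 = c*d+f,  st1 = a*b+e       */
        "fstpl  %[r2]    \n\t"   /* pop c*d+f                       */
        "fstpl  %[r1]    \n\t"   /* pop a*b+e                       */
        : [r1] "=m" (*ab_e), [r2] "=m" (*cd_f)
        : [a] "m" (a), [b] "m" (b), [c] "m" (c),
          [d] "m" (d), [e] "m" (e), [f] "m" (f)
        : "st", "st(1)");
    /* On a Pentium the FXCHs pair with the FP ops and are basically
       free; on the original K6 they occupied the FP unit, which is
       what the K6-2 change fixed.                                  */
}
```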
The K6-III was a K6-2 with a larger, slightly slower, unified cache that acts as an L2, except that, unlike the motherboard cache of the day, it runs off its own dedicated bus at the same frequency as the core. It lets the CPU scale better with frequency, because it reduces the need to fetch data over the slower external memory bus.