Interesting stuff.
Back in the day, I was sticking the cheapest new CPUs, Cyrix DX2s, on the cheapest used 486 boards: 5V boards with only the 4-pin SX/DX jumper for CPU configuration. From that I concluded that Cyrix parts were "always" 5-10% slower than Intel. Those boards were highly unlikely to be Cyrix aware, and so wouldn't have been using writeback. They probably weren't even that good a board for an Intel DX2: first gen 486 stuff, made for when the DX33 was the top end.
Anyway, I was somewhat surprised on the retro-revisit that Cyrix DX2s could sometimes be a tad faster, with a heaping tablespoon of "it depends". There also seemed to be situations in which the Intel 486 was "mysteriously" faster. By that I mean its position between the previous CPU (386) and the subsequent CPU (Pentium) seemed to move from one situation to the next.
There's a tendency to think of the i486 as a tidied up 386, and the Pentium as a big leap forward, but what if it's not quite like that? In the Wikipedia article and other references you see the i486 described as single pipe, tightly pipelined. However, a couple of months back, I was reading a deep dive on the i486 that seemed to be saying that for some simple instructions, parallel execution was possible. This is very limited compared to the Pentium, and the contrived "best case scenario" for it may not have a whole heap of real world uses. But in something like Quake, with Pentium optimised code, what if it's catching a dual capable instruction pair every now and again? Not hitting them regularly like the dual pipelined Pentium, but just every few instructions it gets a pair it can do. Then you can also imagine Intel "strongly encouraging" benchmarks to throw a few of these in, maybe smokescreening it as "necessary for future Pentium compatibility" while also giving the i486 a little boost.
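To put rough numbers on the "every now and again" idea, here is a back-of-envelope sketch. It is a toy model, not a description of how the i486 actually issues instructions: it assumes every instruction otherwise costs exactly one cycle, and that a "pair" means one instruction rides along for free. The function name `paired_speedup` and its parameter are made up for illustration.

```python
def paired_speedup(instructions_per_pair: int) -> float:
    """Speedup over a strict one-instruction-per-cycle pipeline.

    Toy assumption: in every run of `instructions_per_pair` instructions,
    exactly one instruction executes alongside its neighbour and so costs
    no extra cycle. Everything else costs one cycle each.
    """
    if instructions_per_pair < 2:
        raise ValueError("a pair needs at least two instructions")
    cycles = instructions_per_pair - 1  # one instruction in the run is free
    return instructions_per_pair / cycles


if __name__ == "__main__":
    # 2 means every instruction pairs (the ideal dual pipe case);
    # larger values mean pairs turn up less often.
    for n in (2, 4, 8, 16):
        print(f"one pair every {n:2d} instructions -> {paired_speedup(n):.3f}x")
```

Under those assumptions, one lucky pair every 16 instructions comes out at about 1.07x, so even rare pairing would be enough to shift results by a few percent. Real cycle counts vary per instruction, so treat this only as a sense of scale.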
OTOH, the Cyrix 486 may have been more of the expected "tidied and tightened up" 386, which had a visible step in its evolution in the 486SLC/DLC class versions. Therefore its ALU may have been working at higher efficiency as a single unit without the help of occasional parallelism, which would explain how, when they tuned it further and put two together in the dual pipelined 6x86, it took a leap in integer performance past Intel's.
However, unless anyone knows what I'm talking about, I guess I'll have to rethread my adventures through the technical analyses and deep dives I read some months back to dig up where it was spelled out.
Edit: Not the thing I was looking for, but an interesting paper on x87 access differences that goes on to CPU/FPU parallelism in the 486: https://dl.acm.org/doi/pdf/10.1145/111048.111052
Unicorn herding operations are proceeding, but all the totes of hen's teeth and barrels of rocking horse poop give them plenty of hiding spots.