I've contributed to performance white papers, and suffice it to say: unless your workload is an exact duplicate of the one tested, you should treat the results with extreme prejudice. By the time a whitepaper is published publicly, most meaningful engineering data has already been scrubbed clean (unless it conformed to marketing's lofty goals). However, the configuration options and methodology are still useful for an end user who wants to re-run the tests with their own workloads.
That said, I've been trying to find early Netburst-era (i.e. pre-Prescott) whitepapers and online reviews, and unfortunately it seems like a lot of them are gone or the images/graphs are missing.
I did come across this IBM whitepaper:
https://www.ibm.com/developerworks/library/l-htl/index.html
Immediate red flag: they are using a Xeon MP in a single-CPU configuration, which is not remotely real-world (end users do not pay a massive premium for a 4S system to run one CPU). Based on my experience with marketing magic, the original engineering team probably tested 1-CPU, 2-CPU, and 4-CPU configurations. The 2-CPU and 4-CPU runs likely did not make the marketing cutoff, so a decision was made to publish only the 1-CPU Xeon MP results. If the 2-CPU and 4-CPU data were as good as the 1-CPU data, there would be no reason to withhold them, as publishing would presumably help sales and drive ASP.
Also, the reason I brought up cache being a huge problem with early Netburst (which had small amounts of L1/L2) is that even though in an SMT scenario the two logical cores share the same physical caches (so there is no snooping between them), you can still run into cache contention. The cache itself is not SMT-aware, so one thread can cache-starve the other, and the starved thread just limps onwards to completion. The absolute worst case is when the two threads reference completely disjoint sets of memory.
Now, moving on to early Xeon MP-specific issues: their biggest problems were bus constraints (all four CPUs shared one FSB to the MCH) and cache coherency. Cache matters more in an MP scenario than in SP, as you may need to keep up to four processors coherent. In fact, as I recall, a significant amount of the FSB bandwidth on the early Xeons was wasted just on snooping, which in turn leaves less than the desired amount for actual memory access. Generally, the worst performers with HT enabled are memory-bandwidth-intensive applications. I don't think this was completely resolved until Intel went to NUMA with Nehalem.