Well, I have now checked out gcc-3.4.1, and it seems the profile feedback optimizations are really important. Some switches have been renamed, and the docs say -funroll-loops is enabled automatically when profile data is used. Before, gcc did no unrolling unless you enabled it separately. I interpret this as: loop unrolling is useful when using profile feedback, but not so much otherwise.
I have not benchmarked anything yet, but my subjective impression is that the following CXXFLAGS improve things, if only barely noticeably:
-O3 -fmerge-all-constants -funroll-loops -ffast-math -funswitch-loops
-march=... is important, but difficult to use for binaries that have to run on many different CPUs, since the result may use instructions that older processors lack. You should definitely use -mtune=<most-common-cpu>, however.
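To make the -march / -mtune distinction concrete, here is a minimal sketch of the two flag sets. The CPU names are purely hypothetical placeholders picked for illustration (both are valid gcc 3.4 x86 names), not a recommendation; substitute whatever fits your users and your machine.

```shell
#!/bin/sh
# Hypothetical CPU names, for illustration only; replace with your own.
TUNE_CPU="pentium4"    # assumed "most common" CPU among users
LOCAL_CPU="athlon-xp"  # assumed CPU of the build machine

# Binary meant for distribution: instruction scheduling is tuned for
# TUNE_CPU, but the instruction set stays at the compiler default,
# so the result still runs on other CPUs.
PORTABLE_FLAGS="-O3 -mtune=$TUNE_CPU"

# Binary for this machine only: gcc may emit instructions that
# only LOCAL_CPU (and compatible CPUs) understand.
LOCAL_FLAGS="-O3 -march=$LOCAL_CPU"

echo "portable: $PORTABLE_FLAGS"
echo "local:    $LOCAL_FLAGS"
```

The script only prints the two flag strings; pass whichever one applies as CXXFLAGS to configure.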
If you have the patience, add -fprofile-generate to the switches, compile, and run dosbox with a few different games, covering different video modes, cpu instructions, scalers and sound cards. (I'd love to hear which ones you pick. I tried Settlers for protected mode+svga+gus and MOM for hq2x+sblaster. I haven't had other ideas yet, and I have no useful real mode program.) After running the sample programs, reconfigure with -fprofile-generate replaced by -fprofile-use, then make clean and make.
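The two-pass procedure above can be sketched as a small script. This is a dry run under assumptions: it only prints the commands rather than executing them, it assumes an autoconf-style ./configure that picks up CXXFLAGS from the environment, and print_build is a made-up helper name. Add your own -mtune switch to BASE_FLAGS if you use one.

```shell
#!/bin/sh
# Dry-run sketch of the profile feedback build: prints commands, runs nothing.
BASE_FLAGS="-O3 -fmerge-all-constants -funroll-loops -ffast-math -funswitch-loops"

# print_build is a hypothetical helper; $1 is the profile switch for this pass.
print_build() {
    echo "CXXFLAGS=\"$BASE_FLAGS $1\" ./configure"
    # The second pass rebuilds everything so each file is recompiled
    # with the collected profile data.
    if [ "$1" = "-fprofile-use" ]; then
        echo "make clean"
    fi
    echo "make"
}

# Pass 1: instrumented build that records profile data while running.
print_build -fprofile-generate

echo "# now run dosbox with a representative mix of games, then:"

# Pass 2: optimized build that reads the recorded profile data.
print_build -fprofile-use
```

Once the printed commands look right for your tree, run them by hand in the dosbox source directory.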
Unfortunately, a real dosbox benchmark is still missing. I can bench gcc using a synthetic benchmark, but I'd love to see something like "unlimited cpu cycles" with a display of how many cycles are actually executed, and the same for video frames drawn. CPU usage stats in top or task manager are too unreliable to detect improvements of a few percent. Still, a few percent here, a few percent there, it all adds up.