First post, by ih8registrations
memory, scalar, and dma blitting.
Any reason for not adding this to CVS? I'm always curious why things aren't added.
Doesn't gain anything, adds more code, and has issues?
Like the string copy can trigger pagefaults when it shouldn't.
It gains some, it adds more code, and I'm missing where the string copy triggers a pagefault.
It gains some
Not relevant amounts, so why bother.
it adds more code
Uglifies places that are quite straightforward and readable.
and I'm missing where the string copy triggers a pagefault.
mem_strlen, not the copy.
mem_strlen could be done without; I've only seen it called at startup. I did it because I could, using the same algorithm in string copy.
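The patch itself isn't shown in this thread, so the following is only a guess at the kind of problem meant: a length scan that fetches several bytes at a time can touch memory past the terminator, and if that memory sits on the next page the fetch can fault where a bytewise scan would not. The helper names, the 4-byte chunk width, and the 4096-byte page in the usage note are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: index one past the last byte a bytewise scan reads.
// The terminator is the last byte it touches.
static std::size_t bytes_touched_bytewise(const std::uint8_t *mem, std::size_t start) {
    assert(mem != nullptr);
    std::size_t i = start;
    while (mem[i] != 0) i++;
    return i + 1;
}

// Hypothetical helper: same measurement for a scan that always fetches
// 4 bytes at once and tests the chunk for a zero byte.
static std::size_t bytes_touched_dwordwise(const std::uint8_t *mem, std::size_t start) {
    assert(mem != nullptr);
    std::size_t i = start;
    for (;;) {
        std::uint32_t chunk;
        std::memcpy(&chunk, mem + i, 4);  // the 4-byte fetch
        // Non-zero exactly when at least one byte of chunk is zero.
        if ((chunk - 0x01010101u) & ~chunk & 0x80808080u) return i + 4;
        i += 4;
    }
}
```

With a two-character string whose terminator is the last byte of a 4096-byte page (string at offset 4093), the bytewise scan stops at offset 4096, while the 4-byte fetch also reads offset 4096, which is the first byte of the following page.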
Profiling shows usage is spread out, so improvements are going to come from many small optimizations rather than from something like threading. I have another patch here, optimized with 64-bit decoding and gcc_unlikely path optimizations, which brings another small bump.
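For readers unfamiliar with the term, an "unlikely path" hint on GCC usually means wrapping a condition in `__builtin_expect` so the compiler lays out the common path first. The second patch is not shown in the thread, so the macro name `UNLIKELY_SKETCH` and the function below are purely illustrative and not taken from it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative macro: tell GCC-compatible compilers the condition is rarely
// true; fall back to the plain condition elsewhere.
#if defined(__GNUC__)
#define UNLIKELY_SKETCH(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY_SKETCH(x) (x)
#endif

// Illustrative only: count 0xFF bytes, treating them as the rare case.
static std::size_t count_rare(const std::uint8_t *p, std::size_t n) {
    assert(p != nullptr || n == 0);
    std::size_t rare = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (UNLIKELY_SKETCH(p[i] == 0xFF)) rare++;
    }
    return rare;
}
```

The hint changes only code layout and branch weighting, never the result, which is why gains from it tend to be small.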
which brings another small bump
Well the problem at this stage is that adding complexity for very small speed
gains makes other optimizations/rewrites/changes harder, so they're not
useful imo, as they are not noticeable on regular PCs, and on low-powered
devices you have pretty different problems anyway.
But that's only my humble opinion of course.
The scaler BituMove might be interesting and is easy enough (not sure if
the 16byte alignment is fine though). Did you profile some stuff with that?
Especially default modes (320x200 games with normal2x scaler).
Yeah, long ago. BTW, I'm on an Athlon XP with only 256k L2. In my testing, some games cycle between the three conditions (size <8, size >=8, and size >=8 with tmp, the remainder), but most hit one condition exclusively or most of the time. To reduce code size for CPUs with even less cache, the <8 case could be changed to a repeated byte loop rather than the if dword, if word, if byte sequence.
It could be further shrunk by having the remainder fall through, using the <8 loop, like so:
/* note: the loops below assume size is non-zero */
static void DMA_BlockRead(PhysPt pt,void * data,Bitu size) {
    Bit32u page=pt>>12;
    Bit32u * pagemap;
    Bit32u mask;
    /* mask=0 makes every lookup hit pagemap[0], i.e. page itself */
    if (page < LINK_START) { Bit32u pageend=(pt+size)>>12; pagemap=pmap[pageend < EMM_PAGEFRAME4K]; mask=~0; }
    else { pagemap=&page; mask=0; }
    Bit64u * writeq=(Bit64u *) data;
    if (size>=8) {
        Bit8u tmp=size&0x07;    /* bytes left over after the qword loop */
        size>>=3;               /* number of whole qwords */
        do {
            *writeq++=phys_readq(pagemap[(pt>>12)&mask]*4096 + (pt & 4095));
            size--; pt+=8;
        } while (size);
        if (!tmp) return;
        size=tmp;               /* fall through to the byte loop */
    }
    Bit8u * write=(Bit8u *) writeq;
    do {
        *write++=phys_readb(pagemap[(pt>>12)&mask]*4096 + (pt & 4095));
        pt++;
    } while (--size);
}
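To make that control flow easy to check outside DOSBox, here is a minimal sketch of the same qword-then-bytes shape over a flat buffer. `copy_qwords_then_bytes` and the flat `src` pointer are stand-ins invented for illustration; the page lookup (`pagemap`, `mask`) and `phys_readq`/`phys_readb` are deliberately left out, so page crossings are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the routine above on flat memory: copy size
// bytes from src to data, 8 bytes at a time, then let the remainder fall
// through to the bytewise loop that also serves the size < 8 case.
static void copy_qwords_then_bytes(const std::uint8_t *src, void *data, std::size_t size) {
    assert(size > 0);                    // both do/while loops need at least one byte
    std::uint8_t *dst = static_cast<std::uint8_t *>(data);
    if (size >= 8) {
        std::size_t tail = size & 0x07;  // bytes left after the qword loop
        size >>= 3;                      // number of whole qwords
        do {
            std::uint64_t q;
            std::memcpy(&q, src, 8);     // stands in for phys_readq
            std::memcpy(dst, &q, 8);
            src += 8; dst += 8;
        } while (--size);
        if (!tail) return;
        size = tail;                     // fall through with the remainder
    }
    do {
        *dst++ = *src++;                 // stands in for phys_readb
    } while (--size);
}
```

The sketch uses `memcpy` for the 8-byte moves so it does not depend on the destination being 8-byte aligned, which the pointer cast in the forum code does assume.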
Alternately, could do something like this:
#define optimize 1
/* read a block from physical memory */
/* note: the loops below assume size is non-zero */
static void DMA_BlockRead(PhysPt pt,void * data,Bitu size) {
    Bit32u page=pt>>12;
    Bit32u * pagemap;
    Bit32u mask;
    /* mask=0 makes every lookup hit pagemap[0], i.e. page itself */
    if (page < LINK_START) { Bit32u pageend=(pt+size)>>12; pagemap=pmap[pageend < EMM_PAGEFRAME4K]; mask=~0; }
    else { pagemap=&page; mask=0; }
#ifdef optimize
    if (size>=8) {
        Bit64u * writeq=(Bit64u *) data;
        Bit8u tmp=size&0x07;    /* bytes left over after the qword loop */
        size>>=3;               /* number of whole qwords */
        do {
            *writeq++=phys_readq(pagemap[(pt>>12)&mask]*4096 + (pt & 4095));
            size--; pt+=8;
        } while (size);
        if (tmp) {
            /* back up so one last qword ends exactly at the end of the block;
               this re-copies 8-tmp bytes that were already written */
            tmp=8-tmp; pt-=tmp; writeq=(Bit64u *)((Bit8u *)writeq-tmp);
            *writeq=phys_readq(pagemap[(pt>>12)&mask]*4096 + (pt & 4095));
        }
        return;
    }
#endif
    Bit8u * write=(Bit8u *) data;
    do {
        *write++=phys_readb(pagemap[(pt>>12)&mask]*4096 + (pt & 4095));
        pt++;
    } while (--size);
}
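The overlapping-tail idea in this second variant can be tried on its own with a flat-memory sketch. `copy_qwords_overlap_tail` is a hypothetical name, the source is a plain array rather than emulated physical memory, and the page lookup is omitted, so this only shows the pointer arithmetic, not DOSBox behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical flat-memory version of the size >= 8 path: copy whole
// qwords, then cover any remainder with one extra qword that overlaps
// the bytes already copied and ends exactly at the end of the block.
static void copy_qwords_overlap_tail(const std::uint8_t *src, void *data, std::size_t size) {
    assert(size >= 8);                   // smaller blocks take the bytewise path
    std::uint8_t *dst = static_cast<std::uint8_t *>(data);
    std::size_t tail = size & 0x07;      // bytes left after the qword loop
    std::size_t n = size >> 3;           // number of whole qwords
    std::uint64_t q;
    do {
        std::memcpy(&q, src, 8);
        std::memcpy(dst, &q, 8);
        src += 8; dst += 8;
    } while (--n);
    if (tail) {
        std::size_t back = 8 - tail;     // how far the last qword overlaps
        src -= back; dst -= back;
        std::memcpy(&q, src, 8);
        std::memcpy(dst, &q, 8);
    }
}
```

The trade-off is that up to 7 bytes are read and written twice in exchange for dropping the remainder loop, which is harmless only as long as reading the source has no side effects.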