VOGONS


Reply 20 of 634, by FreddyV

Hi,

I mix into a single buffer, but it holds 4 buffers of 1/50 second each.
While the Sound Blaster is playing the 1st buffer, Mod Master is mixing the 4th buffer.

For example, if we have this during 4 x 1/50 second:
1 : 4 Channels
2 : 3 Channels
3 : 2 Channels
4 : 2 Channels
-> This is like mixing 3 channels on average (2.75 exactly).

With 2 buffers, it is like mixing 3.5 channels, then 2.

If the code cannot mix more than 3 channels, it will work in the 1st case but not in the second.
The trick of raising the minimum volume to play, and of skipping the last channels when the mixing code is only 1 buffer ahead of the sound card, is there to make sure the mixing code always stays ahead of the Sound Blaster.
Then there is no "click", only some parts of the samples missing.
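
The averaging argument can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Mod Master code: `window_loads` is a hypothetical helper name, and the channel counts are the ones from the example above.

```python
def window_loads(channels_per_tick, ticks_ahead):
    """Average number of channels the mixer must sustain when it can
    spread its work over `ticks_ahead` consecutive 1/50 s buffers."""
    loads = []
    for start in range(0, len(channels_per_tick), ticks_ahead):
        window = channels_per_tick[start:start + ticks_ahead]
        loads.append(sum(window) / len(window))
    return loads

ticks = [4, 3, 2, 2]                     # channels active in each 1/50 s slice
four_buffers = window_loads(ticks, 4)    # one window covering all four slices
two_buffers = window_loads(ticks, 2)     # two windows of two slices each
```

With four buffers the worst window stays under 3 channels, while with two buffers the first window needs 3.5, which is the case that fails for a mixer limited to 3.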

If GLX really mixes all 4 channels all the time, GLX performance is poor compared with Mod Master on a file using lots of Volume Slide...
This is exactly what we see with the Crystal Dream 2 music: GLX fails at 15KHz while I can play it with Mod Master at 18KHz.

I am sure the GLX mixing code is faster, because I know I can still optimize the Mod Master code, and with music that plays 4 channels all the time Mod Master is obviously slower.

I don't understand what you mean by mix without a volume table.

It is really too slow to mix with a multiplication, and there is not enough memory to precalculate all the samples at all the volumes.
I am 100% sure GLX is not doing this: it would take a lot of time, there is not enough memory for it, and GLX file loading is fast.
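
As an illustration of why a volume table is the usual middle ground, here is a minimal Python sketch. It assumes signed 8-bit samples and the MOD volume range 0..64; the table layout and the names are hypothetical, not taken from Mod Master or GLX.

```python
VOLUMES = 65          # MOD volume range 0..64 (assumption for this sketch)
SAMPLE_VALUES = 256   # every possible 8-bit sample byte

def build_volume_table():
    """table[volume][byte] = sample scaled by volume/64, as a signed value."""
    table = []
    for volume in range(VOLUMES):
        row = []
        for byte in range(SAMPLE_VALUES):
            signed = byte - 256 if byte >= 128 else byte   # two's complement
            row.append(signed * volume // 64)
        table.append(row)
    return table

table = build_volume_table()
table_bytes = VOLUMES * SAMPLE_VALUES            # one byte per table entry
# Pre-scaling the sample data itself costs VOLUMES times its size instead;
# 100 KiB of samples is a made-up figure used only for comparison.
prescaled_bytes = 100 * 1024 * VOLUMES
```

The table costs about 16 KiB whatever the module size, and mixing becomes one lookup per sample instead of one multiplication.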

You also say that always MIX 4 Channels is the fastest way, I disagree.

For each buffer byte, we can't switch between the 4 channels, as we would need to reload each sample's segment/offset, the increments, the volume and so on. There are not enough registers and it is a waste of time.

8088 MPH Code is doing it, but it is possible ONLY because all the samples are in the same segment AND they are pre calculated with all the volumes.

I hope I am clearer now.

Reply 21 of 634, by FreddyV

Hi,

I found that in the Mod Master release I published, I broke the Vibrato when I optimized the partition (pattern) reading.
It is already corrected, and I found another problem that was present even before (period not restored after the Vibrato).

Not simple: when we touch something, we break something else 😀

FreddyV

Reply 22 of 634, by Scali

FreddyV wrote:

8088 MPH Code is doing it, but it is possible ONLY because all the samples are in the same segment AND they are pre calculated with all the volumes.

Yes, 8088 MPH uses only 16 volume levels. Which still is an acceptable amount I suppose, especially when you're mixing to 8-bit anyway.
The other limitation to get everything to fit into a single segment is to limit samples to 256 bytes. That is not an option for a generic mod player.
The maths would then be something like:
16 instruments * 16 volumes * 256 bytes = 64k.
So I think a third limitation is that you can't use more than 16 instruments.
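
The arithmetic above can be written out as a small Python sketch. The addressing function is a hypothetical layout chosen only to show that every pre-scaled byte fits in one 64K segment; the thread does not describe how 8088 MPH actually arranges its data.

```python
INSTRUMENTS = 16
VOLUME_LEVELS = 16
SAMPLE_LENGTH = 256
SEGMENT_SIZE = 64 * 1024   # one 8086 segment

def prescaled_offset(instrument, volume, position):
    """Offset of one pre-scaled sample byte inside the single segment
    (hypothetical layout: instrument-major, then volume, then position)."""
    return (instrument * VOLUME_LEVELS + volume) * SAMPLE_LENGTH + position

total = INSTRUMENTS * VOLUME_LEVELS * SAMPLE_LENGTH
last = prescaled_offset(INSTRUMENTS - 1, VOLUME_LEVELS - 1, SAMPLE_LENGTH - 1)
```

The last offset is exactly the last byte of the segment, so any extra instrument, volume level or sample byte would overflow it.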

One trick that GLX does is that it unrolls the resample loop with fixed indices. This is not a 100% accurate resampling to the correct pitch, but it seems to work well enough in practice, and cuts quite a bit of overhead.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/

Reply 23 of 634, by FreddyV

Hi Scali,

What do you mean by "unrolls the resample loop with fixed indices"?

I unroll the loop so that the check for running past the end of the sample is done once every 64 samples, not all the time.

When I listen to Mod Master and GLX playing at the same frequency, GLX has something like a small pitch difference. It is difficult to explain.

By the way, when I wrote that I could still improve the mix code: I just modified it and it is now 16% faster.
ENIGMA.MOD can be played at... 32KHz
CD2 Part 1 can be played at 16900 Hz without skipping anything, while GLX has to go down to 14KHz
(Still on my 8MHz 8086 computer)

Anyway, there is still something strange. I can sometimes hear a small click, and if I go back to the same part of the music, it is gone.
These clicks do not occur under DOSBox.

And there is still a replay bug, added by my changes, with the Arpeggio on a sample using Fine Tune. I have not finished reworking the way it manages the notes/periods.

Reply 24 of 634, by MobyGamer

FreddyV wrote:

This is like mixing 3 channels on average.

Oh, okay, you're just using bigger/more buffers to mix ahead. When I wrote my player in the '90s, I just standardized on a 32K buffer, and mixed into whichever 16K half wasn't currently being played via DMA. So this was several rows ahead, way more than 4, but if you don't need interactive sound or low latency, it's fine. For demos I used much smaller buffers of course.
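
A minimal Python sketch of that two-halves scheme follows. How the player found out which half was playing is not stated in the post, so reading a DMA position is an assumption here, and `half_to_mix` is a hypothetical name.

```python
BUFFER_SIZE = 32 * 1024
HALF = BUFFER_SIZE // 2

def half_to_mix(dma_position):
    """Return (start, end) of the 16K half that DMA is NOT playing right now."""
    playing_first_half = (dma_position % BUFFER_SIZE) < HALF
    return (HALF, BUFFER_SIZE) if playing_first_half else (0, HALF)
```

For example, while the DMA position is anywhere in the first 16K, the mixer fills bytes 16384 to 32767, and the roles swap when playback crosses the midpoint.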

I don't understand what you mean by mix without a volume table.

Sorry, I wasn't clear. By volume table, I meant the final conversion from the mixed sample to the output. One way to mix is to a 16-bit value and then you use a table to convert to 8-bit for output. For a 4-channel mod, you could get away without using a table by either shifting the 16-bit result right twice (ie. SHR AX,2) or you could eliminate the shift entirely by mixing directly to an 8-bit value by preconverting your 8-bit samples to 6-bit first so that there is no overflow during mixing. Quality is not as good, of course.
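
The two table-free options can be compared in a short Python sketch. It assumes signed 8-bit samples, and Python's `>>` is an arithmetic shift, so it stands in for a sign-preserving shift rather than a literal SHR; the function names are hypothetical.

```python
def mix_shift(samples):
    """Sum four signed 8-bit samples in a wide accumulator, then divide by 4."""
    return sum(samples) >> 2

def to_6bit(sample):
    """Pre-convert a signed 8-bit sample to the 6-bit range -32..31."""
    return sample >> 2

def mix_preconverted(samples):
    """Sum four pre-converted samples; the total always fits in 8 bits."""
    return sum(to_6bit(s) for s in samples)
```

Both stay inside -128..127 for four channels, but the pre-converted version throws away the two low bits of every sample before mixing: four samples of value 3 mix to 3 with the shift and to 0 with pre-conversion, which is the quality loss mentioned above.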

You also say that always MIX 4 Channels is the fastest way, I disagree.

Only if you have 2 or less channels playing. GLX uses 16-bit ADDs to mix 2 samples at once so there isn't much penalty. Reenigne's RE of the GLX code shows this as the mixing routine (I don't think he'll mind me sharing our correspondence):

As far as I can tell, the code is the same for all output devices (PC speaker uses an interrupt so it doesn't need a special mixing routine). Here's what it looks like:

    es: mov bl,[si+0000]
    mov al,[bx]
    es: mov bl,[si+0000]
    mov ah,[bx]
    add [di+0],ax

This section repeats 10 times (increasing the DI offset by 2 bytes each time), so it does 20 samples at once. At 50 frames per second and 22KHz we have 440 samples per frame - the optimum unroll size is going to be roughly the square root of that, so that seems about right. Mixing 2 samples at once by using a 16-bit add is also a clever trick.

This takes 100 cycles for 2 samples before refresh, so is only just fast enough.
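
The numbers in the last two paragraphs can be reproduced with a rough Python calculation. The 4.77 MHz clock is an assumption (a stock PC/XT), "22KHz" is taken as 22000 Hz, and the cost of DRAM refresh is left out, as in the quoted figure.

```python
import math

FRAME_RATE = 50            # frames per second, from the text
SAMPLE_RATE = 22000        # "22KHz" in the text, taken as 22000 Hz
CPU_HZ = 4_772_727         # 4.77 MHz 8088 clock (assumption)
CYCLES_PER_PAIR = 100      # quoted figure: 100 cycles for 2 samples
CHANNELS = 4

samples_per_frame = SAMPLE_RATE // FRAME_RATE
unroll_estimate = math.sqrt(samples_per_frame)
budget_per_sample = CPU_HZ / SAMPLE_RATE              # cycles available
mix_cost_per_sample = CHANNELS * CYCLES_PER_PAIR / 2  # cycles spent mixing
```

Under these assumptions the mixer needs 200 of roughly 217 available cycles per output sample, which matches "only just fast enough", and the square root of 440 is about 21, close to the 20 samples per block.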

The code is self-modifying - the 0000s in the "es: mov bl,[si+0000]" lines are modified by this loop:

o97dd:
    mov [si+o983c + 3],ax
    add bp,di
    adc ax,dx
    mov [si+o983c + 0a],ax
    add bp,di
    adc ax,dx
    add si,0011
    loop o97dd

So the code is cheating slightly by using this same "table of offsets" for each block of 20 (ideally the table should be slightly different each time but I guess the difference isn't audible).
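
To see what reusing one table of offsets can cost, here is a Python model that compares it with an exact stepping of the sample position. The excerpt does not show how GLX advances from one block of 20 to the next, so this sketch assumes the fractional part of the position is dropped at each block boundary; that assumption, and the function names, are mine.

```python
from fractions import Fraction

BLOCK = 20  # samples covered by one unrolled block, as in the post

def exact_positions(step, count):
    """Reference: the source index for output sample n is floor(n * step)."""
    return [int(n * step) for n in range(count)]

def table_positions(step, count):
    """Reuse one table of BLOCK offsets for every block (assumed model)."""
    offsets = [int(i * step) for i in range(BLOCK)]
    block_advance = int(BLOCK * step)   # assumption: fraction dropped per block
    return [(n // BLOCK) * block_advance + offsets[n % BLOCK]
            for n in range(count)]
```

With a step of 1/2 the two agree, because 20 steps land on a whole number. With a step of 3/8 the exact position after 40 samples is 15, while the table version has only reached 14, so in this model the sample plays slightly slow.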

Reply 25 of 634, by MobyGamer

I tried to get MODM22XT working on my IBM PC 5160 but couldn't get it to work; playing a module switches to the pattern/row display but nothing sounds. I was using a PAS in SB compatibility mode; I'll swap cards and see if I can get better results. Also, I don't have a mouse so I had to generate MODM.CFG on a machine with a mouse so I could set the speed lower, disable EMS, etc. Consider maybe making it possible to adjust dialog options without requiring a mouse? Or maybe command-line options? Most PC/XT class systems don't have mice, you were spoiled with your Amstrad 😉

Reply 26 of 634, by Scali

FreddyV wrote:

What do you mean by "unrolls the resample loop with fixed indices"?

MobyGamer has already answered that by posting the relevant code above.
Basically the interpolation routine is unrolled, with the addresses for fetching the samples hardcoded in the loop, rather than using some sort of DDA algorithm.
As he points out, the fixed indices are re-used for every 20 iterations, which is fast, but not entirely correct.

Reply 27 of 634, by FreddyV

Hi,

MobyGamer wrote:

I tried to get MODM22XT working on my IBM PC 5160 but couldn't get it to work; playing a module switches to the pattern/row display but nothing sounds.

I don't have a PAS (quite expensive now 😀), so I don't know what is happening.
It is probably because something is wrong inside the MODM.CFG file, like the IRQ/DMA/port.
You don't have a standard serial mouse to put in the serial port ?
Mod Master 2.0 supported the full keyboard, but adding full keyboard support is a big job and I am more motivated to work on the replay code.
You can also test Mod Master under DOSBox to hear the mixing quality. It is much better than on the noisy 8-bit Sound Blaster cards...

So thanks for sharing this. I now understand that there is no real secret, and I understand the performance/sound quality difference.
We both use an 8-bit volume table and both mix the channels one by one ("mix 2 samples at once" can be read as 2 output-buffer samples or as 2 channels, since each channel plays a sample).

Here is my code (updated yesterday). I integrated the index displacement into the code and copied the code 32 times, not 10:

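    MOV AX,ES:[DI]      ; load two mix-buffer bytes
    MOV BL,[SI]         ; sample byte becomes the volume-table index
    ADD AL,ES:[BX]      ; add the scaled sample to the first byte
    ADD DH,DL           ; advance the fractional position
    ADC SI,BP           ; advance the sample pointer, plus carry
    MOV BL,[SI]
    ADD AH,ES:[BX]      ; same for the second byte
    STOSW               ; store both bytes, DI += 2
    ADD DH,DL
    ADC SI,BP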

The code used in the last YouTube video was this one, copied 64 times:

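    MOV AL,ES:[SI]      ; read the sample byte
    XLAT                ; AL = [BX+AL], volume table lookup
    ADD [DI],AL         ; add to the mix buffer
    INC DI
    ADD DH,DL           ; advance the fractional position
    ADC SI,BP           ; advance the sample pointer, plus carry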

- By integrating the index displacement into the mixing code, I avoid the slow "based indexed relative" addressing (EA calculation needs 11 or 12 cycles!)
ADD DH,DL / ADC SI,BP takes 6 cycles
- The 2nd loop that calculates the indices is no longer needed (even if it is small)
- The sample end / loop check is done less often and there are fewer loop iterations
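
The ADD DH,DL / ADC SI,BP pair is a fixed-point step with an 8-bit fraction. The following Python model is only a sketch of that idea, with hypothetical names, and is not a translation of the whole mixing loop.

```python
def step_positions(start, int_step, frac_step, count):
    """Model ADD DH,DL / ADC SI,BP: an 8-bit fraction whose carry bumps SI."""
    si, dh = start, 0
    positions = []
    for _ in range(count):
        positions.append(si)                     # sample is read at [SI]
        dh += frac_step                          # ADD DH,DL
        carry = dh >> 8                          # carry out of the 8-bit add
        dh &= 0xFF
        si = (si + int_step + carry) & 0xFFFF    # ADC SI,BP
    return positions
```

A step of 1.5 is encoded as integer part 1 and fraction 128, which visits positions 0, 1, 3, 4, 6, 7; a step of 0.25 (integer 0, fraction 64) repeats each source byte four times.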

I did not want to do a 16-bit ADD, as the carry from the low byte sometimes adds 1 to the AH register.
So I did it like this:
I put the volume table and the mix buffer in the same segment, addressed through ES. This allows me to replace the ADD [DI],AX with a MOV AX,ES:[DI] followed by STOSW, and to add the samples directly while reading the volume table.
Then we don't need INC DI / INC DI, or to add the relative address like in GLX.
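
The carry problem with the 16-bit ADD can be shown with two small Python functions that treat a word as two packed 8-bit samples. The names are hypothetical; the arithmetic is plain 8086 behaviour.

```python
def add_packed_word(buffer_word, sample_word):
    """ADD [DI],AX: one 16-bit add over two packed 8-bit samples."""
    return (buffer_word + sample_word) & 0xFFFF

def add_two_bytes(buffer_word, sample_word):
    """ADD AL,.. / ADD AH,..: two independent 8-bit adds, no carry between."""
    low = ((buffer_word & 0xFF) + (sample_word & 0xFF)) & 0xFF
    high = ((buffer_word >> 8) + (sample_word >> 8)) & 0xFF
    return (high << 8) | low
```

Adding 0x0520 to a buffer word of 0x10F0 gives 0x1610 with the packed add but 0x1510 with separate byte adds: the low byte overflowed and pushed 1 into the high sample. With signed samples this happens whenever the low-byte addition wraps, for example when a small negative value is added to a positive one.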

If GLX Code was like this, It may be faster and have a better sound quality:

    mov ax,ES:[DI]
    mov bl,[si+0000]
    add al,ES:[bx]
    mov bl,[si+0000]
    add ah,ES:[bx]
    stosw

I have some questions for you:
- I saw in documentation that MOV AX,mem is done in 10 Cycles, whatever is the addressing mode used.
I believe it is wrong as it surely take longer to read this instruction MOV AX,ES:[DI+xxx] (5 Bytes) than this one MOV AX,[DI] (2 Bytes)

- I did not find the cycles needed for the STOSW instruction.
- Where can I find correct documentation for instructions cycles ?

- I also believe that moving some line of code and align the code can help, but I don't know exactly how the prefetch work and if the code alignment matters.
For example with this instruction : MOV bl,[si+0000], alignment may help as it will read the '0000' in one time instead of 2, but for other instructions ?

Last edited by FreddyV on 2019-05-29, 14:54. Edited 2 times in total.

Reply 28 of 634, by FreddyV

Hi, it is me again.

So, finally, to simplify the test, I added the Frequency change with the Left/Right Keys under the Output device menu 😀 (For MobyGamer)

If you want to test the speed limit, use the Debug Mode.

The 2 counters after "SlowCnt" are incremented each time the mixing code is 1/50s late and 2/50s late, respectively.
If it is 1/50s late, Mod Master does nothing: everything is mixed.
If it is 2/50s late, Mod Master skips the next channel to be mixed and increases the minimum volume to be mixed for the next time. (This is the 3rd value.)

If the second counter (in white) remains at 0, the music is mixed at 100%, so you can increase the mix frequency for as long as this value stays at 0 to find the "real" maximum frequency.
However, a non-zero counter does not necessarily mean that something was skipped: if the counter is incremented while the mixer is doing the signed-to-unsigned buffer conversion, everything has been mixed as well.
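
The behaviour of the counters can be sketched in Python as follows. This is a guess at the policy from the description only: which channel counts as "next", the size of the volume increase and the class name `MixPolicy` are all assumptions, and how the minimum volume comes back down is not modelled.

```python
class MixPolicy:
    def __init__(self):
        self.late_once = 0     # first counter: mixer was 1/50 s late
        self.late_twice = 0    # second counter: mixer was 2/50 s late
        self.min_volume = 0    # third value: quietest volume still mixed

    def channels_to_mix(self, buffers_late, channel_volumes):
        """Return the indices of the channels mixed for this buffer."""
        skip_next = False
        if buffers_late == 1:
            self.late_once += 1          # late, but everything is still mixed
        elif buffers_late >= 2:
            self.late_twice += 1
            skip_next = True
        chosen = [i for i, vol in enumerate(channel_volumes)
                  if vol > 0 and vol >= self.min_volume]
        if skip_next:
            chosen = chosen[1:]          # assumption: drop the first pending channel
            self.min_volume += 1         # assumption: raise the threshold by 1
        return chosen
```

Quiet channels are the first to disappear as the threshold rises, which fits the description that the result has missing parts of samples rather than clicks.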

You can increase the frequency further, by 2KHz or so, and get better sound quality with really few differences.

What is funny with this method is that with a 4 Channel module, you can go up to 45KHz and still recognize the music 😀 (On a 8086 8MHz)
To quit the music, press Escape and wait until the player has some time to process it (key reading is in the foreground display update code).

Attachment: MODMXT2.zip (68.76 KiB, 21 downloads)
File comment: Mod Master 2.2 XT Rev 2
File license: Fair use/fair dealing exception

Reply 29 of 634, by MobyGamer

FreddyV wrote:

You don't have a standard serial mouse to put in the serial port ?

IBM PC doesn't come with a serial port unless you add one 😀 but I will find something. Actually, I can use a joymouse (mouse driver for joystick).

- By integrating the index displacement in the mixing code, I save the slow "Based index relative" (EA Calculation needs 11 or 12 Cycles !)
ADD DH,DL / ADC SI,BP takes 6 Cycles

When it comes to 808x optimization, the smaller code is almost always faster -- the bottleneck for just about everything is taking 4 cycles just to read a byte of memory (on 8086, 4 cycles to read a word of memory). So if you're replacing one sequence with another sequence that is larger, you should always profile (measure, benchmark) the change to ensure it really is faster. Do not always believe Intel published instruction cycle counts; they're "rounded" upwards to 4 cycles because reading memory is so slow and the prefetch queue is so small. The only instructions that really do take a long time are the MUL/DIV instructions, so if you can replace a 144-cycle MUL with 8-10 instructions, the latter will be faster. Nobody is perfect; I have optimized code only to find that it was slower.

There is no substitute for profiling 😀 Use the Zen timer if you don't want to write timer code: ftp://ftp.oldskool.org/pub/misc/ZTIMER11.ZIP

If GLX Code was like this, It may be faster and have a better sound quality:

    mov ax,ES:[DI]
    mov bl,[si+0000]
    add al,ES:[bx]
    mov bl,[si+0000]
    add ah,ES:[bx]
    stosw

It would be slower. The original inner loop is 12 bytes but your changes make it 14 bytes.

- I saw in documentation that MOV AX,mem is done in 10 Cycles, whatever is the addressing mode used.
I believe it is wrong as it surely take longer to read this instruction MOV AX,ES:[DI+xxx] (5 Bytes) than this one MOV AX,[DI] (2 Bytes)

The specific form "MOV AX,[1234]" is 3 bytes (A1 34 12) and 10 cycles. Other forms are larger/longer.

- I did not find the cycles needed for the STOSW instruction.
- Where can I find correct documentation for instructions cycles ?

The 8086 Programmer's Reference from Intel. I believe a copy is here: ftp://ftp.oldskool.org/pub/misc/References/80 … Programming.rar
You can also consult an online reference; here's one of many: http://www.stanislavs.org/helppc/idx_assembler.html

- I also believe that moving some line of code and align the code can help, but I don't know exactly how the prefetch work and if the code alignment matters.
For example with this instruction : MOV bl,[si+0000], alignment may help as it will read the '0000' in one time instead of 2, but for other instructions ?

It does matter, but only on 8086, not 8088. 8088 can read one byte in 4 cycles and has a 4-byte prefetch queue. 8086 can read TWO bytes in 4 cycles, and has a 6-byte prefetch queue. Since you have an 8086, you can improve performance by doing as many 16-bit reads/writes as possible, and ensuring your data is aligned to word boundaries. STOSW writing to a word-aligned address will write the word in one operation -- if writing to an odd address, it will take two operations.
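
The effect of alignment on data accesses can be counted with a small Python helper. It is a simplified model in which each bus operation moves one aligned unit (1 byte on 8088, 2 bytes on 8086); `bus_operations` is a hypothetical name.

```python
def bus_operations(address, size, bus_width):
    """Bus operations (about 4 clocks each) to transfer `size` bytes at
    `address`. bus_width is 1 for the 8088 and 2 for the 8086."""
    first_unit = address // bus_width
    last_unit = (address + size - 1) // bus_width
    return last_unit - first_unit + 1
```

On the 8086 a word at an even address takes one operation and a word at an odd address takes two, while on the 8088 a word always takes two, so alignment changes nothing there.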

Instruction alignment seems like it would help a little, as the 8086 prefetch queue reads one word at a time, but there's no way to take advantage of it. You can't manually align the code with NOPs or something because the padding necessary to do so would eat more time than you would save. And even ensuring that subroutines start on a word boundary doesn't matter because, when you JMP/CALL/INT to them, the prefetch queue is emptied. So there's no point worrying about instruction alignment.

Even though these hints only benefit 8086, everyone should do them anyway. It's good practice and can benefit 8088 in other subtle ways (like better design resulting in smaller code).

FreddyV wrote:

So, finally, to simplify the test, I added the Frequency change with the Left/Right Keys under the Output device menu 😀 (For MobyGamer)

Thanks 😀 I'll try to test soon. I'll use a SBPro (I don't think I have a SB 2.0 any more), hopefully that can use/set the mono greater-than-22KHz output ok.

If it is 2/50s Late, Mod Master skip the next channel to be mixed
...
What is funny with this method is that with a 4 Channel module, you can go up to 45KHz and still recognize the music 😀 (On a 8086 8MHz)

That is funny, I'll have to give that a try 😁 While silly, it might be useful for "previewing" mods with many more channels than the speed of the hardware can adequately handle.

Reply 30 of 634, by reenigne

MobyGamer wrote:

When it comes to 808x optimization, the smaller code is almost always faster -- the bottleneck for just about everything is taking 4 cycles just to read a byte of memory (on 8086, 4 cycles to read a word of memory).

My favourite rule of thumb to get a rough idea of whether a code change will improve performance on 8088 is to add up the instruction bytes and the number of bytes of bus operations (memory/port reads/writes) that each instruction does. I guess the equivalent on 8086 is to halve the number of code bytes first and subtract 1 for each aligned word read and write. Though because of the 8086's wider bus I think the execution unit is often the larger bottleneck and the heuristic might not work as well.
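
Applied to the two inner loops discussed earlier in the thread, the 8088 version of this rule of thumb can be sketched in Python. The byte counts are my own tally of the standard 8086 encodings (with the `add` using an 8-bit displacement), and the result is only as good as the heuristic itself.

```python
def cost_8088(instructions):
    """Sum of instruction bytes and data bytes moved over the bus."""
    return sum(code + bus for _, code, bus in instructions)

# (instruction, code bytes, data bytes read or written)
GLX_BLOCK = [
    ("es: mov bl,[si+disp16]", 5, 1),
    ("mov al,[bx]",            2, 1),
    ("es: mov bl,[si+disp16]", 5, 1),
    ("mov ah,[bx]",            2, 1),
    ("add [di+disp8],ax",      3, 4),   # reads and writes a word
]
STOSW_VARIANT = [
    ("mov ax,es:[di]",         3, 2),
    ("mov bl,[si+disp16]",     4, 1),
    ("add al,es:[bx]",         3, 1),
    ("mov bl,[si+disp16]",     4, 1),
    ("add ah,es:[bx]",         3, 1),
    ("stosw",                  1, 2),
]

glx_cost = cost_8088(GLX_BLOCK)
variant_cost = cost_8088(STOSW_VARIANT)
```

The totals come out at 25 and 26, a difference small enough that only a measurement on real hardware would settle which is faster.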

MobyGamer wrote:

Instruction alignment seems like it would help a little, as the 8086 prefetch queue reads one word at a time, but there's no way to take advantage of it. You can't manually align the code with NOPs or something because the padding necessary to do so would eat more time than you would save. And even ensuring that subroutines start on a word boundary doesn't matter because, when you JMP/CALL/INT to them, the prefetch queue is emptied. So there's no point worrying about instruction alignment.

I'm not sure that's true. If I recall correctly, the bus only works on a word at a time on word-aligned addresses. So if you jump to an odd address the prefetcher will do a one-byte prefetch to get aligned and then continue with two-byte prefetches. So if a jump target (that isn't reachable by falling through from the previous instruction) is aligned, you get a "free" byte of prefetch that you don't get if it's misaligned.
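
That recollection can be turned into a tiny Python model of the prefetcher after a jump. It follows the description above only, ignores the 6-byte queue limit, and `bytes_prefetched` is a hypothetical name.

```python
def bytes_prefetched(target, bus_cycles):
    """Bytes an 8086 prefetcher has fetched after `bus_cycles` fetches
    following a jump to `target` (queue size limit ignored)."""
    address = target
    for _ in range(bus_cycles):
        address += 1 if address % 2 else 2   # odd address: one byte to realign
    return address - target
```

After a jump to an even address the first fetch delivers 2 bytes; after a jump to an odd address it delivers 1, and the odd target stays one byte behind from then on.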

Reply 31 of 634, by FreddyV

It would be slower. The original inner loop is 12 bytes but your changes make it 14 bytes.

Are you sure?
I count the same size, and my code uses faster instructions:

    mov ax,ES:[DI]          3
    mov bl,[si+0000]        4
    add al,ES:[bx]          3
    mov bl,[si+0000]        4
    add ah,ES:[bx]          3
    stosw                   1
    Total                  18

    es: mov bl,[si+0000]    5
    mov al,[bx]             2
    es: mov bl,[si+0000]    5
    mov ah,[bx]             2
    add [di+0],ax           4
    Total                  18

Also, we don't need the ADD DI,20 at the end of the mixing block, and there is no need to modify this part of the code 😀

Anyway, I did the comparison: the Mod Master code is 2 bytes bigger than the GLX one. (Only)
But it mixes 64 samples per loop instead of 10, and it does not use self-modifying code; this clearly explains the difference we see 😀
Counted over the mixing of a complete buffer, the code size is close, and GLX does many more loop iterations, which cost a lot (each jump empties the prefetch queue).

Using self-modifying code requires more loop iterations in the end.

The problem with my code is that it uses more memory: the interface is written in Turbo Pascal, it does many more things, and the sample extension needs to be longer.

Reply 32 of 634, by reenigne

FreddyV wrote:
 add [di+0],ax         4

The code is set up so that the offset from di is always in the -128 to 127 range, so this instruction uses the 3 byte encoding.

Reply 33 of 634, by FreddyV

reenigne wrote:
FreddyV wrote:
 add [di+0],ax         4

The code is set up so that the offset from di is always in the -128 to 127 range, so this instruction uses the 3 byte encoding.

Ok, nice to know 😀
Thanks for all your infos.

I realized that I did not read the cycle count documentation correctly.
I started coding again 2 months ago, after a break of 20 years. As I was writing 486 code back then, I did not go into all the instruction details.

So I think I can try a mix of my code and GLX's, but I would not like to lower the sound quality...
I may try it just for fun, but go back to my code afterwards.

I also have work to do to reduce the code size, optimize the partition (pattern) reading, test on 8088 and add Tandy DAC support (easy, I presume).

If anybody has suggestions for improvement, feel free to share them.

Reply 34 of 634, by FreddyV

Hi,

I was at a retro computing convention this weekend.
I was the only one with a PC, surrounded by Amigas 😀
It was fun to play 8-channel MODs when an Amiga 500 can't.

They were really surprised by this PC playing MODs.

I added one more buffer: the result is nice.
I replaced the sound card in the PC1640 with a Sound Blaster Pro and there are no more clicks: the sound is excellent, and it motivates me to put back the stereo support.

Reply 35 of 634, by Scali

FreddyV wrote:

It was fun to play 8-channel MODs when an Amiga 500 can't.

The Amiga can, actually 😀
The most well-known 8-channel tracker on Amiga is OctaMED: https://en.wikipedia.org/wiki/OctaMED
But there were also more 'traditional' SoundTracker/ProTracker-clones with 8-channel support. One I recall is Fairlight's Startrekker: https://www.pouet.net/prod.php?which=13415

Reply 37 of 634, by Scali

FreddyV wrote:

I have been told that 8 Channels tracker were not made on amiga 500.

Then you were told wrong. Both work fine on an Amiga 500.

Reply 38 of 634, by reenigne

Presumably the confusion has come about because the Amiga can only mix up to 4 channels in hardware. To have more channels you need to do software mixing (just like a PC with SoundBlaster can only output 1 channel in hardware so you need to do software mixing to output multiple channels there too).

Reply 39 of 634, by Scali

Yes, software mixing was used on Amiga from time to time, in games such as Turrican II, to allow for one or more extra channels for sound effects, while the music can continue playing without sacrifice.
It is mentioned on MobyGames:
https://www.mobygames.com/game/turrican-ii-th … al-fight/trivia

The game's music was originally written for the Amiga using the TFMX sound format - some of it actually contains 7 channels! This was actually achieved by the technique used to play tracker modules on the Atari ST, on which the channels had to be mixed by software rather than hardware like on the Amiga. The Amiga 7 channels sound routines was invented by ST music programmer Jochen Hippel and was adapted by Chris Hülsbeck for the Turrican title tune. Basically the method uses one hardware channel of the Amiga soundchip to play 4 software mixed channels, resulting in 7 independent channels.

http://scalibq.wordpress.com/just-keeping-it- … ro-programming/