Hmm, I don't fully see the advantage of splitting work between threads when it can be done in a single thread. To avoid unwanted preemption of the thread that renders audio with mt32emu (which is by nature a realtime process), we'd rather need to boost the priority of that thread, as the mentioned JACK server does. Note that its authors apparently see no such advantage either: JACK 1 is purely single-threaded, and only JACK 2 has limited support for multi-threading, when the audio processing graph permits it.
However, I do clearly see disadvantages of multi-threading on a uni-processor system. All the usual penalties like context switching still apply, and an even more significant cost is thread synchronisation. The measurements we did at an early stage showed that the trade-off was unsatisfactory: CPU load did increase, but nothing like reliable rendering was achieved as a result. Therefore, I doubt that e.g. 2 rendering threads would help at all. But feel free to test on your particular device. I can't guarantee anything when it comes to performance, yet the Pi2 does have more than two cores...
On the other hand, multi-threading works just great when the threads need little synchronisation, e.g. in the Falcosoft VSTi plugin, which runs two semi-independent mt32emu cores in the Dual Synth mode. However, when it comes to offloading the rendering of several partials to each thread, the synchronisation is no longer cheap. The rest of the emulation takes significantly less CPU time than the partials do. Dunno, maybe it's actually worth it if you configure, say, 256 partials to render. With such a high load I'd expect a performance boost. But 256 partials have nothing to do with the MT-32, to be honest 😉
So, for any uni-processor system, I'd concentrate on setting thread affinity and boosting the priority of the rendering thread to the maximum. We currently have a cool JACK interface in mt32emu-qt (in another git branch) that does lock-free rendering in the JACK realtime thread, so this may already help with fighting the dropouts.
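To make the priority and affinity idea a bit more concrete, here is a minimal sketch of what the rendering thread could call on itself at startup. It assumes Linux with glibc (`pthread_setaffinity_np` is a GNU extension) and is not code from mt32emu or mt32emu-qt; the names `RealtimeSetupResult` and `makeCurrentThreadRealtime` are made up for illustration.

```cpp
#include <pthread.h>
#include <sched.h>

// Outcome of a best-effort attempt to make the calling thread realtime-friendly.
struct RealtimeSetupResult {
    bool priorityRaised; // SCHED_FIFO at the maximum priority was granted
    bool affinityPinned; // the thread was pinned to the requested CPU
};

// Hypothetical helper (not part of mt32emu): call it from the rendering thread.
RealtimeSetupResult makeCurrentThreadRealtime(int cpuIndex) {
    RealtimeSetupResult result = { false, false };

    // Ask for the realtime FIFO policy at its highest priority.
    // This usually needs CAP_SYS_NICE or a suitable rtprio limit,
    // so a failure here is expected for unprivileged users.
    sched_param param = sched_param();
    param.sched_priority = sched_get_priority_max(SCHED_FIFO);
    result.priorityRaised =
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;

    // Pin the thread to one CPU so the scheduler does not migrate it.
    if (cpuIndex >= 0 && cpuIndex < CPU_SETSIZE) {
        cpu_set_t cpus;
        CPU_ZERO(&cpus);
        CPU_SET(cpuIndex, &cpus);
        result.affinityPinned =
            pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus) == 0;
    }
    return result;
}
```

Both steps are allowed to fail here, and the caller can simply log the result and carry on with default scheduling. With JACK, the realtime thread is set up by the server, so something like this would only matter for other audio backends.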
But with 4-8 cores, e.g. on the Pi2, multi-threading at the partial level may also work fine. There is another thing to consider, though: compatibility. Damn C++98, which we have to stick with for now, has no notion of threads, so an external dependency like OpenMP would need to be involved. But I'm secretly hoping to move on to C++11 very soon, and then this problem will disappear. So I'm not keen on investing time into this area right now.
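Just to illustrate where the synchronisation cost comes from, here is a rough C++11 sketch of partial-level threading. It is a toy, not how mt32emu works: `FakePartial` and `renderBlock` are hypothetical names, and the "partial" merely adds a constant to the buffer. Each worker mixes its share of partials into a private buffer, and the main thread has to wait for every worker on every block before it can produce the final mix.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a partial; real partials are far more involved.
struct FakePartial {
    float amplitude;
    // Adds this partial's (constant) output to the given buffer.
    void renderAdd(float *out, std::size_t frames) const {
        for (std::size_t i = 0; i < frames; ++i) out[i] += amplitude;
    }
};

// Renders one block, splitting the partials across workerCount threads.
std::vector<float> renderBlock(const std::vector<FakePartial> &partials,
                               std::size_t frames, std::size_t workerCount) {
    if (workerCount == 0) workerCount = 1;
    // One private scratch buffer per worker, so no locking while rendering.
    std::vector<std::vector<float>> scratch(
        workerCount, std::vector<float>(frames, 0.0f));
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers.emplace_back([&partials, &scratch, frames, workerCount, w] {
            // Worker w handles partials w, w + workerCount, w + 2 * workerCount, ...
            for (std::size_t p = w; p < partials.size(); p += workerCount) {
                partials[p].renderAdd(scratch[w].data(), frames);
            }
        });
    }
    std::vector<float> mix(frames, 0.0f);
    for (std::size_t w = 0; w < workerCount; ++w) {
        workers[w].join(); // the per-block synchronisation point
        for (std::size_t i = 0; i < frames; ++i) mix[i] += scratch[w][i];
    }
    return mix;
}
```

Spawning threads for every block, as done here for brevity, is itself expensive. A real implementation would keep a persistent pool of workers, but it would still have to wait for all of them once per block, and that wait is the part that hurts when the per-block work is small.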