You can write your own programs with whatever calling conventions your compiler offers, but you must stick to the original conventions used by the libraries you're trying to replace.
Besides, fastcall was never standardized. It differs between compilers, so there's no point in using it in public APIs.
With Turbo C, the correct approach for a cdecl function is to push all the arguments onto the stack in reverse order (last argument first) and to expect the result in AX, or in DX:AX for 32-bit values.
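To make the ordering concrete, here is a small portable C model of what a cdecl caller does. It is only a sketch, not Turbo C output: ToyStack, toy_push, push_cdecl_args and split_dx_ax are names made up for this illustration, and the "stack" is just an array that grows downward like the real one.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a 16-bit stack that grows toward lower addresses.
   All names here are hypothetical; nothing below is a Turbo C API. */
typedef struct {
    uint16_t words[16];
    int sp;                 /* index of the current top of stack */
} ToyStack;

static void toy_init(ToyStack *s)
{
    s->sp = 16;             /* empty stack: SP sits just past the end */
}

static void toy_push(ToyStack *s, uint16_t w)
{
    assert(s->sp > 0);      /* the toy stack must not overflow */
    s->words[--s->sp] = w;
}

/* Push arguments the way a cdecl caller does: last argument first,
   so the first argument ends up at the lowest address. */
static void push_cdecl_args(ToyStack *s, const uint16_t *args, int count)
{
    int i;
    for (i = count - 1; i >= 0; --i)
        toy_push(s, args[i]);
}

/* Split a 32-bit result the way it comes back in DX:AX:
   high word in DX, low word in AX. */
static void split_dx_ax(uint32_t value, uint16_t *dx, uint16_t *ax)
{
    *dx = (uint16_t)(value >> 16);
    *ax = (uint16_t)(value & 0xFFFFu);
}
```

After pushing the arguments 1, 2, 3 this way, the top of the toy stack holds 1, then 2, then 3 at increasing addresses, which is the layout the callee sees above its return address. With cdecl it is also the caller that removes the arguments from the stack after the call returns.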
When returning structures, the method depends on the structure size: small structures are packed into AX or DX:AX, while larger structures are returned via a statically allocated buffer, with a pointer to it in DX:AX (far pointer) or AX (near pointer). Turbo C usually just copies structure bodies and does not implement any kind of return value optimization (RVO).
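The large-structure case can be pictured in plain C roughly like this. It is a sketch of the idea rather than what the compiler literally emits: Big, make_big_lowered and call_make_big are hypothetical names, and the assumption is that Big is too large to fit in DX:AX.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical structure, assumed too large to come back in DX:AX. */
typedef struct {
    int x, y, z;
} Big;

/* Roughly what the callee does: fill a static buffer and return its
   address (the pointer is what travels in AX or DX:AX). */
static Big *make_big_lowered(int x, int y, int z)
{
    static Big result_buffer;

    assert(sizeof(Big) > 4);    /* assumption: does not fit in DX:AX */
    result_buffer.x = x;
    result_buffer.y = y;
    result_buffer.z = z;
    return &result_buffer;
}

/* Roughly what the caller does: copy the structure body out of the
   buffer right away, since the next call will overwrite it. */
static Big call_make_big(int x, int y, int z)
{
    Big local;
    memcpy(&local, make_big_lowered(x, y, z), sizeof local);
    return local;
}
```

An assembly replacement for such a function has to play the callee's role in the same way: the caller expects a pointer to the structure body in the return registers, not the body itself. Every call in this sketch returns the same buffer address, which is why the immediate copy matters.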
You are free to clobber AX, BX, CX, DX and ES. You must preserve BP (along with SP and the segment registers DS and SS), and you must restore SI and DI if you clobbered them while the "register variables" optimization is enabled.