So in my implementation, the MDA/CGA text modes and CGA/VGA graphics modes behave identically. The only difference is the parsing of the VRAM data, which happens at the character clock (every 4 or 8 pixels). At that point, 2 bytes are fetched from VRAM on the MDA/CGA (4 on the EGA/VGA). This also determines the memory behaviour (when a CPU write to the data becomes visible).
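As a minimal sketch of that fetch step (all names here are hypothetical, not the actual implementation): in text mode the two bytes latched at the character clock are the character code and its attribute, and a CPU write to the same VRAM location is only picked up at the next character clock.

```c
#include <stdint.h>

/* Hypothetical sketch: latch 2 VRAM bytes at the character clock
   (character + attribute in CGA/MDA text mode). On EGA/VGA, 4 bytes
   (one per plane) would be latched instead. */
static uint8_t VRAM[0x4000]; /* CGA: 16KB of video memory */
static uint8_t latch_char, latch_attr;

void character_clock_fetch(uint16_t charaddr)
{
    /* Text mode layout: even byte = character code, odd byte = attribute.
       Because the CPU writes this same VRAM array, a write landing before
       this fetch is seen now; a later write waits for the next fetch. */
    latch_char = VRAM[(charaddr << 1) & 0x3FFF];
    latch_attr = VRAM[((charaddr << 1) | 1) & 0x3FFF];
}
```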
Depending on the video mode, the latched data is then split, using 8 bit shifts, into 8 attribute input pixels and their font/background selection bits.
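A sketch of that predecode for the text-mode case (names hypothetical): the font ROM row supplies one font/background bit per pixel via shifts, while the attribute byte is the same for all 8 pixels of the character clock.

```c
#include <stdint.h>

/* Hypothetical sketch: predecode one character clock into 8 per-pixel
   inputs for the attribute stage. In text mode the font ROM row gives
   the font/background bit; in graphics modes, bits of the latched data
   bytes would select the color instead. */
typedef struct {
    uint8_t attribute; /* raw color (attribute) input */
    uint8_t fontbit;   /* 1 = foreground (font), 0 = background */
} pixel_input;

void predecode_textmode(uint8_t fontrow, uint8_t attr, pixel_input out[8])
{
    for (int i = 0; i < 8; ++i) {
        out[i].fontbit = (fontrow >> (7 - i)) & 1; /* MSB shifts out first */
        out[i].attribute = attr;
    }
}
```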
For the next 8 clocks (can be 16/32 on the VGA) it sends the latched and predecoded (in text or graphics mode) data to the attribute controller (or the CGA's hardcoded version) one pixel at a time. Two things are sent: the raw color (attribute) byte and the font/background status, together with the required attribute status, such as the current character line for underline and the blink status (which is toggled each frame, as per the hardware specs for the emulated card). All the VGA attribute registers (still emulated on the CGA/MDA) are fully functional, just driven (written into, depending on the used mode) by the CGA/MDA mode control and CGA color registers.

The attribute controller output is then either parsed by a DAC lookup (EGA/VGA) or by the display mode: the simple CGA old/new color lookup (no difference between them, though) or Reengine's NTSC routine, both of which cache the entire CGA/MDA scanline and render the 4-bit pixels appropriately. The only difference is that graphics modes force the font bit on. What's sent is exactly what passes between the VGA sequencer and the attribute controller (its input pins; see the Tseng datasheets for the exact data sent). All else is determined by the emulated attribute controller, which is no more than a precalculated LUT (for its inputs) and a toggle to merge multiple nibbles, bytes or words to create an 8- or 16-bit DAC index.
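The precalculated-LUT idea can be sketched like this (a simplified, hypothetical model covering only text-mode foreground/background/blink, not the full register set): the table is rebuilt whenever a relevant register is written, so the per-pixel path reduces to a single array lookup.

```c
#include <stdint.h>

/* Hypothetical sketch of an attribute controller as a precalculated LUT:
   one entry per (attribute byte, font bit, blink state), rebuilt on
   register writes, so per-pixel work is just an indexed load. */
static uint8_t attr_LUT[256][2][2]; /* [attribute][fontbit][blinkstate] */

void rebuild_attribute_LUT(int blink_enable)
{
    for (int attr = 0; attr < 256; ++attr)
        for (int font = 0; font < 2; ++font)
            for (int blink = 0; blink < 2; ++blink) {
                uint8_t fg = attr & 0x0F;
                /* With blink enabled, bit 7 blinks and background is
                   limited to 3 bits; otherwise it's a 4-bit background. */
                uint8_t bg = blink_enable ? ((attr >> 4) & 0x07)
                                          : ((attr >> 4) & 0x0F);
                int blanked = blink_enable && (attr & 0x80) && blink;
                attr_LUT[attr][font][blink] = (font && !blanked) ? fg : bg;
            }
}

/* Per-pixel path: just the lookup. */
uint8_t attribute_controller(uint8_t attr, int fontbit, int blinkstate)
{
    return attr_LUT[attr][fontbit & 1][blinkstate & 1];
}
```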
Vertical retrace also flips the double buffer in my GPU routines, so rendering continues into a freshly zeroed screen buffer. If not all scanlines are rendered at the same width (the widest one detected applies), shorter scanlines display black for the remainder.
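That flip can be sketched as follows (names and buffer sizes hypothetical): the just-finished buffer becomes the displayed frame, and the new render target is zeroed, which is exactly why a scanline shorter than the widest one shows black for its missing tail.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the double-buffer flip at vertical retrace. */
enum { MAX_W = 800, MAX_H = 600 };
static uint32_t framebuffers[2][MAX_H][MAX_W];
static int renderbuffer = 0; /* index currently being rendered into */

int vertical_retrace(void)
{
    int displayed = renderbuffer;     /* finished frame: show this one */
    renderbuffer ^= 1;                /* flip to the other buffer */
    memset(framebuffers[renderbuffer], 0, /* zero the new render target */
           sizeof(framebuffers[renderbuffer]));
    return displayed;
}
```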
So in short: the memory data is latched at the character clock and split into font bits and the attribute color to use. For the following 7 pixel clocks it does NOPs otherwise, shifting out the latched values one pixel at a time into the attribute controller, which handles the color logic. Its output is, on the CGA/MDA, written to a single scanline buffer (1 byte per pixel), which at horizontal retrace is rendered onto the display buffer in either direct colors or NTSC mode. All other basic display handling is the same as in my EGA/(S)VGA implementation (the only differences being some slight CGA adjustments, but it's mostly intact).
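The horizontal-retrace step on the CGA/MDA path could look roughly like this (a hypothetical sketch of the direct-color branch only; the NTSC branch would decode the same cached scanline differently):

```c
#include <stdint.h>

/* Hypothetical sketch: at horizontal retrace, resolve the 4-bit pixels
   collected in the single scanline buffer to display colors in one go.
   palette16[] holds the standard 16 CGA RGBI colors (0x00RRGGBB). */
static uint8_t scanlinebuf[912];  /* 1 byte per pixel */
static uint32_t displayline[912];
static const uint32_t palette16[16] = {
    0x000000, 0x0000AA, 0x00AA00, 0x00AAAA,
    0xAA0000, 0xAA00AA, 0xAA5500, 0xAAAAAA, /* color 6 = brown */
    0x555555, 0x5555FF, 0x55FF55, 0x55FFFF,
    0xFF5555, 0xFF55FF, 0xFFFF55, 0xFFFFFF
};

void horizontal_retrace(int width)
{
    for (int x = 0; x < width; ++x)
        displayline[x] = palette16[scanlinebuf[x] & 0x0F];
}
```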