The technical background is like this:
In order to use hardware T&L, the GPU wants the data in ready-to-use buffers, preferably in video memory.
In DX6 and earlier, you passed a system-memory pointer directly to the DrawPrimitive() call.
So the driver didn't see the data until it was time to draw, which makes it hard to make any assumptions about it. The same buffer can contain different geometry on every call (the only way to tell would be to keep a copy of the whole buffer and compare it every time, which is too costly), so there's no point in optimizing the data for GPU T&L and storing it in video memory, because it may only be used once.
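To make the cost concrete, here's a purely hypothetical illustration (not real driver code) of the only check a pre-DX7 driver could do to detect that an application keeps re-submitting the same geometry:

    // Hypothetical driver-side cache: keep a copy of the last submitted buffer and
    // memcmp it on every draw call. The compare and the copy are both O(size) per
    // draw, which is exactly the cost nobody wanted to pay.
    #include <cstring>
    #include <vector>

    struct VertexDataCache {
        std::vector<unsigned char> lastCopy;

        bool UnchangedSinceLastDraw(const void* appData, size_t bytes) {
            bool same = lastCopy.size() == bytes &&
                        std::memcmp(lastCopy.data(), appData, bytes) == 0;
            // Remember the data for the next draw call's comparison.
            const unsigned char* p = static_cast<const unsigned char*>(appData);
            lastCopy.assign(p, p + bytes);
            return same;
        }
    };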
In DX7, Microsoft introduced vertex buffers, which you had to lock before you could modify them. So now the API knew *exactly* when a buffer was being modified. Also, you could specify that you wanted your buffers stored in video memory. So the API had enough heuristics to know when data was static, and when it could be uploaded and optimized for GPU T&L.
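The DX7 pattern looked roughly like this (a sketch from memory of the IDirect3D7 / IDirect3DVertexBuffer7 interfaces, so treat the exact names and flags as approximate; error handling trimmed):

    #include <d3d.h>       // DirectX 7 headers
    #include <cstring>

    struct Vertex { float x, y, z; DWORD color; };   // matches D3DFVF_XYZ | D3DFVF_DIFFUSE

    IDirect3DVertexBuffer7* CreateStaticBuffer(IDirect3D7* d3d, IDirect3DDevice7* dev,
                                               const Vertex* src, DWORD count) {
        D3DVERTEXBUFFERDESC desc = {};
        desc.dwSize        = sizeof(desc);
        desc.dwCaps        = D3DVBCAPS_WRITEONLY;     // we never read it back; the runtime
                                                      // is free to place it in video memory
        desc.dwFVF         = D3DFVF_XYZ | D3DFVF_DIFFUSE;
        desc.dwNumVertices = count;

        IDirect3DVertexBuffer7* vb = nullptr;
        d3d->CreateVertexBuffer(&desc, &vb, 0);

        // The Lock/Unlock pair is what tells the runtime *exactly* when the data changes.
        void* data = nullptr;
        DWORD size = 0;
        vb->Lock(DDLOCK_WAIT | DDLOCK_WRITEONLY, &data, &size);
        std::memcpy(data, src, count * sizeof(Vertex));
        vb->Unlock();

        vb->Optimize(dev, 0);    // let the driver convert the data into its preferred layout
        return vb;
    }

    // At draw time the runtime knows the contents haven't changed since the last Unlock:
    //   dev->DrawPrimitiveVB(D3DPT_TRIANGLELIST, vb, 0, count, 0);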
In OpenGL, you had display lists, which already provided enough heuristics for optimizing the data for GPU T&L. That meant even legacy applications could automatically benefit from it.
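For example (assumes a current GL context; classic immediate-mode calls compiled into a list):

    #include <GL/gl.h>

    GLuint BuildTriangleList() {
        GLuint list = glGenLists(1);
        glNewList(list, GL_COMPILE);     // the driver records these commands; a compiled list
                                         // can't change, so it can copy the data wherever it likes
        glBegin(GL_TRIANGLES);
        glVertex3f(-1.0f, -1.0f, 0.0f);
        glVertex3f( 1.0f, -1.0f, 0.0f);
        glVertex3f( 0.0f,  1.0f, 0.0f);
        glEnd();
        glEndList();
        return list;
    }

    // Every frame: glCallList(list);  The geometry is known to be immutable, so the driver
    // may already have uploaded a GPU-friendly copy to video memory.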
OpenGL later got vertex buffer objects as well, similar to DX7's vertex buffers, which give more explicit control for maximum efficiency.
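For comparison, a static VBO (OpenGL 1.5 / ARB_vertex_buffer_object; on Windows these entry points come from an extension loader such as GLEW or GLAD):

    #include <GL/gl.h>     // plus an extension loader for glGenBuffers etc. where needed

    GLuint CreateStaticVBO(const float* vertices, GLsizeiptr bytes) {
        GLuint vbo = 0;
        glGenBuffers(1, &vbo);
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // GL_STATIC_DRAW is the explicit usage hint ("written once, drawn many times"),
        // which is what lets the driver keep the data in video memory.
        glBufferData(GL_ARRAY_BUFFER, bytes, vertices, GL_STATIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        return vbo;
    }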