Not sure about libraries, and you also haven't mentioned the format of the video input files (I'll presume they are in a compressed format like H.264, since raw input is just a simpler case of the same problem), but if I needed to do this on Windows, I'd do the following:
1) Read and decode the frames from the input files (with either FFmpeg or VFW) and then copy the decoded data into a larger bitmap sized to hold all 4 screens
2) Since it is now a raw bitmap, apply the text or whatever else is needed using e.g. DrawText
(http://msdn.microsoft.com/en-us/library/windows/desktop/dd162498(v=vs.85).aspx). To ease the use of WinAPI you could use some GDI wrapper library.
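To make step 1 a bit more concrete, here is a rough sketch of the copy into the larger bitmap, in plain Python so the indexing is easy to follow. It assumes four decoded frames of identical size, stored top-down with tightly packed rows (e.g. RGB24). The name `tile_2x2` is just made up for illustration. Note that real Windows DIBs are usually bottom-up with rows padded to 4 bytes, so the offsets would need adjusting there.

```python
def tile_2x2(frames, width, height, bpp=3):
    """Copy four equally sized raw frames into one 2x2 canvas.

    Assumption: each frame is a bytes-like object of width*height*bpp
    bytes, row-major, top-down, with no padding at the end of a row.
    """
    if len(frames) != 4:
        raise ValueError("expected exactly four frames")

    row = width * bpp            # bytes per row of one input frame
    canvas_row = 2 * row         # bytes per row of the output canvas
    for frame in frames:
        if len(frame) != row * height:
            raise ValueError("frame size does not match width/height/bpp")

    canvas = bytearray(canvas_row * 2 * height)
    for index, frame in enumerate(frames):
        x_off = (index % 2) * row       # left or right half, in bytes
        y_off = (index // 2) * height   # top or bottom half, in rows
        for y in range(height):
            dst = (y_off + y) * canvas_row + x_off
            src = y * row
            canvas[dst:dst + row] = frame[src:src + row]
    return canvas
```

Frames 0 and 1 land in the top row of the canvas and frames 2 and 3 in the bottom row. If the inputs have different resolutions, they would have to be scaled to a common size first, which this sketch does not do.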
I guess one of the main pitfalls here is properly synchronizing the presentation times of the frames from the different files. They can all have different frame rates and gaps in time, so you can't just read frame by frame; you need to keep track of which frame from which file is supposed to be presented at each step when applying the transformations you need.
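One possible way to do that bookkeeping (only a sketch, not the only approach) is to pick a fixed output frame rate and, for every output tick, show the latest frame from each file whose presentation time is not after that tick. The code below assumes you have already collected the presentation timestamps of each file in seconds, sorted ascending. `frames_for_tick` and `schedule` are hypothetical helper names, not part of FFmpeg or VFW.

```python
from bisect import bisect_right


def frames_for_tick(pts_lists, t):
    """For each stream, return the index of the latest frame with pts <= t.

    pts_lists: one ascending list of presentation times (seconds) per file.
    Returns None for a stream that has no frame at or before time t yet.
    """
    result = []
    for pts in pts_lists:
        i = bisect_right(pts, t) - 1
        result.append(i if i >= 0 else None)
    return result


def schedule(pts_lists, out_fps, duration):
    """Build the per-tick frame selection for a fixed-rate output."""
    ticks = int(duration * out_fps)
    return [frames_for_tick(pts_lists, n / out_fps) for n in range(ticks)]


# Stream A runs at a steady 2 fps, stream B has a gap after 0.25 s.
a = [0.0, 0.5, 1.0, 1.5]
b = [0.0, 0.25, 1.25]
plan = schedule([a, b], out_fps=4, duration=1.5)
```

With this, a file that has a gap simply keeps showing its last frame until a newer one becomes due, and a file with a higher frame rate than the output has some of its frames skipped. With real timestamps it is probably safer to compare in the stream's integer time base rather than in floating-point seconds.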