The problem with PDF is that, at worst, it's just a bunch of individual characters placed at particular co-ordinates on a page. (I.e., words and lines of text are all in the eye of the beholder.) Now, the particular PDF files you have might be better behaved than that, but I don't know. In any case, PDF files are complex data structures, so parsing them is complex, and extracting the text is not straightforward.
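To make that concrete, here's a simplified (hand-written, illustrative) fragment of the kind of content stream a PDF page can contain. Each character is just painted at a coordinate; nothing says "this is part of a word":

```
BT            % begin text object
/F1 12 Tf     % select font F1 at 12 points
234 1743 Td   % move the text position to (234, 1743)
(A) Tj        % paint the single character "A"
ET            % end text object
```

Recovering words and lines from a stream like that means clustering glyphs by their coordinates, which is why PDF text extraction is heuristic at best.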
Now, technically, an XML or XHTML file could be just as hairy as a PDF. (E.g., you could have an XML file that is just a list of elements like <letter loc="234,1743">A</letter>.) But in practice, they aren't. If you can look at an XML/XHTML file and see the text you're interested in, then it will probably be easy to extract it programmatically.
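As a sketch of how little code the XML case takes (using Python's standard library, and the hypothetical worst-case "letter" markup from above as input):

```python
import xml.etree.ElementTree as ET

# Hypothetical worst-case XML: individual characters placed at
# coordinates, PDF-style. Even this is trivial to parse.
doc = (
    '<page>'
    '<letter loc="234,1743">A</letter>'
    '<letter loc="246,1743">B</letter>'
    '</page>'
)

root = ET.fromstring(doc)
# Pulling the raw characters out is a one-liner...
chars = [el.text for el in root.iter("letter")]
print("".join(chars))  # AB
# ...though reassembling words and lines would still require the
# loc attributes, exactly as with PDF. A normal XML/XHTML document,
# where text sits in sensible elements, avoids that problem entirely.
```

In the realistic case, where the text you want lives in ordinary elements, `"".join(root.itertext())` gets you the whole thing in one call.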
EPUB would be comparable to XML/XHTML in terms of losslessness (an EPUB is essentially a zip archive of XHTML files plus some metadata), but the extra layer of packaging makes it a bit more complicated to deal with.
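A minimal sketch of that "zip of XHTML" structure, again with only the standard library (the archive is built in memory here just to keep the example self-contained; a real EPUB also has a manifest, container.xml, etc.):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# An EPUB is, at heart, a zip archive of XHTML files. Build a toy
# one in memory with a single chapter, then read the text back out.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>Hello, world.</p></body></html>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("OEBPS/chapter1.xhtml", xhtml)

# Extraction: unzip the member, then parse it like any other XML.
with zipfile.ZipFile(buf) as z:
    root = ET.fromstring(z.read("OEBPS/chapter1.xhtml"))
    text = "".join(root.itertext())
print(text)  # Hello, world.
```

So the extra work over plain XHTML is mostly bookkeeping: opening the archive and finding the right files (via the manifest, in a real EPUB), not decoding the text itself.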
It would probably be a good idea to find out how the documents were authored, and how the various formats were derived. (I.e., if the assumption that the files are faultless is incorrect, that might have a bigger effect on the choice of format to use.)