Should I rely upon PHPExcel to read this content if possible?
It is not possible (see below).
If not, what PHP libraries can accomplish this task?
None that I know of.
How do I go about this?
An Excel file (rather, an Excel 2003+ XLSX file - Excel97 XLS files are a wholly different can of worms) is a ZIP archive containing XML and other files in a tree structure. So your first stage is to decompress a ZIP file in a string; PHPExcel relies on the ZipArchive class, and this, in turn, does not support string reading and also bypasses most stream hacks. A similar problem - actually exactly the same problem - is described in this question.
You could think of using stream wrapping to decode the file from a string, and the first part - the reading - would work. The writing of the files would not. And you cannot modify the ZipArchive class so that it writes to a memory object, because it is a native class.
So you can employ a slight variation, from one of the answers above (the one by toster-cx). You need to decode the ZIP structure yourself, and thus get the offset in the ZIP file where the file you need begins. This will either be /xl/worksheets/sheet1.xml
or /xl/sharedStrings.xml
, depending on whether the string has been inlined by Excel, or not. This also assumes that the format is the newer XLSX. Once you have that, you can extract the data from the string and decompress it, then search it for the token.
Of course, a more efficient use of the time would be to determine exactly why you don't want to use temporary files. Maybe that problem can be solved another way.
Speed problem
Actually, reading/writing an Excel file is not so terrible, because in this case you don't need to do that. You can almost certainly consider it a Zip file, and open it using ZipArchive and getStream() to directly access the internal sub-file you're interested in. This operation will be quite fast, also because you can run the search from the getStream() read cycle. You do need to write the file once, but nothing more.
In fact, chances are that you can write the file while it is being uploaded (what do you use for Web upload? The plupload JS library has a very nice hook to capture very large files one chunk at a time). You still need a temporary area on the disk where to store the data, but in this case the time expenditure will be exclusively dedicated to the decompression and reading of the XML sub-file - the same thing you'd have needed to do with a string object.
It is also (perhaps, depending on several factors, mainly the platform and operating system) possible to offload this part of the work to a secondary process running in the background, so that the user sees the page reload immediately, while the information appears after a while. This part, however, is pretty tricky and can rapidly turn into a maintenance nightmare (yeah, I do have first-hand experience on this. In my case it was tiled image conversion).
Cheating
OK, fact is I love cheating; it's so efficient. You say that you control the XLSX and PDF being created? Well! It turns out that in both cases, you can add hidden metadata to the file. And those metadata are much more easily read than you might think.
For example, you can add zip archive comments to a XLSX file, since it is a Zip file. Actually you could add a fake file with zero length to the archive, call it INVOICE_TOTAL_12345.xml
, and that would mean that the invoice total is 12345. The advantage is that the file names are stored in the clear inside the XLSX file, so you can just use preg_match
and look for INVOICE_TOTAL_([0-9]+)\.xml
and retrieve your total.
Same goes for PDF. You can store keywords in a PDF. Just add a keyword attribute named "InvoiceTotal" (check the PDF to see how that turns out). But there is also a PDF ID inside the PDF, and that ID will be at the very end of the PDF. It will be something like /ID [<ec144ea3ecbb9ab8c22b413fec06fe29><ec144ea3ecbb9ab8c22b413fec06fe29>]^
, but just use a known sequence such as deadbeef
and ec144ea3ecbb9ab8c22deadbeef12345
will, again, mean the total is 12345. The ID before the known sequence will be random, so the overall ID will still be random and valid.
In both cases you could now just look for a known token in the string, exacly as requested.