PHP - parsing contents of excel/pdf file already retrieved and stored in variable, without having to save contents to a file on disk

Question

Here is the scenario:

I have a variable in php that has raw contents of an excel file and I want to parse the contents of that variable (which is in an excel format, or can also be in a pdf format) for a certain value. I am looking for a keyword near the end of the contents of the file and will need to extract some of the contents near the desired value inside the contents of the file so I can get it into a variable in php and output to my webpage. From what I know the file is in binary, or hex representation but the ascii conversion is represented as readable text with diamond characters (with a question mark) and rectangles with a border and other extraneous characters including readable text content.

Here are the requirements:

I don't want parse the contents of the file by first storing or saving on disk. I want to parse the contents of the retrieved file directly while in a php variable.

Here is my question:

How do I go about this? Should I rely upon PHPExcel to read this content if possible? If not, what php libraries can accomplish this task?

No need to vote closure, I will update question – Vahe Jan 17 '21 at 20:25 — Vahe, Jan 17 '21 at 20:25

LSerni · Accepted Answer · 2021-01-17T23:05:44.000

Should I rely upon PHPExcel to read this content if possible?

It is not possible (see below).

If not, what PHP libraries can accomplish this task?

None that I know of.

How do I go about this?

An Excel file (rather, an Excel 2003+ XLSX file - Excel97 XLS files are a wholly different can of worms) is a ZIP archive containing XML and other files in a tree structure. So your first stage is to decompress a ZIP file in a string; PHPExcel relies on the ZipArchive class, and this, in turn, does not support string reading and also bypasses most stream hacks. A similar problem - actually exactly the same problem - is described in this question.

You could think of using stream wrapping to decode the file from a string, and the first part - the reading - would work. The writing of the files would not. And you cannot modify the ZipArchive class so that it writes to a memory object, because it is a native class.

So you can employ a slight variation, from one of the answers above (the one by toster-cx). You need to decode the ZIP structure yourself, and thus get the offset in the ZIP file where the file you need begins. This will either be /xl/worksheets/sheet1.xml or /xl/sharedStrings.xml, depending on whether the string has been inlined by Excel, or not. This also assumes that the format is the newer XLSX. Once you have that, you can extract the data from the string and decompress it, then search it for the token.

Of course, a more efficient use of the time would be to determine exactly why you don't want to use temporary files. Maybe that problem can be solved another way.

Speed problem

Actually, reading/writing an Excel file is not so terrible, because in this case you don't need to do that. You can almost certainly consider it a Zip file, and open it using ZipArchive and getStream() to directly access the internal sub-file you're interested in. This operation will be quite fast, also because you can run the search from the getStream() read cycle. You do need to write the file once, but nothing more.

In fact, chances are that you can write the file while it is being uploaded (what do you use for Web upload? The plupload JS library has a very nice hook to capture very large files one chunk at a time). You still need a temporary area on the disk where to store the data, but in this case the time expenditure will be exclusively dedicated to the decompression and reading of the XML sub-file - the same thing you'd have needed to do with a string object.

It is also (perhaps, depending on several factors, mainly the platform and operating system) possible to offload this part of the work to a secondary process running in the background, so that the user sees the page reload immediately, while the information appears after a while. This part, however, is pretty tricky and can rapidly turn into a maintenance nightmare (yeah, I do have first-hand experience on this. In my case it was tiled image conversion).

Cheating

OK, fact is I love cheating; it's so efficient. You say that you control the XLSX and PDF being created? Well! It turns out that in both cases, you can add hidden metadata to the file. And those metadata are much more easily read than you might think.

For example, you can add zip archive comments to a XLSX file, since it is a Zip file. Actually you could add a fake file with zero length to the archive, call it INVOICE_TOTAL_12345.xml, and that would mean that the invoice total is 12345. The advantage is that the file names are stored in the clear inside the XLSX file, so you can just use preg_match and look for INVOICE_TOTAL_([0-9]+)\.xml and retrieve your total.

Same goes for PDF. You can store keywords in a PDF. Just add a keyword attribute named "InvoiceTotal" (check the PDF to see how that turns out). But there is also a PDF ID inside the PDF, and that ID will be at the very end of the PDF. It will be something like /ID [<ec144ea3ecbb9ab8c22b413fec06fe29><ec144ea3ecbb9ab8c22b413fec06fe29>]^, but just use a known sequence such as deadbeef and ec144ea3ecbb9ab8c22deadbeef12345 will, again, mean the total is 12345. The ID before the known sequence will be random, so the overall ID will still be random and valid.

In both cases you could now just look for a known token in the string, exacly as requested.

My desire to avoid temporary files is to eliminate the need to read write and slow down web page load, though this is possibly what still will happen in loop while I decode those php files that don't have queryable information, I want to make it as much of a quick process as possible without reading writing files unnecessarily, Thank you for this well thought out answer. I will award the answer. Would you be ok with me waiting for additional inputs to accumulate a more comprehensive answer set? — Vahe, Jan 17 '21 at 21:07
I have added to my answer. It may be that you are worrying without need. ZipArchive is pretty fast. How many files are we talking of, and what size are they? — LSerni, Jan 17 '21 at 21:20
They are a various excel/pdf invoice files we create but are stored and retrieved from a vendor api that I need to fetch only when another query in a driver loop can't resolve invoice total. Main loop processing drives the fetch of each invoice total, or document fetch if the invoice total coalesces to $0 (no value found by query). I was going to output "See Invoice" but tried to think of a better way to get the total for these cases. — Vahe, Jan 17 '21 at 22:48
Note: I do not know how many concurrent requests are being made so I figured I would rely on dynamic memory to handle processing. — Vahe, Jan 17 '21 at 22:51
For your processing in the background comment, are you referring to asynchronous php? — Vahe, Jan 17 '21 at 22:56
Hm. And you can't, I don't know, run another query and get the total from there? Otherwise yes, I see you're going to have problems there. Also it might be awkward to cope with changes in either PDF or Excel format. — LSerni, Jan 17 '21 at 22:56
I anticipate the last dollar value is my desired extractable value needed for both pdf and xlsx. It is the total and there are no other dollar amounts that follow that value in the file content string. — Vahe, Jan 17 '21 at 22:57
That is dangerous, because the problem is that both in XLSX and PDF formats, the strings aren't exactly where, or how, you'd expect. But maybe I can add a cheat to the answer :-) — LSerni, Jan 17 '21 at 23:06
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227470/discussion-between-vahe-jabagchourian-and-lserni). — Vahe, Jan 18 '21 at 06:30

PHP - parsing contents of excel/pdf file already retrieved and stored in variable, without having to save contents to a file on disk

1 Answers1

Speed problem

Cheating