What I have is an OCR'd Walmart receipt in a Google Document (Walmart allows you to email to yourself a .jpg version of your receipt, and this image can be opened with Google Docs, during which it applies OCR to extract text. The result is excellent with very few errors.)
Here is a link to the actual receipt from the OCR: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing
I have OCR'd hundreds of Walmart receipts using Google Docs. The OCR'd document suffers from some formatting issues, so I have written some code to help regularize it, as a part of a larger goal to get all of my receipts into a database.
While I am able to solve many of the formatting replacements, I am stuck trying to replace the percent sign that comes after "TAX 1" and "TAX 2" with a tab character, so that I can then push down the "TAX 2" entry to a new line. I noticed that the %
sign will always be followed by a newline character and then the actual numerical value of the tax (for both "TAX 1" and "TAX 2") on the next line:
Example OCR Text:
SUBTOTAL 126.61 TAX 1 6.750 %
7.78 TAX 2 2.000 %
0.23 TOTAL 134.62
Desired Output Text:
SUBTOTAL 126.61
TAX 1 6.750 % 7.78
TAX 2 2.000 % 0.23
TOTAL 134.62
Objective:
- Each (SUBTOTAL, TAX 1, TAX 2, and TOTAL) gets a new line. (this works)
- There should be a tab after each (SUBTOTAL, TAX 1, TAX 2, and TOTAL) so that the numeric value for each is a tab-stop away. (this works)
- I would like to replace the (space+percent sign+newline character) with just a percent sign and a tab (thinking the 7.78 should "rise" one line up as the newline character is taken out (this is what is failing).
I can do this using the CTRL-F "Find and Replace" menu in the Google Docs UI, using regex options without any problem, but I cant write an Apps Script function to do the same. I have searched everywhere. I realize that the RegEx in GAS is limited. But I don't know enough to know if that is my problem AND what a workaround could be. Likewise, I don't know enough of RegEx to know if the limited version GAS supports is not the cause of my problem (e.g. am I overlooking something).
Here's the code excerpt I use for formatting:
var body = DocumentApp.getActiveDocument().getBody();
/**
* other formatting stuff
*/
//Find SUBTOTAL, remove the space before SUBTOTAL and move it down one line.
body.replaceText(' SUBTOTAL', '\n\nSUBTOTAL\t');
//Find TAX 1, remove the space before TAX 1 and move it down one line.
body.replaceText(' TAX 1', '\nTAX 1\t');
//Find TAX 2, remove the space before TAX 2 and move it down one line.
body.replaceText(' TAX 2', '\nTAX 2\t');
//Find TOTAL, replace it.
body.replaceText('TOTAL', '\nTOTAL\t');
//Find PERCENT SIGN AND ADD A NEWLINE AFTER IT, replace it all with a tab character.
body.replaceText("[ %\n]","\t");
The first 4 replaces work great. Its the last one (the percent sign) that doesnt work. I've tried to escape that percent sign like this:
body.replaceText("[ \%\n]","\t");
and
body.replaceText("[ \\%\n]","\t");
I've tried to remove the braces like this:
body.replaceText(" \%\n","\t");
and
body.replaceText(" \\%\n","\t");
But each gives different results, frankly - messing up the entire receipt text badly.
So the percent sign is the problem - I think.
How can I fix the formatting for the "TAX 1" and "TAX 2" lines?
Example fulltext OCR'd receipt: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing