String replacement in OCR'd purchase receipts

Question

What I have is an OCR'd Walmart receipt in a Google Document (Walmart allows you to email to yourself a .jpg version of your receipt, and this image can be opened with Google Docs, during which it applies OCR to extract text. The result is excellent with very few errors.)

Here is a link to the actual receipt from the OCR: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing

I have OCR'd hundreds of Walmart receipts using Google Docs. The OCR'd document suffers from some formatting issues, so I have written some code to help regularize it, as a part of a larger goal to get all of my receipts into a database.

While I am able to solve many of the formatting replacements, I am stuck trying to replace the percent sign that comes after "TAX 1" and "TAX 2" with a tab character, so that I can then push down the "TAX 2" entry to a new line. I noticed that the % sign will always be followed by a newline character and then the actual numerical value of the tax (for both "TAX 1" and "TAX 2") on the next line:

Example OCR Text:

SUBTOTAL 126.61 TAX 1 6.750 %
7.78 TAX 2 2.000 %
0.23 TOTAL 134.62

Desired Output Text:

SUBTOTAL    126.61 
TAX 1 6.750 %   7.78
TAX 2 2.000 %   0.23
TOTAL   134.62

Objective:

Each (SUBTOTAL, TAX 1, TAX 2, and TOTAL) gets a new line. (this works)
There should be a tab after each (SUBTOTAL, TAX 1, TAX 2, and TOTAL) so that the numeric value for each is a tab-stop away. (this works)
I would like to replace the (space+percent sign+newline character) with just a percent sign and a tab (thinking the 7.78 should "rise" one line up as the newline character is taken out (this is what is failing).

I can do this using the CTRL-F "Find and Replace" menu in the Google Docs UI, using regex options without any problem, but I cant write an Apps Script function to do the same. I have searched everywhere. I realize that the RegEx in GAS is limited. But I don't know enough to know if that is my problem AND what a workaround could be. Likewise, I don't know enough of RegEx to know if the limited version GAS supports is not the cause of my problem (e.g. am I overlooking something).

Here's the code excerpt I use for formatting:

var body = DocumentApp.getActiveDocument().getBody();
/**
 * other formatting stuff
 */

//Find SUBTOTAL, remove the space before SUBTOTAL and move it down one line.
body.replaceText(' SUBTOTAL', '\n\nSUBTOTAL\t');

//Find TAX 1, remove the space before TAX 1 and move it down one line.
body.replaceText(' TAX 1', '\nTAX 1\t');

//Find TAX 2, remove the space before TAX 2 and move it down one line.
body.replaceText(' TAX 2', '\nTAX 2\t');

//Find TOTAL, replace it.
body.replaceText('TOTAL', '\nTOTAL\t');

//Find PERCENT SIGN AND ADD A NEWLINE AFTER IT, replace it all with a tab character.
body.replaceText("[ %\n]","\t");

The first 4 replaces work great. Its the last one (the percent sign) that doesnt work. I've tried to escape that percent sign like this:

body.replaceText("[ \%\n]","\t");

and

body.replaceText("[ \\%\n]","\t");

I've tried to remove the braces like this:

body.replaceText(" \%\n","\t");

and

body.replaceText(" \\%\n","\t");

But each gives different results, frankly - messing up the entire receipt text badly.

So the percent sign is the problem - I think.

How can I fix the formatting for the "TAX 1" and "TAX 2" lines?

Example fulltext OCR'd receipt: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing

I'm not sure whether I can understand what you want. If this was not what you want, please tell me. If you want to retrieve ``Desired Output Text:`` from ``Example fulltext OCR'd receipt:``, how about this sample? ``var r = DocumentApp.getActiveDocument().getBody().getText().match(/(SUBTOTAL[\s\S]+?)VISA/)[1].replace(/TAX/g, "\nTAX").replace(/ TOTAL/g, "\nTOTAL\t").replace(/%\n/g, "%\t")`` In this sample, it supposes that ``Example fulltext OCR'd receipt:`` is in a Google Document. — Tanaike, Jul 19 '18 at 00:10
Hello Tanaike! Thank you! I tried your code got this error: TypeError: Cannot read property "1" from null. The crux of what I am trying to achieve is where I wrote: "(this is what is failing)." I guess my question is can I have the value 7.78 that is on the line after "TAX 1 6.750 %" be automatically brought up one line so that it is flush with / same line as "TAX 1 6.750 %" so that it looks like "TAX 1 6.750 % 7.78" ? If that can be made to happen then the TAX 2 value can be made to join the above line as well I guess... — ONDEV, Jul 19 '18 at 00:22
I'm really sorry for the inconvenience. I cannot understand the modification point because in my environment, it works. I did as follows. I copied and pasted ``Example fulltext OCR'd receipt:`` to a new Google Document, and the script is run at the container-bound script of the Document. In this case, no error occurred. If my understanding for your situation is not correct, can you provide the sample input and output you want? I'm really sorry for my poor English skill. — Tanaike, Jul 19 '18 at 01:05
Not a problem at all! I tried to copy that Example fulltext OCR'd receipt: also and it was no longer formatted properly - which may or may not be the problem so what I did was create shared doc for you and others to refer to: https://docs.google.com/document/d/1zSV09UGajna4DPtrHUrB6F82NugpYjaftMjomoKVXpE/edit?usp=sharing I placed my question into a comment for you to see in that document — ONDEV, Jul 19 '18 at 02:22
Thank you for sharing the sample. When I tried the script for your shared sample, no error occurs. And I got the result like ``Desired Output Text:``. Can I ask you about the flow you did for running the script? — Tanaike, Jul 19 '18 at 02:26
I carried out the following flow. 1. Open the shared document. 2. Open script editor. 3. Copy and paste the script in ``myFunction()``. 4. Add ``Logger.log(r)`` in ``myFunction()``. 5. Run ``myFunction()``. 6. Authorize the scopes. 7. See the result. — Tanaike, Jul 19 '18 at 02:31
It worked! Thank you! Now, my question is how to get the variable "r" into the text? I can see the change when I view the log but it doesnt show in the actual document. Is there some way to have it work in the document only? I tried to remove "var r = " and just run it as an action on the document text but no change was made (and no error was produced either). — ONDEV, Jul 19 '18 at 10:27

score 1 · Accepted Answer · answered Jul 19 '18 at 22:32

You want to replace the text of shared Document to the values you want. The values you want can be retrieved by the script in my comment. If my understanding is correct, how about this?

In the sample script in my comment, the whole text is retrieved and replace to the values you want. When this situation is reflected to the Document, how about this sample script? When you use this script, please do as follows.

Open the shared document.
Open script editor.
Copy and paste the sample script to the script editor.
Run myFunction().
Authorize the scopes.
See the Document.

Sample script :

function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText()
    .match(/(SUBTOTAL[\s\S]+?)VISA/)[1]
    .replace(/TAX/g, "\nTAX")
    .replace(/ TOTAL/g, "\nTOTAL\t")
    .replace(/%\n/g, "%\t");
  body.clear();
  body.setText(newText);
}

If this was not what you want, I'm sorry.

@ONDEV Welcome. I'm glad your issue was solved. Thank you, too. — Tanaike, Jul 19 '18 at 23:01

String replacement in OCR'd purchase receipts

1 Answers1

Sample script :