2

I am using Google Docs to open Walmart receipts that I email to myself. The Walmart store that I use 99.9% of the time seems to have made some firmware update to the Ingenico POS terminal that makes it display a running SUBTOTAL after each item is identified by the scanner. Here are some images to support my question..

The POS terminal looks like this:

enter image description here

Second image is the is the electronic receipt which I email myself from their IOS app. It is presumably taken from the POS terminal because it has the extra running SUBTOTAL lines after each item like the POS terminal screen shows. It has been doing this for a few months and I've been given no reason to believe, by management, that it will be corrected any time soon.

enter image description here

The final image is my actual paper receipt. This is printed from the register, its the one that you walk out with it and show the greeter/exit person to check your buggy and the items you've purchased.

Note that it does not show the extra SUBTOTAL.

enter image description here

I open the electronic receipt in a Google Document and their automatic OCR spits out the text of the receipt. It does a pretty darn good job, I'd say its 95%+ accurate with these receipts. I apply a very crude little regex that reformats these electronic receipts so that I can enter them into a database and use that data for my family's budgeting, taxes, and so forth. That has been working very well for me, albeit I would like to further automate that process but thats for a different question some day perhaps.

Right now, that little crude regex no longer formats the receipt into something usable for me.

What I would like to do is to remove the extra SUBTOTALS from the (broken) electronic receipt but leave the last SUBTOTAL alone. I highlighted the last SUBTOTAL on the receipt, which is always there, and should remain.

I have seen two other questions that are similar but I could not apply them to my situation. One of them was: Remove all occurrences except the last one

What have I tried?

The following regex works in the online tester at regex101.com:

\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})

It took me a while to come up with that regex from searching around but essentially I want it to find all of the SUBTOTAL literals with a preceding new-line and any decimal number amount from 0.01 to 999.99) and I just want to replace what that finds with a new-line and then I can allow my other regex creation to work on that like it used to before the firmware update to the POS terminal.

The regex correctly identifies every SUBTOTAL (including the last one) on the regex101.com site. I can apply a substitution of "\n" and I am back to seeing the receipt data I can work with but there were two issues:

1) I cant replicate this using Google Apps Script. Here is my example:

function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText()
    .match('\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})')[1]
    .replace(/%/mgi, "%\n");
  body.clear();
  body.setText(newText);
}

2) If I were to get the above code to work, I still have the issue of wanting to leave the last SUBTOTAL intact.

Here is a Google Doc that I have set up to experiment with: https://docs.google.com/document/d/11bOJp2rmWJkvPG1FCAGsQ_n7MqTmsEdhDQtDXDY-52s/edit?usp=sharing

ONDEV
  • 566
  • 3
  • 7
  • 15
  • 4
    Removing all but last occurrence is easy with `text.replace(/pattern(?=[\s\S]*pattern)/g, '')` – Wiktor Stribiżew Dec 19 '19 at 21:57
  • 1
    Can I ask you about the result you expect using Google Apps Script? – Tanaike Dec 20 '19 at 00:55
  • Hi @Tanaike I want to remove all of the SUBTOTAL lines, including that amount next to it - I should have mentioned that(!) But leave intact the last SUBTOTAL + decimal amount. So, the 2nd image with all those SUBTOTAL lines should be made to look like the 3rd image with only one SUBTOTAL line So in the case above, the only SUBTOTAL line that should remain is the "SUBTOTAL 80.17" – ONDEV Dec 20 '19 at 01:13
  • 1
    Thank you for replying. Unfortunately, I cannot still see the result you expect. I apologize for my poor English skill. In order to correctly understand about your goal, can you include the result you expect in your shared Google Document? – Tanaike Dec 20 '19 at 01:18
  • @Tanaike sure, not a problem. Working on it now – ONDEV Dec 20 '19 at 01:20
  • @Tanaike I have completed the changes you requested. – ONDEV Dec 20 '19 at 01:28
  • @WiktorStribiżew you're suggestion does work in regex101.com (https://regex101.com/r/UgHg1U/1) but for some reason I can not implement it in Google Apps Script in a function: function tryWiktor() { var body = DocumentApp.getActiveDocument().getBody(); var newText = body.getText() newText.replace(/\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2})(?=[\s\S]*\nSUBTOTAL\t\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{2}))/g, '') body.clear(); body.setText(newText); } – ONDEV Dec 20 '19 at 01:31
  • 1
    Thank you for replying. It seems that the sample value of `https://regex101.com/r/UgHg1U/1` is different from your shared Google Document are different. When the value of your shared Google Document is used for `https://regex101.com/r/UgHg1U/1`, can you get the same result? – Tanaike Dec 20 '19 at 02:01
  • 1
    Yes, if the regex works online, there may be two reasons: 1) the regex can't "jump through" the line break (across paragraphs), 2) you have a different text in the document than the one you test against in the online tester. – Wiktor Stribiżew Dec 20 '19 at 10:32
  • @Tanaike the https://regex101.com/r/UgHg1U/1/ example was the example that Wiktor Stribiżew provided. To answer the both of you, whatever works in regex101.com doesnt seem to work in the Google Document and Wiktor believes it may be due to the line break so I am going to investigate that. – ONDEV Dec 20 '19 at 11:49

1 Answers1

3

I use this regular expresion.

// JavaScript Syntax
'/\nSUBTOTAL\s\d{1,3}\.\d{2}| SUBTOTAL\n\d{1,3}\.\d{2}/g'

Also I make a script for google docs. You can use this Google Doc and see the results.

function deleting_subs() {
  var body = DocumentApp.getActiveDocument().getBody();
  var newText = body.getText();

  var out = newText.replace(/\nSUBTOTAL\s\d{1,3}\.\d{2}|` SUBTOTAL\n\d{1,3}\.\d{2}/g, '');

  // This is need to become more readable the resulting text.
  out = out.replace(/R /g, 'R\n');

  body.clear();
  body.setText(out);
}

To execute the script, open the google doc file and click on:

  • Add ons.
  • Del_subs -> Deleting Subs.

Tip: After execute the complement/add on (Deleting Subs), undo the document edition, in that way other users can return to previous version of the text.

Hope this help to you.

Franco Gil
  • 323
  • 3
  • 11
  • 1
    The regex example it is here: https://regex101.com/r/3cld6B/2 – Franco Gil Dec 20 '19 at 14:33
  • Thank you! This seems to do the trick, I will need to study that regex and will incorporate it into my existing one that formats the data for database loading. – ONDEV Dec 20 '19 at 15:06