0

I have a plain text like this:

Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y

And I need to split the cart IDs from that text, and end up in something like this:

A3N42M
A3N42P
A3N3ZW
...
E3M49N
...

These IDs are always 6-digit length, and always starts with A, E or P (AXXXXX, EXXXXX, PXXXXX, etc...).

Is there any way (using any script or any program language) that I can achieve this?

This plain text is currently on a spreadsheet table, and I need to separate these IDs to use in a SQL query for later, thanks!

tmthydvnprt
  • 10,398
  • 8
  • 52
  • 72
ryoishikawa74
  • 177
  • 3
  • 11
  • Almost every language could perform this if it had too, what language do you know or would you prefer to work in? – tmthydvnprt Jul 26 '16 at 14:33
  • If you use python look at [`split`](https://docs.python.org/2/library/stdtypes.html#str.split) or [`regex`](https://docs.python.org/2/library/re.html#re.split) – tmthydvnprt Jul 26 '16 at 14:35
  • Actually I just have a very basic program language knowledge, little bit of Java ... but not enough to try to achieve this – ryoishikawa74 Jul 26 '16 at 14:36
  • You could also use a text editor with regex find and replace to perform splits. – tmthydvnprt Jul 26 '16 at 14:38
  • what text editor do you use? – tmthydvnprt Jul 26 '16 at 15:31
  • I use Notepad++, I saw some tips on the internet on how to do it bookmarking lines etc, but I still did not manage to make it work, but I'll use the python code below that a fellow posted, thanks a lot buddy :-) – ryoishikawa74 Jul 26 '16 at 15:38
  • When you say that the plain text is currently in a spreadsheet, do you mean that all of that text is in one cell? Or is it spanned across multiple cells. – dustinroepsch Jul 26 '16 at 15:38
  • Notepad++ has regex – tmthydvnprt Jul 26 '16 at 15:40
  • @DustinRyan-Roepsch It's all in the same cell ... I could do a simple `=SPLIT` and use space as delimiter, but it would also give me work in order to delete the cells that does not have the content I want. – ryoishikawa74 Jul 26 '16 at 19:35

1 Answers1

1

Regex Data-Munging


You can perform this via regex substitutions (or find and replace). I happened to do this in python, but you could use the same regex patterns ('^[^/]*?: ', '/.*') in any language or text editor that supports them.

Regex Explanation

  1. '^[^/]*?: ' - start and beginning of string (first ^), match multiple non-\ characters in a non greedy way ([^/]*?]), a :, then a .
  2. '/.*' - match all /, then multiple any character (.)

Processing Example (in Python)

import re

text = '''
Cart ID: A3N42M / Copy: A3N42P PO: 5000021337 Invoice: 3110021337
Cart ID: A3N3ZW / Copy: A3N3ZX/ PO: 5000021335 Invoice: 3110021335 
Cart ID: A3N3ZL / Copy: A3N3ZM PO: 5000021336 Invoice: 3110021336 
Original: A3N444 / Copy: A3N445 PO: 5000021340 Invoice: 3130021340
Original: A3N44C / Copy: A3N44D PO: 5000021341 Invoice: 3130021341
Original: A3N44G / Copy: A3N44H PO: 5000021342 Invoice: 3130021342
Cart ID: A3N3ZZ / Copy: A3N428 PO: A3N3ZZ01
Cart ID: A3N3ZQ / Copy: A3N3ZV PO: A3N3ZQ01
Cart ID: A3N336 / Copy: A3N337 PO: A3N33601
E3M49D / E3M49Q - PO: 4620028049
E3M49N / E3M49X
E3M49P / E3M49Y
'''

text = re.sub('^[^/]*?: ([]*?)', '', text, flags=re.MULTILINE)
text = re.sub('/.*', '', text)

print text

A3N42M 
A3N3ZW 
A3N3ZL 
A3N444 
A3N44C 
A3N44G 
A3N3ZZ 
A3N3ZQ 
A3N336 
E3M49D 
E3M49N 
E3M49P

EDIT


Updated Regex Explanation

Updated regex per author's request.

  1. match any character (.*?) (non-greedy), followed by a capture group of ((...)) of a character class ([AEP]) followed by at least one character class ([0-9]+) followed by four word characters (\w{4}), followed by another capture group of any character ((.*))
  2. All of this is matched and replaced with the capture group variables with a newline in between (\1\n\2), essentially splitting rows where IDs occur twice
  3. Repeat steps with only the first capture group to handle the newlines containing the second ID

Updated Processing

text = re.sub(r'.*?([AEP][0-9]+\w{4})(.*)', r'\1\n\2', text, flags=re.MULTILINE)
text = re.sub(r'.*?([AEP][0-9]+\w{4}).*', r'\1', text, flags=re.MULTILINE)

print text

A3N42M
A3N42P
A3N3ZW
A3N3ZX
A3N3ZL
A3N3ZM
A3N444
A3N445
A3N44C
A3N44D
A3N44G
A3N44H
A3N3ZZ
A3N428
A3N3ZQ
A3N3ZV
A3N336
A3N337
E3M49D
E3M49Q
E3M49N
E3M49X
E3M49P
E3M49Y
Community
  • 1
  • 1
tmthydvnprt
  • 10,398
  • 8
  • 52
  • 72
  • Hello, I installed Python and tried to run the .py file using the command prompt, but I got this: `SyntaxError: Missing parentheses in call to 'print'`, am I doing anything wrong? Sorry never played with Python before, I just ran: `python file.py` in the command prompt. – ryoishikawa74 Jul 26 '16 at 15:50
  • If you have Python 3 use `print(text)` – tmthydvnprt Jul 26 '16 at 15:52
  • Thanks! it almost did the trick, just some IDs missing after the `/` in the text, as you can see in this [Image](http://imgur.com/a/wSWLP) – ryoishikawa74 Jul 26 '16 at 16:00
  • Oh assumed you only wanted the first IDs. You will need to change your `regex` patterns to handle that. But now you have a recipe to start with. Have a look at [regex101.com](http://www.regex101.com) to play with developing useful patterns. – tmthydvnprt Jul 26 '16 at 16:34
  • 1
    Thanks for the link, I did a little research and I could come up with a regex that found all my cart IDs in the list: `(a|e|p)[0-9]{1,}\w{4}` -- [IMAGE](http://imgur.com/a/WcImY) -- Now, is it possible in that python code to use this regex and delete everything that does not match it? – ryoishikawa74 Jul 26 '16 at 17:28
  • Remember to accept the answer and up vote if you approve, [we both get rep! (you get +2 reputation for accepting!)](http://meta.stackexchange.com/a/5235/292533) – tmthydvnprt Jul 28 '16 at 04:36
  • 1
    Thank you very much! Worked like a charm, I accepted your post as the answer, I just cannot up vote due to my reputation, sorry for the delay! :-) – ryoishikawa74 Aug 01 '16 at 19:00