Regular expressions python - get only the description

Question

I am a newbie in Python, and I am trying to use regular expressions (RE) to turn some PDF files into a DataFrame (DF).

So far, I have a list with this information:

list = ['9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85',
 '9076943 ADT 25mg 60comp 22CN061A T E1 059366 10 3,91 3,09 2,60 0,0 0,01 6 2,61 26,10',
 '3506888 Aerius 5mg 20comp W010992 T E1 094546 5 4,99 4,11 3,53 10,0 0,02 6 3,20 16,00',
 '9046755 Aldactone 25mg 60comp B28191 G E1 084399 10 5,42 4,51 3,90 22,0 0,02 6 3,06 30,60',
 '5282132 Aranka MG 3mg+0,03mg 63comp T21521A G E2 087961 5 8,22 6,51 5,47 12,5 0,03 6 4,82 24,10',
 '6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35',
 '5260542 Atorvastatina Azevedos MG 10mg 56comp 11400 T E1 094546 10 3,76 2,94 2,46 55,0 0,01 6 1,12 11,20',
 '5260559 Atorvastatina Azevedos MG 20mg 28comp 20515 T E1 059366 20 3,57 2,76 2,29 55,0 0,01 6 1,04 20,80',
 '5260575 Atorvastatina Azevedos MG 40mg 28comp 20516 T E1 059366 10 4,46 3,61 3,07 55,0 0,02 6 1,40 14,00',
 '5629506 Atozet 10mg+20mg 30comp W016401 N E5 093541 5 41,63 34,59 29,72 0,0 0,16 6 29,88 149,40',
 '7377390 Atyflor 10saq 124011 G NETT 087961 5 8,25 14,3 0,00 23 7,07 35,35',
 '2003093 Bebegel Gel Retal 6un 2206EA M NETT 024839 5 4,00 0,0 0,00 6 4,00 20,00',
 '8435701 Betadine Solucao Cutanea 125ml 326893 M NETT 084780 10 4,20 0,0 0,00 6 4,20 42,00',
 '2869584 Betamox Plus 875mg+125mg 16comp R017905R T E1 093541 30 6,34 5,39 4,71 60,0 0,02 6 1,90 57,00',
 '8184812 Betnovate 1mg/g Pomada 30g S63C N E1 022851 5 3,46 2,66 2,20 0,0 0,01 6 2,21 11,05',
 '2184992 Biloban 40mg 60comp rev R002315R T E2 059366 10 9,57 7,44 6,32 10,0 0,04 6 5,73 57,30',
 '5065487 Bisoprolol Sandoz MG 5mg 56comp LX8098 N E1 022851 5 5,01 4,13 3,55 0,0 0,02 6 3,57 17,85',
 '5138276 Buprenorfina Azevedos MG 8mg 7comp (P) 22E16 T E3 087485 30 11,15 8,83 7,42 5,0 0,04 6 7,09 212,70',
 '3126489 Calcitab 1500mg 60comp EQ22502 N E1 054786 5 6,29 5,34 4,66 0,0 0,02 6 4,68 23,40',
 '9729509 Cartia 100mg 28comp 20015 G E1 022851 30 5,41 4,13 3,55 45,0 0,02 6 1,97 59,10',
 '5037288 Ciprofloxacina Azevedos MG 500mg 16comp 11496 T E3 054786 5 10,87 8,57 7,18 70,0 0,04 6 2,19 10,95',
 '5273487 Co-Diovan Forte 160mg+25mg 28comp TRM93 N E2 022851 5 8,10 6,40 5,37 0,0 0,03 6 5,40 27,00',
 '8287607 Cordarone 200 mg x 60 Comprimidos 2R362 N E3 022851 5 11,36 9,03 7,61 0,0 0,04 6 7,65 38,25',
 '5440284 Coversyl 5mg 30comp rev 711191 N E1 022851 10 6,47 5,52 4,83 0,0 0,02 6 4,85 48,50',
 '5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95']

I want to grab the description from every line: it starts at index 8 (after the 7 digits plus one space) and stops at the space before the single letter that can be T, N, G, or M.

Example: 5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95

Cozaar Plus 100mg + 12,5mg 28comp W020945 or better Cozaar Plus 100mg + 12,5mg 28comp

-> W020945 is the lot information, but it does not follow a standard format on every line

I tried something like this:

description_re = re.compile(r'\d{7}\s[A-Za-z]+\s[TNGM]$') but it doesn't work.

Thanks

Your input has a lot of variability. I noticed a pattern: tablets have `mg` followed by `comp`, while liquids have `ml`. So I could write a regexp that stops at `ml` or `comp`, matches one last alphanumeric code (the lot info), and then returns. But some lines still don't match, such as "5440284 Coversyl 5mg 30comp rev", where the `rev` breaks it. Can I add that as an exception, or should I expect any possible random word before the lot number? – Fra93, Oct 10 '22 at 11:52
Hi Fra93, you should expect any random word before the lot number, like caps, orod, etc. – foliveir, Oct 10 '22 at 12:17

score 0 · Answer 1 · answered Oct 10 '22 at 11:54

Positive lookbehinds and lookaheads will help you out:

(?<=\d{7} ).*?(?= \w+ [TNGM] )

regex101
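
A minimal sketch of how this pattern might be applied in Python, assuming each line is a plain string like those in the question. The names `DESCRIPTION_RE`, `extract_description`, and `sample_lines` are illustrative and not part of the original post. Note that `\w+` does not match a hyphen, so a line whose lot contains one (for example `S-02`) produces no match with this pattern.

```python
import re

# Pattern from this answer: the text after the 7-digit code,
# stopping before " <lot> <T|N|G|M> ".
DESCRIPTION_RE = re.compile(r"(?<=\d{7} ).*?(?= \w+ [TNGM] )")


def extract_description(line):
    """Return the description, or None when the pattern finds nothing."""
    found = DESCRIPTION_RE.search(line)
    return found.group(0) if found else None


# Three lines copied from the question's list.
sample_lines = [
    "9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85",
    "5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95",
    "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35",
]

for line in sample_lines:
    print(extract_description(line))
```

With these three lines, the first two yield the description without the lot, and the third (lot `S-02`) yields `None`.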

answered Oct 10 '22 at 11:54

VvdL

This still leaves behind one line: `6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35` even though I don't understand why. The lookahead should match ` G NETT` with ` \w+ [TNGM]` – Fra93 Oct 10 '22 at 12:00

The fourth bird · Answer 2 · 2022-10-10T12:48:59.990

You can use a capture group and word boundaries:

\b\d{7}\s(.*?)\s[TNGM]\b

Explanation

\b\d{7} A word boundary, then match 7 digits
\s(.*?)\s Capture as few characters as possible between two whitespace characters
[TNGM]\b Match one of the listed characters, followed by a word boundary

Regex demo
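
A short sketch of reading the capture group in Python, assuming the pattern shown above and lines shaped like those in the question. `LINE_RE`, `description_with_lot`, and `examples` are hypothetical names chosen for illustration. With this pattern the captured text still includes the lot code, since everything up to the single flag letter is captured.

```python
import re

# Pattern from this answer; group 1 holds everything between
# the 7-digit code and the single flag letter (T, N, G or M).
LINE_RE = re.compile(r"\b\d{7}\s(.*?)\s[TNGM]\b")


def description_with_lot(line):
    """Return the captured description (lot included), or None if no match."""
    found = LINE_RE.search(line)
    return found.group(1) if found else None


# Two lines copied from the question's list.
examples = [
    "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35",
    "5260542 Atorvastatina Azevedos MG 10mg 56comp 11400 T E1 094546 10 3,76 2,94 2,46 55,0 0,01 6 1,12 11,20",
]

for line in examples:
    print(description_with_lot(line))
```

The trailing `\b` is what keeps the `M` of `MG` or the `G` of `Gel` from being taken as the flag letter, because those letters are followed by another word character.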

edited Oct 10 '22 at 12:48

answered Oct 10 '22 at 12:15

The fourth bird

score 0 · Answer 3 · answered Oct 10 '22 at 12:27

I modified the regex from Answer 1. Here is the pattern:

(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)

And demo
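
A minimal Python sketch of using this pattern, assuming a single line taken from the question. `DESCRICAO_RE`, `linha`, `matched`, and `searched` are illustrative names. It also shows why `re.search` is the call to reach for here: the pattern begins with a lookbehind, which has nothing to inspect at index 0, so `re.match` returns `None`.

```python
import re

# Pattern from this answer: \S+ lets the lot contain any non-space
# characters, including the hyphen in "S-02".
DESCRICAO_RE = re.compile(r"(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)")

linha = "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35"

# match() only tries index 0, where the lookbehind cannot succeed.
matched = DESCRICAO_RE.match(linha)
print(matched)

# search() scans forward and finds the description after the 7-digit code.
searched = DESCRICAO_RE.search(linha)
print(searched.group(0) if searched else None)
```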

answered Oct 10 '22 at 12:27

M..

Hi, I tried your code but it doesn't print anything. Can you help me by saying what's wrong? `descricao_fatura_re = re.compile(r"(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)")` `for linha in fatura_completa:` `if descricao_fatura_re.match(linha):` `print(descricao_fatura_re.match(linha))` – foliveir Oct 10 '22 at 13:08
@foliveir If you use Python, I think you should use the function named `search`, not `match`, because `match` only matches from the start of the string. – M.. Oct 11 '22 at 09:12
Thanks, I am trying to use Python, but I am just beginning. – foliveir Oct 11 '22 at 18:58
@foliveir That's okay~ I looked it up at https://docs.python.org/3/library/re.html just after you asked me. – M.. Oct 12 '22 at 08:01
