
I am new to Python, and I am trying to use regular expressions (RE) to turn some PDFs into a DataFrame (DF).

So far I have a list with this information:

list = ['9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85',
 '9076943 ADT 25mg 60comp 22CN061A T E1 059366 10 3,91 3,09 2,60 0,0 0,01 6 2,61 26,10',
 '3506888 Aerius 5mg 20comp W010992 T E1 094546 5 4,99 4,11 3,53 10,0 0,02 6 3,20 16,00',
 '9046755 Aldactone 25mg 60comp B28191 G E1 084399 10 5,42 4,51 3,90 22,0 0,02 6 3,06 30,60',
 '5282132 Aranka MG 3mg+0,03mg 63comp T21521A G E2 087961 5 8,22 6,51 5,47 12,5 0,03 6 4,82 24,10',
 '6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35',
 '5260542 Atorvastatina Azevedos MG 10mg 56comp 11400 T E1 094546 10 3,76 2,94 2,46 55,0 0,01 6 1,12 11,20',
 '5260559 Atorvastatina Azevedos MG 20mg 28comp 20515 T E1 059366 20 3,57 2,76 2,29 55,0 0,01 6 1,04 20,80',
 '5260575 Atorvastatina Azevedos MG 40mg 28comp 20516 T E1 059366 10 4,46 3,61 3,07 55,0 0,02 6 1,40 14,00',
 '5629506 Atozet 10mg+20mg 30comp W016401 N E5 093541 5 41,63 34,59 29,72 0,0 0,16 6 29,88 149,40',
 '7377390 Atyflor 10saq 124011 G NETT 087961 5 8,25 14,3 0,00 23 7,07 35,35',
 '2003093 Bebegel Gel Retal 6un 2206EA M NETT 024839 5 4,00 0,0 0,00 6 4,00 20,00',
 '8435701 Betadine Solucao Cutanea 125ml 326893 M NETT 084780 10 4,20 0,0 0,00 6 4,20 42,00',
 '2869584 Betamox Plus 875mg+125mg 16comp R017905R T E1 093541 30 6,34 5,39 4,71 60,0 0,02 6 1,90 57,00',
 '8184812 Betnovate 1mg/g Pomada 30g S63C N E1 022851 5 3,46 2,66 2,20 0,0 0,01 6 2,21 11,05',
 '2184992 Biloban 40mg 60comp rev R002315R T E2 059366 10 9,57 7,44 6,32 10,0 0,04 6 5,73 57,30',
 '5065487 Bisoprolol Sandoz MG 5mg 56comp LX8098 N E1 022851 5 5,01 4,13 3,55 0,0 0,02 6 3,57 17,85',
 '5138276 Buprenorfina Azevedos MG 8mg 7comp (P) 22E16 T E3 087485 30 11,15 8,83 7,42 5,0 0,04 6 7,09 212,70',
 '3126489 Calcitab 1500mg 60comp EQ22502 N E1 054786 5 6,29 5,34 4,66 0,0 0,02 6 4,68 23,40',
 '9729509 Cartia 100mg 28comp 20015 G E1 022851 30 5,41 4,13 3,55 45,0 0,02 6 1,97 59,10',
 '5037288 Ciprofloxacina Azevedos MG 500mg 16comp 11496 T E3 054786 5 10,87 8,57 7,18 70,0 0,04 6 2,19 10,95',
 '5273487 Co-Diovan Forte 160mg+25mg 28comp TRM93 N E2 022851 5 8,10 6,40 5,37 0,0 0,03 6 5,40 27,00',
 '8287607 Cordarone 200 mg x 60 Comprimidos 2R362 N E3 022851 5 11,36 9,03 7,61 0,0 0,04 6 7,65 38,25',
 '5440284 Coversyl 5mg 30comp rev 711191 N E1 022851 10 6,47 5,52 4,83 0,0 0,02 6 4,85 48,50',
 '5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95']

I want to grab the description from every line. It starts at index 8 (after the 7 digits and one space) and stops at the space before the standalone letter, which can be T, N, G, or M.

Example: 5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95

  • Cozaar Plus 100mg + 12,5mg 28comp W020945 or, better, Cozaar Plus 100mg + 12,5mg 28comp

-> W020945 is the lot information, but it is not standardized across lines.

I tried something like this:

description_re = re.compile(r'\d{7}\s[A-Za-z]+\s[TNGM]$')

but it doesn't work.

Thanks

Fra93
foliveir
  • Your input has a lot of variability. I noticed a pattern: tablets have `mg` followed by `comp`, while liquids have `ml`. So I could write a regexp that stops at `ml` or `comp`, matches one last alphanumeric code (the lot info), and then returns. But some lines still don't match, such as "5440284 Coversyl 5mg 30comp rev", where the `rev` breaks it. Can I add that as an exception, or should I expect any possible random word before the lot number? – Fra93 Oct 10 '22 at 11:52
  • Hi Fra93, you should expect any random word before the lot number, like caps, orod, etc. – foliveir Oct 10 '22 at 12:17

3 Answers

Using a positive lookbehind and a positive lookahead will help you out:

(?<=\d{7} ).*?(?= \w+ [TNGM] )

regex101
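
A minimal sketch of how this pattern might be applied from Python, assuming the standard `re` module and three lines copied from the question (the names `pattern`, `lines`, and `descriptions` are made up for illustration). `re.search` is used because the lookbehind can only succeed at index 8, not at the start of the string. Note that `\w` does not match a hyphen, so a lot code such as `S-02` cannot be consumed by `\w+` and that line gets no match.

```python
import re

# Pattern from this answer: the text after "7 digits + space",
# up to (but not including) " <lot> <T|N|G|M> ".
pattern = re.compile(r"(?<=\d{7} ).*?(?= \w+ [TNGM] )")

# Three lines copied from the question (illustrative subset).
lines = [
    "9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85",
    "2184992 Biloban 40mg 60comp rev R002315R T E2 059366 10 9,57 7,44 6,32 10,0 0,04 6 5,73 57,30",
    "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35",
]

descriptions = []
for line in lines:
    m = pattern.search(line)  # search(), not match(): the match starts at index 8
    descriptions.append(m.group(0) if m else None)

print(descriptions)
# ['ADT 10mg 60comp', 'Biloban 40mg 60comp rev', None]
```

The `None` for the third line comes from the hyphen in `S-02`; replacing `\w+` with `\S+` is one possible way around it.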

VvdL
  • This still leaves behind one line: `6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35` even though I don't understand why. The lookahead should match ` G NETT` with ` \w+ [TNGM]` – Fra93 Oct 10 '22 at 12:00

You can use a capture group and word boundaries:

\b\d{7}\s(.*?)\s[TNGM]\b

Explanation

  • \b\d{7} A word boundary, then match 7 digits
  • \s(.*?)\s Capture in group 1 as few characters as possible between two whitespace chars
  • [TNGM]\b Match one of the listed characters, followed by a word boundary

Regex demo
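
A short sketch of reading the capture group from Python, assuming the standard `re` module and three lines copied from the question (`pattern`, `lines`, `captured`, and `without_lot` are illustrative names). Group 1 includes the lot code. The last step drops it by cutting off the final whitespace-separated token, which only works under the assumption that the lot code is always a single token with no spaces.

```python
import re

# Pattern from this answer; group 1 holds the description plus the lot code.
pattern = re.compile(r"\b\d{7}\s(.*?)\s[TNGM]\b")

# Three lines copied from the question (illustrative subset).
lines = [
    "5627781 Cozaar Plus 100mg + 12,5mg 28comp W020945 T E2 054786 5 7,69 6,01 5,01 9,0 0,03 6 4,59 22,95",
    "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35",
    "5282132 Aranka MG 3mg+0,03mg 63comp T21521A G E2 087961 5 8,22 6,51 5,47 12,5 0,03 6 4,82 24,10",
]

captured = []
for line in lines:
    m = pattern.search(line)
    if m:
        captured.append(m.group(1))

# Assumption: the lot code is the last token, so strip it with rsplit.
without_lot = [text.rsplit(" ", 1)[0] for text in captured]

print(captured)
print(without_lot)
```

With these three lines, `captured` ends in `W020945`, `S-02`, and `T21521A` respectively, and `without_lot` holds the descriptions alone.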

The fourth bird

I modified the regex from answer 1. Here is the pattern:

(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)

And demo
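
As a rough illustration, assuming the standard `re` module and two lines copied from the question (`pattern`, `lines`, and `results` are illustrative names), the sketch below shows the effect of using `\S+` for the lot code: any run of non-whitespace characters is accepted, so a hyphenated lot such as `S-02` is handled.

```python
import re

# Pattern from this answer: \S+ accepts any non-space lot code, hyphens included.
pattern = re.compile(r"(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)")

# Two lines copied from the question (illustrative subset).
lines = [
    "6589168 Arnidol Gel Stick 15ml S-02 G NETT 054786 5 5,80 16,0 0,00 23 4,87 24,35",
    "8287607 Cordarone 200 mg x 60 Comprimidos 2R362 N E3 022851 5 11,36 9,03 7,61 0,0 0,04 6 7,65 38,25",
]

results = []
for line in lines:
    m = pattern.search(line)
    results.append(m.group(0) if m else None)

print(results)
# ['Arnidol Gel Stick 15ml', 'Cordarone 200 mg x 60 Comprimidos']
```

Because `.*` is greedy here, the match extends to the last place in the line where the lookahead succeeds.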

M..
  • Hi, I tried your code but it doesn't print anything. Can you help me by telling me what's wrong? `descricao_fatura_re = re.compile(r"(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)")` `for linha in fatura_completa: if (descricao_fatura_re.match(linha)): print(descricao_fatura_re.match(linha))` – foliveir Oct 10 '22 at 13:08
  • @foliveir If you are using Python, I think you should use the `search` function, not `match`, because `match` only matches at the start of the string. – M.. Oct 11 '22 at 09:12
  • Thanks, I am trying to use Python, but I am just getting started – foliveir Oct 11 '22 at 18:58
  • @foliveir That's okay~ I searched it on https://docs.python.org/3/library/re.html just after you asked me. – M.. Oct 12 '22 at 08:01
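
The `match` versus `search` difference discussed in these comments can be sketched as follows, assuming the standard `re` module and one line copied from the question (`pattern`, `line`, `at_start`, and `anywhere` are illustrative names). `match` only attempts the pattern at index 0, where the lookbehind for 7 digits cannot succeed, while `search` scans forward through the string.

```python
import re

pattern = re.compile(r"(?<=\d{7}\s).*(?=\s\S+\s[TNGM]\s)")
line = "9076968 ADT 10mg 60comp 22CN014A T E1 059366 5 2,72 1,97 1,56 0,0 0,01 6 1,57 7,85"

# match() anchors at index 0; nothing precedes it, so the lookbehind fails.
at_start = pattern.match(line)

# search() tries successive positions and succeeds at index 8.
anywhere = pattern.search(line)

print(at_start)           # None
print(anywhere.group(0))  # ADT 10mg 60comp
```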