0

I have a string of account information with multiple accounts in the string (the example shows one line, where I actually have a text file with multiple lines of account data, so there is another loop going through each line in the text file in my code). I need to pull out each account into its own line. The code below works, but I assume there is a more efficient or better way to do it. I am just starting to learn Regex.

import re
import pandas as pd

allAccounts = []
example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  

rex = '[0-9]{1,2}-[0-9]{1,7}-[0-9]{1,2}'
accounts = re.findall(rex, example)
for account in accounts:       
    example= example.replace(account, f'||{account}')
       
example = [account.replace('  ', '|').split('|') for account in example.split('||')][1:]
allAccounts += example
df = pd.DataFrame(allAccounts)
df

from the regex portion of the code, I want it to return:

['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', ' 02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'] 
# or 
'||02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ||02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'

The code returns the following df which is what ultimately I want:

                0       1             2     3          4         5      6      7       8
0   02-0015800-00   NAME1   100 SOME ST Active  3/8/2021    139.23  139.81  0.42    
1   02-0023901-01   NAME2   101 SOME ST Active  3/8/2021    512.33  482.96  -5.73   None

But I feel like there is a better way to use the Regex than what I am doing. Reading the docs it seems like re.sub should do it, but it is only replacing the first account number it comes upon, and it only want to replace the account number not add the '||' separator to the beginning.


update:

Using the following it gets close to what I want but not sure why the first item in the list is ''.

example = '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73'  
rex = re.compile('(?=[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9])')
re.split(rex, example)

outputs:

['',
 '02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  ',
 '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']
Infinity Cliff
  • 356
  • 2
  • 6
  • 20
  • Do you actually need a regular expression that matches the entire account info, or do you just need to be able to find the boundaries? If the latter, what is the actual rule that tells you where one account's info ends and the next begins? While looking through the `re` module documentation, did you see `re.split`? 90% of programming is *precisely* understanding the problem that needs to be solved. – Karl Knechtel May 10 '22 at 14:58
  • I cannot get ```re.split``` to work. what I want the regex to return is ```['02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42', ' 02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73']``` or ```'||02-0015800-00 NAME1 100 SOME ST Active 3/8/2021 139.23 139.81 0.42 ||02-0023901-01 NAME2 101 SOME ST Active 3/8/2021 512.33 482.96 -5.73'```. Where in the latter I can use the '||' to split each account into a list. – Infinity Cliff May 10 '22 at 15:52
  • Another approach: there is a set number of fields in each line (seems to be eight of them), right? What if you just split the data into fields (separated by the double space), and then reshaped that data into 8-by-N? – Karl Knechtel May 10 '22 at 15:54
  • @Karl Knechtel I wish it were that easy, but they are not all 8, some have 7 and some 9 – Infinity Cliff May 10 '22 at 15:57
  • Where did this data come from? In the cases where there are 9 values, what do they all represent? In the cases with fewer than 9, is it always the same ones that are missing? Because ragged data like that is a much bigger problem that should have been mentioned up front - the point of a `DataFrame` is that it represents a rectangular, consistent table of data. – Karl Knechtel May 10 '22 at 16:03
  • @Karl Knechtel no, you are reading too much into it. Once I have the data separated into lists I will only be using the first 5 items in the list to make the dataframe, so that is not part of the question, just showing the ultimate end results. I am only asking about the regex part of the code. – Infinity Cliff May 10 '22 at 16:33
  • @Karl Knechtel I edited the question a bit to hopefully make things clearer – Infinity Cliff May 10 '22 at 16:39

1 Answers1

1

Instead of using split, you can match the values:

\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)

Explanation

  • \b\d\d-\d{7}-\d\d\b Match the pattern with 2 digits - 7 digits - 2 digits using a quantifier
  • .*? Match any character as least as possible
  • (?=\s*\b\d\d-\d{7}-\d\d\b.*?|$) Positive lookahead to assert either the digits pattern to the right or the end of the string to also match the last occurrence

Regex demo

Example

import re

pattern = r"\b\d\d-\d{7}-\d\d\b.*?(?=\s*\b\d\d-\d{7}-\d\d\b.*?|$)"

s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"

print(re.findall(pattern, s))

Output

['02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42', '02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73']

If you must use split:

import re

pattern = r"\b(?=\d\d-\d{7}-\d\d\b)"

s = "02-0015800-00  NAME1  100 SOME ST  Active  3/8/2021  139.23  139.81  0.42  02-0023901-01  NAME2  101 SOME ST  Active  3/8/2021  512.33  482.96  -5.73"

result = [m.strip() for m in re.split(pattern, s) if m]
print(result)

See a Python demo

The fourth bird
  • 154,723
  • 16
  • 55
  • 70
  • forth bird I like the split version of your answer, at least for me, it is easier to see what the code is doing. Thanks for explanation, it helps a lot. What is purpose of the ```\b``` at the beginning and end of the pattern? Is one version faster then the other? – Infinity Cliff May 13 '22 at 12:17