5

I have a (very ugly) txt output from an SQL query which is performed by external system that I can't change. Here is the output example:

FruitName      Owner             OwnerPhone
=============  ================= ============
Red Apple      Sr Lorem Ipsum    123123
Yellow Banana  Ms Dolor sir Amet 456456

As you can see, the FruitName column and the Owner column may consists of few words and there's no fixed pattern in how many words could be in these columns. If I use line.split() to make array on each line Python, it will remove all the whitespace and make the array become like this:

['Red', 'Apple', 'Sr', 'Lorem', 'Ipsum', '123123']
['Yellow', 'Banana', 'Ms', 'Dolor', 'sir', 'Amet', '456456']

The question is, how can I split it properly into output like this:

['Red Apple', 'Sr Lorem Ipsum', '123123']
['Yellow Banana', 'Ms Dolor sir Amet', '456456']

I'm a newbie in Python and I dont know if such thing is possible or not. Any help will be very much appreciated. Thanks!

CypherX
  • 7,019
  • 3
  • 25
  • 37
randms26
  • 137
  • 1
  • 6
  • 16
  • 3
    Is the row of equals signs part of the output? If so, that'd be a very easy way to get the column widths. You can then extract the fields via slicing. – kindall Oct 16 '19 at 03:30
  • first column has length 13 chars so using `line[:13].strip()` you can get it. Similar for other columns. If you use `pandas` then it has `read_fwf` to read files with fixed-width-fields. – furas Oct 16 '19 at 03:50
  • Can you split using regular expression to split when the number of spaces are more than 2? – Ashok KS Oct 16 '19 at 04:23
  • the first and second row is not part of the output. I think I'm agree with furas's answer below – randms26 Oct 16 '19 at 04:55

3 Answers3

1

You can use the ==== dividers to your advantage in that you can get slices in all lines corresponding to the start and end indices of each ==== that represents a column:

def get_divider_indices(line):
  i, j = 0, line.index(' ')
  indices = []
  while i != -1:
    indices.append((i, j))
    i = line.find('=', j)
    j = line.find(' ', i)
    if j == -1: j = len(line)
  return indices

with open('data.txt', 'r') as f:
  lines = f.readlines()
  dividers = get_divider_indices(lines[1])
  rows= []
  for line in lines[2:]:
    rows.append([line[s:e].strip() for s, e in dividers])

print(rows)

Output

[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]

Note that you can use str.find() to get the index of a character in a string (which I use above to get the index of an = or a space in the divider line).

slider
  • 12,810
  • 1
  • 26
  • 42
1

Columns have fixed widths so you can use it and slice lines

data = '''FruitName      Owner             OwnerPhone
=============  ================= ============
Red Apple      Sr Lorem Ipsum    123123
Yellow Banana  Ms Dolor sir Amet 456456'''

lines = data.split('\n')

for line in lines[2:]:
    fruit = line[:13].strip()
    owner = line[13:32].strip()
    phone = line[32:].strip()
    print([fruit, owner, phone])

More complex solution would use second line - with === - to calculate widths for columns and use them in slicing.

furas
  • 134,197
  • 12
  • 106
  • 148
  • 1
    this works for me!! i took your advise, i used the second line to calculate each column length and use the int value to get substring on the loop. Thanks! very valuable insight – randms26 Oct 16 '19 at 04:57
1

As suggested by others you can use the length of each divider to calculate the length of the columns. The following example illustrates just that:

rows = list()
with open('data.txt', 'r') as f:
    lines = f.readlines()

    dividers = lines[1].split() 

    for line in lines[2:]:
        row = []
        prvLength = 0
        for d in dividers:
            start = prvLength
            length = start+len(d)+1
            row.append(line[start:start+length].strip())
            prvLength += length
        rows.append(row)
print(rows)

Output

[['Red Apple', 'Sr Lorem Ipsum', '123123'], ['Yellow Banana', 'Ms Dolor sir Amet', '456456']]


You can also check if the columns are separated by TABS, i.e. '\t'. If so, you can just split the line of lines using line.split('\t') which would be much more simple.