
I have a file that is separated by blank lines into chunks with the same number of rows. Each row is a field. For example, in chunk 1 the first field is a1,a2,a3, and in chunk 2 the same field is a2,a3,a4.

a1,a2,a3
b1
c1,c2,c3,c4
d1
e1

a2,a3,a4
b2
c3,c4
d2
e2

a3,a5
b3
c4,c6
d3
e3

How can I get a dataframe (or other data structure) like below?

    f1        f2       f3            f4  f5 
    a1,a2,a3  b1       c1,c2,c3,c4   d1  e1
    a2,a3,a4  b2       c3,c4         d2  e2
    a3,a5     b3       c4,c6         d3  e3

Thanks!

Jia

3 Answers


An open file is an iterator of lines. You want an iterator of groups of lines.

Since all of these groups are 6 lines long (counting the blank line at the end), the easiest way to do this is to use the grouper example from the itertools recipes in the docs. (You can also get a pre-made version from the more-itertools library on PyPI, if you prefer.)

from itertools import groupby, zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(path) as f:
    for group in grouper(f, 6):
        do_something(group)
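For instance, here is that pattern end to end, with a list of strings standing in for the open file so it runs on its own, and a hypothetical `do_something` that just strips the newlines and drops the blank separator line:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# A list of strings stands in for the open file here so the example
# is self-contained; pass the file object itself in real use.
lines = ["a1,a2,a3\n", "b1\n", "c1,c2,c3,c4\n", "d1\n", "e1\n", "\n",
         "a2,a3,a4\n", "b2\n", "c3,c4\n", "d2\n", "e2\n", "\n"]

rows = []
for group in grouper(lines, 6, fillvalue=""):
    # Strip newlines and drop the blank separator at the end of the group.
    rows.append([line.strip() for line in group if line.strip()])

print(rows)
```

Each 6-line group becomes one 5-field row of the desired table.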

If the length of your groups isn't known in advance (even if it will always be consistent within a file), you can instead use groupby to create alternating groups of empty and non-empty lines. This is kind of like using split on a string.

One subtlety: lines read from a file keep their trailing newline, so a "blank" separator line is actually '\n', which is truthy. The key function therefore needs to strip before testing; lambda line: bool(line.strip()) does the trick. (A plain bool key only works if the newlines have already been stripped.)

with open(path) as f:
    for nonempty, group in groupby(f, key=lambda line: bool(line.strip())):
        if nonempty:
            do_something(group)
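A quick self-contained check of that pattern, again with a small list of strings standing in for the file:

```python
from itertools import groupby

# Two chunks of two lines each, separated by a blank line.
lines = ["a1,a2,a3\n", "b1\n", "\n", "a2,a3,a4\n", "b2\n", "\n"]

rows = []
for nonempty, group in groupby(lines, key=lambda line: bool(line.strip())):
    if nonempty:
        # Each truthy group is one chunk; strip the newlines.
        rows.append([line.strip() for line in group])

print(rows)  # [['a1,a2,a3', 'b1'], ['a2,a3,a4', 'b2']]
```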

Or, if this seems way over your head… well, first read David Beazley's Generator Tricks for Systems Programmers, and maybe it won't be over your head anymore. But if it is, we can do the same thing a bit more explicitly:

with open(path) as f:
    group = []
    for line in f:
        if line.strip():  # blank separators are '\n', which is truthy unstripped
            group.append(line)
        else:
            do_something(group)
            group = []
    if group:
        do_something(group)
abarnert

If you can use pandas and know how many fields there are:

import pandas as pd

fields = 5
df = pd.read_table('data.txt', header=None)
df = pd.DataFrame(df.values.reshape(-1, fields))

If you don't know how many fields there are:

import numpy as np
import pandas as pd

df = pd.read_table('data.txt', header=None, skip_blank_lines=False)
# Add one more NaN row so the last chunk also ends with a separator.
df = pd.concat([df, pd.DataFrame([np.nan])], ignore_index=True)
# Blank lines become NaN rows. Find the first of them.
fields = np.where(pd.isnull(df).values)[0][0]
df = pd.DataFrame(df.values.reshape(-1, fields + 1))
del df[df.columns[-1]]  # delete the all-NaN separator column
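Either way, the question's f1…f5 headers can be attached afterwards. A small sketch, where a literal frame stands in for the reshaped result above:

```python
import pandas as pd

# Literal data standing in for the reshaped read_table result.
df = pd.DataFrame([["a1,a2,a3", "b1", "c1,c2,c3,c4", "d1", "e1"],
                   ["a2,a3,a4", "b2", "c3,c4", "d2", "e2"],
                   ["a3,a5", "b3", "c4,c6", "d3", "e3"]])

# Rename the default integer columns 0..4 to f1..f5.
df.columns = [f"f{i + 1}" for i in range(df.shape[1])]
print(df)
```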
ldavid

You can try a generator approach:

def chunks_by_space(file):
    with open(file, 'r') as f:
        store = []
        for line in f:
            value = line.strip()
            if value == '':
                yield store
                store = []
            else:
                store.append(value)
        if store:  # last chunk, in case there is no trailing blank line
            yield store

gen = chunks_by_space('file_name')
print(list(zip(*gen)))

output:

[('a1,a2,a3', 'a2,a3,a4', 'a3,a5'), ('b1', 'b2', 'b3'), ('c1,c2,c3,c4', 'c3,c4', 'c4,c6'), ('d1', 'd2', 'd3'), ('e1', 'e2', 'e3')]
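Since each yielded chunk is already one row of the desired table, the chunks can also go straight into pandas without the transpose. A sketch, assuming pandas is available, with a list of lines standing in for the file:

```python
import pandas as pd

def chunks_by_space(lines):
    # Same chunking idea as above, but over an in-memory list of lines.
    store = []
    for value in (line.strip() for line in lines):
        if value == '':
            yield store
            store = []
        else:
            store.append(value)
    if store:
        yield store

lines = ["a1,a2,a3", "b1", "c1,c2,c3,c4", "d1", "e1", "",
         "a2,a3,a4", "b2", "c3,c4", "d2", "e2", "",
         "a3,a5", "b3", "c4,c6", "d3", "e3"]

df = pd.DataFrame(chunks_by_space(lines),
                  columns=["f1", "f2", "f3", "f4", "f5"])
print(df)
```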
Aaditya Ura