
I have data in several csv files (similar data structure, but not the same), where certain lines have different numbers of columns.

For example, the first three lines of each csv file have a varying number of columns, i.e.:

----------------
Table | Format |
----------------
Code  | Label  | Index  |
-----------------------------------------
  a   |   b    |   c    |   d   |   e   |
-----------------------------------------

which looks kind of ugly and makes it difficult to read into pandas to work with.

I want to make the table recognize the maximum number of columns in a file and pad the empty spaces so that the dimensions are equal, i.e.:

-----------------------------------------
Table | Format |  pad   |  pad  |  pad  |
-----------------------------------------
Code  | Label  | Index  |  pad  |  pad  |
-----------------------------------------
  a   |   b    |   c    |   d   |   e   |
-----------------------------------------

So far, I have looked into reading with pandas and adding headers to the csv file, but because the maximum number of columns varies from file to file, I've been kind of stuck.

Any help or pointers would be appreciated!

Vinci

2 Answers


If your column separator is a comma, you can pad simply by appending an appropriate number of commas to the end of each row. When you then read the padded file with read_csv, pandas will read the padded values in as NaN.

with open('/path/to/data.csv', 'r') as f:
    # splitlines() keeps each row intact even if values contain spaces
    data = f.read().splitlines()

# Count the number of columns in each line
cols = [row.count(',')+1 for row in data]
# Find the widest row
max_cols = max(cols)

# Loop over lines in text
for i, row in enumerate(data):
    # Pad extra columns onto shorter rows where necessary
    if cols[i] < max_cols:
        data[i] += (max_cols - cols[i]) * ','

# Write the data out
with open('/path/to/pad_data.csv', 'w') as f:
    f.write('\n'.join(data))

Setting up some test data:

data = '1,2,3\n4,\n5,6,7,8,9\n'
print(data)
#1,2,3
#4,
#5,6,7,8,9

Applying the method above gives:

print('\n'.join(data))
#1,2,3,,
#4,,,,
#5,6,7,8,9
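
If you then point read_csv at the padded file, the padded fields come back as NaN (a minimal sketch, assuming the padded test data above has been written out to pad_data.csv):

import pandas as pd

# The padded trailing commas are empty fields, which pandas reads as NaN
df = pd.read_csv('/path/to/pad_data.csv', header=None)
print(df)
#    0    1    2    3    4
# 0  1  2.0  3.0  NaN  NaN
# 1  4  NaN  NaN  NaN  NaN
# 2  5  6.0  7.0  8.0  9.0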
jwalton
  • I tried using `skiprows` before, but I realized some of the rows randomly have a varying number of columns, so I couldn't specify which rows to skip. – Vinci Feb 20 '19 at 16:48
  • Okay. Given that some rows have more columns than others I'd reconsider the appropriateness of using a dataframe. – jwalton Feb 20 '19 at 16:52
  • Please see my edit for a solution not using skiprows – jwalton Feb 20 '19 at 17:08

Here's a little script I wrote to pad out columns derived from a pandas dataframe. My intermediate file was pipe-delimited:

INPUT_FILE = r'blah.txt'
OUTPUT_FILE = r'blah.TAB'

# First pass: initialise one width entry per column from the header row
col_widths = []
with open(INPUT_FILE, "r") as fi:
    line = fi.readline()
    headers = line.split(sep='|')
    for h in headers:
        col_widths.append(len(h))

# Second pass: widen each column to fit its longest value
with open(INPUT_FILE) as fi:
    line = fi.readline()
    while line:
        cols = line.split(sep='|')
        line = fi.readline()
        index = 0
        for c in cols:
            if len(c) > col_widths[index]:
                col_widths[index] = len(c)
            index += 1

# Third pass: write every field left-justified to its column width
with open(INPUT_FILE) as fi, open(OUTPUT_FILE, 'w') as fo:
    line = fi.readline()
    while line:
        tokens = line.split(sep='|')
        index = 0
        for t in tokens:
            if index == len(col_widths) - 1:
                # strip the newline from the last column before padding
                t = t.rstrip('\r\n')
            ft = '{:<' + str(col_widths[index]) + '}'
            v = ft.format(t)
            fo.write(v + '|')
            index += 1
        fo.write('\r')
        line = fi.readline()
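
To check the result, the aligned file can be read back with pandas (a sketch only; blah.TAB, the '|' delimiter, and the '\r' row separator all come from the script above, and the strip steps just remove the alignment padding):

import pandas as pd

# sep='|' splits on the delimiter; the trailing '|' on every row produces one
# extra, completely empty column, which is dropped afterwards
df = pd.read_csv('blah.TAB', sep='|', lineterminator='\r')
df = df.dropna(axis=1, how='all')
# remove the alignment padding from the header names and any string columns
df.columns = [str(c).strip() for c in df.columns]
df = df.apply(lambda s: s.str.strip() if s.dtype == object else s)
print(df.head())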
TomServo