-2

I have some text file that I want to load into my python code, but the format of the txt file is not suitable.

Here is what it contains:

SEQ  MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNY
SS3  CCCHHHHHHHHHHHHCCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     95024445656543114678678999999999999999888889998886
SS8  CCHHHHHHHHHHHHHHCCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     96134445555554311253378999999999999999999999999987
SA   EEEbBBBBBBBBBBbEbEEEeeEeBeEbBEEbbEeBeEbbeebBbBbBbb
     41012123422000000103006262214011342311110000030001
TA   bhHHHHHHHHHHHHHgIihiHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
     00789889988663201010099999999999999999898999998741
CD   NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
     54433221111112221122124212411342243234323333333333

I want to convert it into panda Dataframe to have SEQ SS4 SA TA CD SS8 as columns of the DataFrame and the line next to them as the rows. Like this:enter image description here

I tried pd.read_csv but it doesn't give me the result I want.

Thank you !

Rayanaay
  • 85
  • 1
  • 9

3 Answers3

1

Steps

  1. Use pd.read_fwf() to read files in a fixed-width format.
  2. Fill the missing values with the last available value by df.ffill().
  3. Assign group number gp for the row number in the output using a groupby-cumcount construct.
  4. Move gp=(0,1) to columns by df.pivot, and then transpose again into the desired output.

Note: this solution works with arbitrary (includes zero, and of course not too many) consecutive lines with omitted values in the first column.

Code

# data (3 characters for the second column only)
file_path = "/mnt/ramdisk/input.txt"
df = pd.read_fwf(file_path, names=["col", "val"])

# fill the blank values
df["col"].ffill(inplace=True)
# get correct row location
df["gp"] = df.groupby("col").cumcount()
# pivot group (0,1) to columns and then transpose. 
df_ans = df.pivot(index="col", columns="gp", values="val").transpose()

Result

print(df_ans)  # show the first 3 characters only

col   CD   SA  SEQ  SS3  SS8   TA
gp                               
0    NNN  EEE  MSS  CCC  CCH  bhH
1    544  410  NaN  950  961  007

Then you can save the resulting DataFrame using df_ans.to_csv().

Bill Huang
  • 4,491
  • 2
  • 13
  • 31
1

To read a text file using pandas.read_csv() method, the text file should contain data separated with comma.

 SEQ, SS3, ....
 MSSSSWLLLSLVAVTAAQSTIEEQ..., CCCHHHHHHHHHHHHCCCCCCHHHHHHH.....
Prakriti Shaurya
  • 187
  • 1
  • 3
  • 14
0

You can use this script to load the .txt file to DataFrame and save it as csv file:

import pandas as pd


data = {}
with open('<your file.txt>', 'r') as f_in:
    for line in f_in:
        line = line.split()        
        if len(line) == 2:
            data[line[0]] = [line[1]]

df = pd.DataFrame(data)
print(df)
df.to_csv('data.csv', index=False)

Saves this CSV:

enter image description here

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91