0

Question: How can I split a list into two sublists where the elements are separated by a tab in the element?

Context: I want to read a .txt file delimited by tabs into a Pandas DataFrame. The files look something like:

Column1 \t 123
Column2 \t
Column3 \t text

Meaning that each line has one column followed by one tab and then one value of the column (sometimes no value).

My idea was to read the file and save each line as an element of a list, then split the list into two keeping the first part before the tab as one list and the second part after the tab as another. Then build my dataframe from there.

for file in txt_files:  #iterate over all files
  f = open(file)        #open each file individually
  lines = f.readlines() #read each line as an element into a list 
  f.close()

#make sublists columns and values
Scott Skiles
  • 3,647
  • 6
  • 40
  • 64

2 Answers2

1

You can read your files into a dataframe like this:

import pandas as pd

# Empty list to store dataframe rows
df_rows = []

# Read all text files
for tf in text_files:
    # For each file
    with open(tf) as f:
        # Empty dictionary to store column names and values
        df_dict = {}

        # For each line
        for line in f:
            # Split by tab
            k, v = line.split('\t')

            # Column name as key, value as  value
            df_dict[k] = v

        # Add the dictionary to list
        df_rows.append(df_dict)

# Read the list of dictionaries as a dataframe
df = pd.DataFrame(df_rows)

# Preview dataframe
df.head()
Abhinav Sood
  • 799
  • 6
  • 23
1

If I understand correctly, you can just transpose the dataframe read_csv will give you with delimiter='\t'.

Demo:

>>> from io import StringIO           
>>> import pandas as pd                                                         
>>>                                                                             
>>> file = StringIO('''Column1\t123 
...: Column2\t 
...: Column3\ttext''')                                                      
>>>                                                                             
>>> df = pd.read_csv(file, delimiter='\t', index_col=0, header=None).T                                                                
>>> df                                                                          
>>>
0 Column1 Column2 Column3
1     123     NaN    text

(If your delimiter is really ' \t ' then use delimiter=' \t ' and engine='python').

timgeb
  • 76,762
  • 20
  • 123
  • 145