
I have several .txt files with 140k+ lines each. They all contain three types of records, each a mix of strings and floats:

- 7 columns
- 14 columns
- 18 columns

What is the best and fastest way to parse such data?

I tried numpy.genfromtxt with usecols=np.arange(0, 7), but that obviously drops the 14- and 18-column data.

import numpy as np

# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0, 7))

I would like to parse the data as efficiently as possible.


1 Answer


The solution is rather simple and intuitive: check whether the number of columns in each row equals one of the expected counts and append the row to the matching list. For easier analysis and modification, each list can then be converted to a Pandas DataFrame or a NumPy array; below I show the DataFrame conversion. The rows in my dataset have 7, 14, or 18 columns, and I want the data labeled, so I pass a list of column names to each DataFrame via the columns argument.

import pandas as pd

filename = "textfile.txt"

# Column labels for each record type; the 7-column labels come from the
# question, the other two lists are placeholders to replace with your own.
labels_array1 = ['day', 'tod', 'condition', 'code', 'type', 'state', 'timing']
labels_array2 = [f"col{i}" for i in range(14)]  # replace with your 14 labels
labels_array3 = [f"col{i}" for i in range(18)]  # replace with your 18 labels

array1 = []  # rows with 7 columns
array2 = []  # rows with 14 columns
array3 = []  # rows with 18 columns

with open(filename, "r") as f:
    for line in f:  # iterate lazily instead of reading the whole file at once
        items = line.split()
        num_items = len(items)
        if num_items == 7:
            array1.append(items)
        elif num_items == 14:
            array2.append(items)
        elif num_items == 18:
            array3.append(items)
        else:
            print("Detected a line with a different column count:", num_items)

df1 = pd.DataFrame(array1, columns=labels_array1)
df2 = pd.DataFrame(array2, columns=labels_array2)
df3 = pd.DataFrame(array3, columns=labels_array3)
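
At this point every value in the DataFrames is still a string, because the rows were only split, not typed. Since the question mentions a mix of strings and floats, here is a minimal sketch of converting numeric columns afterwards; the choice of 'timing' as a float column is an assumption based on the question's 7-column labels:

# Hypothetical follow-up: the parsed values are all strings, so convert
# numeric columns explicitly. 'timing' comes from the question's 7-column
# labels and is assumed here to hold floats; errors="coerce" turns
# unparseable entries into NaN.
df1["timing"] = pd.to_numeric(df1["timing"], errors="coerce")

# Alternatively, convert every column that parses cleanly as numeric and
# leave pure-string columns untouched:
for col in df1.columns:
    converted = pd.to_numeric(df1[col], errors="coerce")
    if not converted.isna().all():  # at least one value parsed as a number
        df1[col] = converted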