I am trying to clean up some data, but each row of the raw data has a different number of columns. I know there are similar questions, such as

"Reading a CSV file with irregular number of columns using Pandas", which almost describes my problem, but in that case the user has a fixed maximum number of columns/rows. In my case I don't know the maximum number of columns (or in which line it occurs).

structure of my dataset

The dataset (which you can find here: https://docs.google.com/spreadsheets/d/1tXigrvlofn5paoB_-JvPt7Ent2ZNZmzmlk6FZHhzNSA/edit?usp=sharing) has a header line, but the header is useless and I won't import it. After the 8th column, the following 6 columns repeat continuously:

1 car - - 299.80 115.12 etc.
2 Heavy Vehicle - - 333.12 89.02 etc.

1st approach

Following "How to know CSV line count before loading in python?", with

dataname = 'csv_file.csv'
with open(dataname) as f:  # close the file once the count is done
    lines = sum(1 for line in f)

I can count the number of rows without loading the CSV in Python, but I haven't found a similar way to count the number of columns without loading the CSV file.
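For illustration, one possible way to get the largest column count with the same kind of streaming pass is sketched below. It assumes the file is `;`-delimited and UTF-8 encoded; the function name `max_field_count` is hypothetical and not part of any library.

```python
import csv


def max_field_count(path, delimiter=';', encoding='utf-8'):
    """Return the largest number of fields found on any line of the file.

    The file is streamed row by row, so it is never held in memory
    as a whole. Assumes a ';'-delimited, UTF-8 encoded file.
    """
    with open(path, newline='', encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        # default=0 covers the case of an empty file
        return max((len(row) for row in reader), default=0)
```

With that count `n`, something like `pd.read_csv(dataname, sep=';', header=None, skiprows=1, names=range(n))` may then read the ragged rows, padding the shorter ones with NaN; this is untested against the dataset above.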

2nd approach

I found another approach (which almost solves my issue) in "Python/Pandas –– ParserError: Error tokenizing data. C error: Expected x fields in line i, saw y". I get the error

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1701 fields in line 3, saw 2829

when I use `data_raw2 = pd.read_csv(dataname, delimiter=';', decimal='.', header=0)`. The solution in that thread is to use

with open(dataname, 'r', encoding='utf-8') as f:
    bla = pd.read_csv(f,  engine='python').dropna(how='all', axis=1)

but this returns a conjoined dataframe with the given number of rows but only one column (see picture below).

given dataframe after using 2nd approach

QuagTeX
  • After 2nd approach you can do: `bla[0].str.split(';', expand=True)` – Sayandip Dutta Jul 09 '21 at 08:43
  • @Cyttorak that line raises the following error: `raise KeyError(key) from err`, `KeyError: 0`. Could it be because I am using Python 3.8.6? – QuagTeX Jul 09 '21 at 09:01
  • No, I assumed `0` to be the first column name. Use `bla[bla.columns[0]]` instead of `bla[0]` – Sayandip Dutta Jul 09 '21 at 09:14
  • Thumbs up! That worked! Now I can continue trying to transpose and rearrange the array. – QuagTeX Jul 09 '21 at 09:33
  • With a basic text read you can read the file line by line, splitting each, and collecting the results in a list of lists. Once you've loaded all, you know the number of lines, and the maximum length of each. Go back through the list, and expand each sublist as needed. `np.genfromtxt` collects such a list of lists, and makes the array at the end. – hpaulj Jul 09 '21 at 19:31
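
The line-by-line approach described in the last comment could be sketched roughly as follows. This is only an illustration under assumptions: the file is `;`-delimited and UTF-8 encoded, the first line is the unwanted header, and the helper name `read_ragged` is made up for this example. It builds the list of lists by hand instead of using `np.genfromtxt`.

```python
import pandas as pd


def read_ragged(path, delimiter=';', skip_header=True, encoding='utf-8'):
    """Read a delimited file whose rows differ in length into a DataFrame."""
    with open(path, encoding=encoding) as f:
        if skip_header:
            next(f, None)  # discard the header line, if there is one
        # Split every non-empty line into a list of string fields.
        rows = [line.rstrip('\n').split(delimiter) for line in f if line.strip()]
    # Only after reading everything is the widest row known.
    width = max((len(r) for r in rows), default=0)
    # Pad the shorter rows with None so all rows have the same length.
    padded = [r + [None] * (width - len(r)) for r in rows]
    return pd.DataFrame(padded)
```

All values come back as strings, so numeric columns would still need a conversion such as `pd.to_numeric` afterwards.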

0 Answers