I am trying to clean up some data, but each row of the raw data has a different number of columns. I know there are similar questions, such as
"Reading a CSV file with irregular number of columns using Pandas", which almost describes my problem, but in that case the user has a fixed maximum number of columns/rows. In my case I don't know the maximum number of columns (or in which line it occurs).
structure of my dataset
The dataset (which you can find here: https://docs.google.com/spreadsheets/d/1tXigrvlofn5paoB_-JvPt7Ent2ZNZmzmlk6FZHhzNSA/edit?usp=sharing) has a header line, but the header is useless and I won't import it. After the 8th column, the following 6 columns repeat continuously:
1 | car | - | - | 299.80 | 115.12 | etc. |
2 | Heavy Vehicle | - | - | 333.12 | 89.02 | etc. |
1st approach
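Using the approach from "How to know CSV line count before loading in python?":
dataname = 'csv_file.csv'
# Count the lines without parsing the file; the with-block closes the file afterwards
with open(dataname) as f:
    lines = sum(1 for _ in f)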
I can count the number of rows without loading the csv in python but I haven't found a similar way to count the number of columns without loading the csv file.
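For illustration, the same line-by-line idea can be extended to columns. The following is only a minimal sketch: it assumes the file is delimited by `;` (as in the `read_csv` call further down) and encoded as UTF-8, and the function name `max_field_count` is hypothetical. It parses each line with the standard `csv` module and keeps the largest field count, without building a DataFrame.

```python
import csv


def max_field_count(path, delimiter=';', encoding='utf-8'):
    """Return the largest number of fields found on any line of the file.

    The delimiter and encoding defaults are assumptions about the data.
    """
    with open(path, newline='', encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter)
        # Rows are read one at a time, so the whole file is never held in memory.
        return max((len(row) for row in reader), default=0)
```

Calling `max_field_count('csv_file.csv')` would give the width of the widest row, which could then be used to size the column list when reading the file.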
2nd approach
I found another approach (which almost solves my issue) in "Python/Pandas - ParserError: Error tokenizing data. C error: Expected x fields in line i, saw y". When I read the file with
data_raw2 = pd.read_csv(dataname, delimiter=';', decimal='.', header=0)
I get the error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1701 fields in line 3, saw 2829
The solution suggested in that thread is to use
with open(dataname, 'r', encoding='utf-8') as f:
    bla = pd.read_csv(f, engine='python').dropna(how='all', axis=1)
However, this returns a DataFrame with the expected number of rows but only one column, with all fields of a row joined together (see picture below).
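One possible reason for the single column is that this call does not pass `delimiter=';'`, so pandas splits on commas by default. As a hedged sketch of a two-pass alternative (not a confirmed solution), the widest row can be measured first and that width handed to `read_csv` through `names`, so that shorter rows are padded with NaN. The function name `read_ragged_csv` is hypothetical, and the `;` delimiter, UTF-8 encoding, and single header line to skip are assumptions taken from the description above.

```python
import csv

import pandas as pd


def read_ragged_csv(path, delimiter=';', encoding='utf-8'):
    """Read a CSV whose rows have different lengths into one DataFrame."""
    # First pass: find the widest row without building a DataFrame.
    with open(path, newline='', encoding=encoding) as f:
        width = max((len(row) for row in csv.reader(f, delimiter=delimiter)), default=0)

    # Second pass: explicit column names of that width let pandas pad
    # shorter rows with NaN instead of raising a ParserError.
    return pd.read_csv(
        path,
        sep=delimiter,
        decimal='.',
        header=None,          # no line is used for column names
        names=range(width),   # columns are labelled 0 .. width-1
        skiprows=1,           # assumption: the first line is the unwanted header
        encoding=encoding,
    )
```

With `data = read_ragged_csv('csv_file.csv')`, the repeating 6-column groups after the 8th column would then sit in consecutive integer-labelled columns and could be sliced by position.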