
I've got a corrupt CSV file with random duplicates of the header row scattered through the data. How can I ignore or delete these rows while loading the file into a data frame?

Because these stray header rows are in the file, pandas raises an error while loading. I would like to ignore these rows while loading with pandas, or delete them somehow before loading.

The file looks like this:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3  <- this is a random copy of the header inside the file
0, 1, 1
0, 0, 0
1, 1, 1

I want:

col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
0, 1, 1
0, 0, 0
1, 1, 1
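One way to delete the stray header lines before pandas ever sees them is to pre-filter the raw lines of the file. This is only a sketch under the assumptions that the file is named `data.csv` and that a stray header line is an exact repeat of the first line:

```python
import io
import pandas as pd

# Recreate the corrupt file from the question (assumption: file name data.csv).
raw = ('col1, col2, col3\n'
       '0, 1, 1\n0, 0, 0\n1, 1, 1\n'
       'col1, col2, col3\n'
       '0, 1, 1\n0, 0, 0\n1, 1, 1\n')
with open('data.csv', 'w') as f:
    f.write(raw)

with open('data.csv') as f:
    lines = f.readlines()

# Keep the first header line; drop any later line that repeats it verbatim.
cleaned = [lines[0]] + [ln for ln in lines[1:] if ln.strip() != lines[0].strip()]

# Load the cleaned text; skipinitialspace strips the spaces after the commas.
df = pd.read_csv(io.StringIO(''.join(cleaned)), skipinitialspace=True)
```

Because the duplicate headers never reach `read_csv`, the columns are parsed as integers directly and no post-load filtering or casting is needed.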
vince
    if all data are numerics try : `pd.read_csv('file.csv').apply(pd.to_numeric,errors='coerce').dropna().reset_index(drop=True)` ? – anky Sep 05 '19 at 15:39
  • Possible duplicate of [Pandas dataframe read\_csv on bad data](https://stackoverflow.com/questions/33440805/pandas-dataframe-read-csv-on-bad-data) – bamdan Sep 05 '19 at 15:55

2 Answers


Pass na_filter = False so the values in each column are kept as strings. Then locate the rows containing the stray header and filter them out of your dataframe.

>>> df = pd.read_csv('sample.csv', header = 0, na_filter = False)
>>> df
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3  col1  col2  col3
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df.iloc[0,0])
<class 'str'>

Now that the data in each column is parsed as strings, locate the rows whose values are the literal header names and tag them in a new column using np.where():

>>> df['Tag'] = np.where(((df['col1'] != '0') & (df['col1'] != '1')) & ((df['col2'] != '0') & (df['col2'] != '1')) & ((df['col3'] != '0') & (df['col3'] != '1')), ['Remove'], ['Don\'t remove'])
>>> df
   col1  col2  col3           Tag
0     0     1     1  Don't remove
1     0     0     0  Don't remove
2     1     1     1  Don't remove
3  col1  col2  col3        Remove
4     0     1     1  Don't remove
5     0     0     0  Don't remove
6     1     1     1  Don't remove

Now, filter out the rows tagged as Remove in the Tag column using isin():

>>> df2 = df[~df['Tag'].isin(['Remove'])]
>>> df2
  col1 col2 col3           Tag
0    0    1    1  Don't remove
1    0    0    0  Don't remove
2    1    1    1  Don't remove
4    0    1    1  Don't remove
5    0    0    0  Don't remove
6    1    1    1  Don't remove

Drop the Tag column:

>>> df2 = df2[['col1', 'col2', 'col3']]
>>> df2
  col1 col2 col3
0    0    1    1
1    0    0    0
2    1    1    1
4    0    1    1
5    0    0    0
6    1    1    1

Finally, cast the dataframe to int if you need integer values:

>>> df2 = df2.astype(int)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
4     0     1     1
5     0     0     0
6     1     1     1
>>> type(df2['col1'][0])
<class 'numpy.int32'>

Note: If you want standard index use:

>>> df2.reset_index(inplace = True, drop = True)
>>> df2
   col1  col2  col3
0     0     1     1
1     0     0     0
2     1     1     1
3     0     1     1
4     0     0     0
5     1     1     1
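The steps above (read as strings, tag, filter, cast, reset the index) can be condensed into a short sketch. This assumes a `sample.csv` laid out like the question's file (written here without spaces after the commas, as the transcript above implies):

```python
import pandas as pd

# Recreate the sample file from the question (assumption: no spaces after commas).
with open('sample.csv', 'w') as f:
    f.write('col1,col2,col3\n'
            '0,1,1\n0,0,0\n1,1,1\n'
            'col1,col2,col3\n'
            '0,1,1\n0,0,0\n1,1,1\n')

# Read everything as strings, drop rows that repeat the header,
# cast back to int, and rebuild a standard index.
df = pd.read_csv('sample.csv', header=0, na_filter=False)
df = df[df['col1'] != 'col1'].astype(int).reset_index(drop=True)
```

Comparing a single column against its own header name is enough here because a stray header row repeats the name in every column.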
Joe

You just need to do the following. Assuming df_raw is your original dataframe, with the column headers present both as the column names and repeated in several other rows, the corrected dataframe is df.

# Filter out only the rows without the headers in them.
headers = df_raw.columns.tolist()
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)

Assumption:
- A row whose first column contains the first column header is a stray header row and has to be dropped.

In Detail
Here is a detailed code block for anyone to
- create the data,
- write it into a csv file,
- load it as a dataframe, and then
- remove rows that are headers.

import numpy as np
import pandas as pd

# make a csv file to load as dataframe
data = '''col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1
col1, col2, col3
0, 1, 1
0, 0, 0
1, 1, 1'''
# Write the data to a csv file
with open('data.csv', 'w') as f:
    f.write(data)
# Load your data with header=None
df_raw = pd.read_csv('data.csv', header=None)

# Declare which row to find the header data: 
#    assuming the top one, we set this to zero.
header_row_number = 0

# Read in columns headers
headers = df_raw.iloc[header_row_number].tolist()

# Set new column headers
df_raw.columns = headers

# Filter out only the rows without the headers in them
# We assume that the appearance of the 
# first column header means that row has to be dropped
# And reset index (and drop the old index column)
df = df_raw[df_raw[headers[0]]!=headers[0]].reset_index(drop=True)
df

Solution Output

CypherX