
I have several CSV files that I process with pandas. I would like to remove rows that have more than 3 columns.

How can I proceed?

Thank you

Edit1

id                                      ocr         raw_value
4a82a357-99e7-49e6-85b6-b2f6a27b8d5f    OMNIPAGE    Terms        em
8b549fef-0cda-4af5-8239-35153c33ffbc    OMNIPAGE    price   
52ffe66a-b1ab-4b22-9b26-c298d53c951c    OMNIPAGE    Renseignements  
507a0d96-9481-4b3f-8c35-f16588bedc0b    OMNIPAGE    pour    
52e171dc-8d22-4162-b748-692b2fc11659    OMNIPAGE    Client  
c40a7e9f-1ec4-4cac-87e8-02ed0f335fe9    OMNIPAGE    5           client
4a936ed7-c082-4f46-9fa1-761a1525e2df    OMNIPAGE    SAS 
4b78130e-b099-400c-b7bf-6470e0519783    OMNIPAGE    des 
4d5c6297-1c79-42f9-b4ea-929a9abfb3f7    OMNIPAGE    431 
829d8bf5-b251-4bb1-82d8-0e912ab64e8e    OMNIPAGE    59  102
5ed5b74d-efc5-49fa-9b12-dbe3ca88995f    OMNIPAGE    votre   votre
58d26125-1120-4328-83c4-7f5b0135184d    OMNIPAGE    Crécy,  Crécy,

In this example, the first and sixth rows should be removed because they each have an extra column (em and client, respectively).

vincent

2 Answers


If the only possible error is an extra column, pass this argument to pd.read_csv:

error_bad_lines=False

error_bad_lines : boolean, default True
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines” will be dropped from the DataFrame that is returned. (Only valid with C parser)

Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on current versions the equivalent is on_bad_lines='skip'.
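As a rough illustration of the option described above, here is a minimal sketch that uses the newer on_bad_lines="skip" spelling (pandas 1.3 or later). The sample data, the comma separator, and the helper name read_skipping_long_rows are assumptions made for the example, not taken from the question. The malformed row is deliberately placed in the middle of the file; behavior when the very first data row is the long one may differ, so it is worth checking on real data.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has a fourth field ("em").
SAMPLE = (
    "id,ocr,raw_value\n"
    "a1,OMNIPAGE,price\n"
    "b2,OMNIPAGE,Terms,em\n"
    "c3,OMNIPAGE,pour\n"
)


def read_skipping_long_rows(source):
    """Read a CSV, silently dropping rows that have too many fields."""
    # on_bad_lines="skip" replaces error_bad_lines=False in pandas >= 1.3.
    return pd.read_csv(source, on_bad_lines="skip")


df = read_skipping_long_rows(io.StringIO(SAMPLE))
print(df)
```

For a real file, a path can be passed instead of the StringIO object, together with sep="\t" if the file is tab-separated.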

Rahul Chawla

CSV files are supposed to have a fixed number of columns. pandas is not a CSV format validator (even if it is able to handle a few mistakes). If you have an improperly formatted CSV (in your case, with a variable number of columns in each row), you should validate it before feeding it to pandas.

For example: https://pypi.python.org/pypi/csvvalidator

Alternatively, writing the validation code yourself is fairly trivial.
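A do-it-yourself filter could look something like the sketch below, which uses only the standard csv module. The function name keep_rows_with_max_fields, the comma delimiter, and the shortened sample ids are assumptions for illustration; the limit of 3 fields comes from the question.

```python
import csv
import io


def keep_rows_with_max_fields(text, max_fields=3, delimiter=","):
    """Return CSV text containing only rows with at most max_fields fields."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Skip blank lines and any row that has extra fields.
        if row and len(row) <= max_fields:
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample: the first and third data rows carry an extra field.
SAMPLE = (
    "id,ocr,raw_value\n"
    "a1,OMNIPAGE,Terms,em\n"
    "b2,OMNIPAGE,price\n"
    "c3,OMNIPAGE,5,client\n"
)

cleaned = keep_rows_with_max_fields(SAMPLE)
print(cleaned)
```

The cleaned text can then be handed to pd.read_csv through io.StringIO(cleaned). Unlike relying on the parser to skip bad lines, this approach also drops a malformed first data row, since every row is checked against the same explicit limit.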

Guillaume