
I have the following code that unzips and concatenates the .csv files of a directory into a new merged .csv file named base_flash.csv:

import os
import pandas as pd

def merge(self):
    # collect every report file in the unzip directory
    file_list = [os.path.join(self.unzipDir, f)
                 for f in os.listdir(self.unzipDir)
                 if f.startswith('relatorio')]
    csv_list = []
    for file in sorted(file_list):
        # read everything as strings and tag each row with its source file
        df = pd.read_csv(file, sep=';', dtype=str, index_col=False,
                         encoding='ansi', on_bad_lines='warn')
        csv_list.append(df.assign(File_Name=os.path.basename(file)))
    csv_merged = pd.concat(csv_list, ignore_index=True)
    csv_merged.to_csv('base_flash.csv', index=False, sep=';', encoding='ansi')
    print(csv_merged)
        

Some files have bad lines with more fields than expected:

b'Skipping line 331: expected 68 fields, saw 70\nSkipping line 343: expected 68 fields, saw 70\nSkipping line 468: expected 68 fields, saw 70\nSkipping line 484: expected 68 fields, saw 70\n'
b'Skipping line 327: expected 68 fields, saw 70\nSkipping line 343: expected 68 fields, saw 70\nSkipping line 415: expected 68 fields, saw 70\n'
b'Skipping line 131: expected 68 fields, saw 70\n'
b'Skipping line 518: expected 68 fields, saw 70\nSkipping line 558: expected 68 fields, saw 70\n'
b'Skipping line 124: expected 68 fields, saw 69\nSkipping line 137: expected 68 fields, saw 69\n'
b'Skipping line 187: expected 68 fields, saw 70\nSkipping line 259: expected 68 fields, saw 70\n'

I have found the problem in the CSV files: some lines contain more ;; pairs than they should, which pushes the columns out of place.

Is there a way to put these bad lines into a separate dataframe so I can deal with them? Or maybe even remove a pair of semicolons to move the columns back into place?

Vawkay
Hard to deal with garbage data like that... (garbage in -> garbage out). If it's predictable you might set the separator to ";" or ";;" (not familiar with pandas). What I usually do when combining files like these is to read/scan the whole file(s) and check for the row with the largest number of fields (columns), then read those files into an array of that size. You are getting the max number of fields, and then padding if short. – pcalkins Jun 22 '22 at 20:47
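A minimal sketch of the padding approach pcalkins describes, translated to pandas (the helper name read_with_padding is only illustrative, and the 'ansi' encoding is taken from the question; it is a Windows-only alias): scan the file once for the widest row, then force read_csv to use that many columns, so shorter rows are padded with NaN instead of being skipped.

import csv
import pandas as pd

def read_with_padding(path, sep=';', encoding='ansi'):
    # first pass: find the widest row in the file
    with open(path, encoding=encoding, newline='') as f:
        width = max(len(row) for row in csv.reader(f, delimiter=sep))
    # second pass: force `width` columns; rows with fewer fields are
    # padded with NaN, so no line is skipped
    return pd.read_csv(path, sep=sep, dtype=str, encoding=encoding,
                       header=0, names=range(width))

Note that header=0 together with names= discards the file's original header row in favour of numbered columns; capture it during the first pass if you still need the column names.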

1 Answer


Starting with pandas 1.4.0, read_csv() lets you handle these situations more gracefully: you can assign a callable to on_bad_lines=, and pandas will invoke it for each malformed line instead of skipping it outright. This link gives some guidance on how to do that: Pandas dataframe read_csv on bad data
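As a minimal sketch of that approach (assuming pandas >= 1.4.0; the callable form of on_bad_lines= requires engine='python', and the file name below is only a placeholder): the callable receives each malformed line as a list of fields and can return a repaired list to keep the row, or None to drop it, so the bad lines can be collected into the separate dataframe the question asks about.

import pandas as pd

bad_rows = []  # malformed lines are collected here

def handle_bad_line(fields):
    # called once for each line whose field count doesn't match the header;
    # returning None drops the line from the main dataframe, while returning
    # a list of the expected length would keep a repaired row instead
    bad_rows.append(fields)
    return None

df = pd.read_csv('relatorio_example.csv', sep=';', dtype=str,
                 encoding='ansi', engine='python',
                 on_bad_lines=handle_bad_line)

bad_df = pd.DataFrame(bad_rows)  # the separate dataframe of bad lines
print(bad_df)

From there you can inspect bad_df, remove the extra empty fields, and append the repaired rows back onto df before writing base_flash.csv.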

jch