
I receive large CSV files, delimited with comma, | or ^, containing millions of records.
Some of the fields contain non-printable characters like CR|LF, which get interpreted as the end of a record. This is on Windows 10.

I need to write Python to go through the file and remove the CR|LF sequences inside fields. However, I can't remove all of them, because then the real lines would be merged.

I have gone through several postings here on removing non-printable characters. My thought was to load the file into a pandas DataFrame, then check every field for CR|LF and remove it, but that seems a bit complicated. If you have quick code for doing this, it would be a great help.

Thanks in advance.

Sample file:

record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF

Sample output file should be:

record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF

CR = Carriage Return = x0d; LF = Line Feed = x0a
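In Python, these two bytes map to the usual escape sequences, which can be checked directly:

```python
# CR (Carriage Return) is 0x0D, LF (Line Feed) is 0x0A
cr = "\x0d"
lf = "\x0a"

print(cr == "\r", lf == "\n")  # CR is '\r', LF is '\n'
print(repr(cr + lf))           # together they form the Windows line ending '\r\n'
```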

dan1197

2 Answers


EDIT -- Wed Sep 18, 23:46 UTC+2

NOTE:

record1|111. texta|textb|111CR|LF
record2|111. teCR|LF 
xta|text|111CR|LF
record3|111. texta|textb|111CR|LF

This is the file we are going to analyze.


Since this is a CSV file, we can be fairly sure that the data type is consistent across rows for a given column.

Based on this assumption, we can match the line terminator in scope (CR|LF) with a regexp (\|\d+CR\|LF).

If the regexp does not match, we can remove the carriage return, because it is not the end of a line.

import pandas as pd
from io import StringIO
import re

# Match the pattern `|<number>CR|LF` that marks a real end of record
pattern = re.compile(r"\|\d+CR\|LF")
# Open the file and read the content
with open("a.txt") as f:
    data = f.readlines()
not_parsed = data.copy()
_max = len(data)
i = 0
parsed_data = []
# Iterate the data
while i < _max:
    # Remove unnecessary new line
    line = data[i].strip()
    # If the pattern does not match, we need to strip the carriage return
    if not pattern.search(line) and i + 1 < _max:
        line = line.replace("CR|LF", "").strip()
        line = line + data[i + 1].strip()
        i += 1
    if line != "":
        parsed_data.append(line)
    i += 1
# Remove any remaining CR|LF markers from the joined lines
data = [line.replace("CR|LF", "") for line in parsed_data]
# Load the csv using pandas

print("DATA BEFORE -> {}".format("".join(not_parsed)))
print("DATA NOW -> {}".format("\n".join(data)))
DATA = StringIO("\n".join(data))
df = pd.read_csv(DATA, delimiter="|")

How does this remove the unwanted CR|LF, but leave the wanted ones?

The input file is not modified; instead, its content is read into a list of lines. We then remove the carriage return only on the lines where the regexp does not match, and finally load the result as a DataFrame.
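As a minimal sketch of that check, using the same regexp as above on the sample lines:

```python
import re

# A real record terminator looks like `|<number>CR|LF`
pattern = re.compile(r"\|\d+CR\|LF")

complete_line = "record1|111. texta|textb|111CR|LF"
broken_line = "record2|111. teCR|LF"  # record split mid-field

print(bool(pattern.search(complete_line)))  # True  -> keep the line break
print(bool(pattern.search(broken_line)))    # False -> strip CR|LF and join the next line
```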

NOTE: tested on Linux, which uses \n as the newline character.


alessiosavi
  • How does this remove the unwanted CL|RF, but leave the wanted ones? – Scott Hunter Sep 18 '19 at 17:47
  • Uh sorry Sir! I completely miss the point! Please, upload some lines of the line. I think that in this case we have to tune the algorithm for the specific purpose. We can choose to replace or not the "CR|LF" based on the '\|\d+CR\|LF' regexp. The example with the string instead of the number as the file drive me out of the road :D. Please upload just a few lines for verify – alessiosavi Sep 18 '19 at 21:16
  • 3255|8|2|24582|1|11826 41st, Dickinson|1|39|15855|Y|559|2018-10-15 10:13:33.000|890|2019-02-15 10:54:56.000|16482|Stark|58601 3383|13|1|22386|1|4413 Hwy 19 Esmond, NJ 58332|1|39|15880|Y|461|2017-10-11 11:40:15.000|||16440|Bson|332 3383|16|1|22386|1|4413 Hwy 19 Esmond, NJ 58332|1|39|15880|Y|461|2017-10-11 11:40:15.000|||16440|Bson|332 3460|7|1|2700|1|NW 1/4 Section 19-155-84|1|39|15817|Y|832|2015-10-09 13:20:44.000|832|2015-10-09 13:21:44.000|16488| |58722 – dan1197 Sep 18 '19 at 21:37
  • @alessiosavi Thanks much. Test on windows 10 cmd errors: >>> pattern = re.compile("\|\d+CR\|LF") >>> # Open the file and read the content ... with open("text3.txt") as f: ... data = f.readlines() ... _max = len(data) File "", line 4 _max = len(data) ^ SyntaxError: invalid syntax >>> i = 0 >>> parsed_data = [] >>> # Iterate the data ... while i < _max: ... # Remove unnecessary new line ... line = data[i].strip() ... # If the pattern does not match, we need to strip the carriage return ... if not pattern.search(line): – dan1197 Sep 18 '19 at 22:18
  • @Scott Hunter that is the main challenge I am facing. The csv file delimited with '|'. So, I read the CSV using panda and provided delimiter and give column names. But if I print the column names, it splits the records with CR|LF to 2 records! So, I need to have logic to work between delimiters and find CR|LF to remove – dan1197 Sep 18 '19 at 22:41
  • Save in a file and run as `python file.py` – alessiosavi Sep 18 '19 at 22:42
  • @alessiosavi Thanks File "pandas\_libs\parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas\_libs\parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 22 fields in line 3, saw 28 – dan1197 Sep 19 '19 at 01:34
  • @alessiosavi CR = carriage Return = x0d LF = Line Feed = x0a So, your code stumbling as it sees them as separate lines because of line feed. – dan1197 Sep 20 '19 at 05:11

Run this script (e.g. name it fix_csv.py) on your file to sanitize it:

#!/usr/bin/env python3

import sys

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]

# Write a plain '\n'; text mode converts it to the platform's line ending.
# (Writing os.linesep in text mode would produce '\r\r\n' on Windows.)
newline = '\n'

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []
    for line in inf:
        line = line.strip()
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                cache = ls
            else:
                # glue the two halves of the broken field back together
                cache[-1] += ls[0]
                cache.extend(ls[1:])
            if len(cache) == nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)

Call it like

python fix_csv.py input.dat output.dat

Output:

record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
Jan Christoph Terasa