
I receive large CSV files, delimited with comma, | or ^, containing millions of records.
Some of the fields contain non-printable characters like CR|LF, which get interpreted as the end of a record. This is on Windows 10.

I need to write Python to go through the file and remove the CR|LF sequences inside fields. However, I can't remove all of them, because then the real lines would be merged.

I have gone through several postings here on removing non-printable characters. My thought was to load the file into a pandas DataFrame, then check every field for CR|LF and remove it, but that seems a bit complicated. If you have quick code for doing this, it would be a great help.

Thanks in advance.

Sample file:

record1, 111. texta, textb CR|LF
record2, 111. teCR|LF
xta, textb CR|LF
record3, 111. texta, textb CR|LF

Sample output file should be:

record1, 111. texta, textb CR|LF
record2, 111. texta, textb CR|LF
record3, 111. texta, textb CR|LF

CR = Carriage Return = x0d; LF = Line Feed = x0a
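In Python, these two bytes map to the usual escape sequences, which can be checked directly:

```python
# CR (Carriage Return) is 0x0D, LF (Line Feed) is 0x0A
cr = "\x0d"
lf = "\x0a"

print(cr == "\r", lf == "\n")  # CR is '\r', LF is '\n'
print(repr(cr + lf))           # together they form the Windows line ending '\r\n'
```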

dan1197

2 Answers


EDIT -- Wed Sep 18, 23:46 UTC+2

NOTE:

record1|111. texta|textb|111CR|LF
record2|111. teCR|LF 
xta|text|111CR|LF
record3|111. texta|textb|111CR|LF

This is the file we are going to analyze.


Since this is a CSV file, we can be fairly sure that the data type is consistent across rows for a given column.

Based on this assumption, we can match the line terminator in scope (CR|LF) with a regexp (\|\d+CR\|LF).

If the regexp does not match, we can remove the carriage return, because it is not the end of a line.

import pandas as pd
from io import StringIO
import re

# Match the pattern `|<number>CR|LF` that marks a real end of record
pattern = re.compile(r"\|\d+CR\|LF")
# Open the file and read the content
with open("a.txt") as f:
    data = f.readlines()
not_parsed = data.copy()
_max = len(data)
i = 0
parsed_data = []
# Iterate the data
while i < _max:
    # Remove unnecessary new line
    line = data[i].strip()
    # If the pattern does not match, we need to strip the carriage return
    if not pattern.search(line) and i + 1 < _max:
        line = line.replace("CR|LF", "").strip()
        line = line + data[i + 1].strip()
        i += 1
    if line != "":
        parsed_data.append(line)
    i += 1
# Remove any remaining CR|LF markers from the joined lines
data = [line.replace("CR|LF", "") for line in parsed_data]
# Load the csv using pandas

print("DATA BEFORE -> {}".format("".join(not_parsed)))
print("DATA NOW -> {}".format("\n".join(data)))
DATA = StringIO("\n".join(data))
df = pd.read_csv(DATA, delimiter="|")

How does this remove the unwanted CR|LF, but leave the wanted ones?

The input file is not modified; instead, its content is read into a list of lines. We then remove the carriage return only on the lines where the regexp does not match, and finally load the result as a DataFrame.
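As a minimal sketch of that check, using the same regexp as above on the sample lines:

```python
import re

# A real record terminator looks like `|<number>CR|LF`
pattern = re.compile(r"\|\d+CR\|LF")

complete_line = "record1|111. texta|textb|111CR|LF"
broken_line = "record2|111. teCR|LF"  # record split mid-field

print(bool(pattern.search(complete_line)))  # True  -> keep the line break
print(bool(pattern.search(broken_line)))    # False -> strip CR|LF and join the next line
```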

NOTE: tested on Linux, which uses \n as the newline character.


alessiosavi
  • How does this remove the unwanted CL|RF, but leave the wanted ones? – Scott Hunter Sep 18 '19 at 17:47
  • Uh sorry Sir! I completely miss the point! Please, upload some lines of the line. I think that in this case we have to tune the algorithm for the specific purpose. We can choose to replace or not the "CR|LF" based on the '\|\d+CR\|LF' regexp. The example with the string instead of the number as the file drive me out of the road :D. Please upload just a few lines for verify – alessiosavi Sep 18 '19 at 21:16
  • 3255|8|2|24582|1|11826 41st, Dickinson|1|39|15855|Y|559|2018-10-15 10:13:33.000|890|2019-02-15 10:54:56.000|16482|Stark|58601 3383|13|1|22386|1|4413 Hwy 19 Esmond, NJ 58332|1|39|15880|Y|461|2017-10-11 11:40:15.000|||16440|Bson|332 3383|16|1|22386|1|4413 Hwy 19 Esmond, NJ 58332|1|39|15880|Y|461|2017-10-11 11:40:15.000|||16440|Bson|332 3460|7|1|2700|1|NW 1/4 Section 19-155-84|1|39|15817|Y|832|2015-10-09 13:20:44.000|832|2015-10-09 13:21:44.000|16488| |58722 – dan1197 Sep 18 '19 at 21:37
  • @alessiosavi Thanks much. Test on windows 10 cmd errors: >>> pattern = re.compile("\|\d+CR\|LF") >>> # Open the file and read the content ... with open("text3.txt") as f: ... data = f.readlines() ... _max = len(data) File "", line 4 _max = len(data) ^ SyntaxError: invalid syntax >>> i = 0 >>> parsed_data = [] >>> # Iterate the data ... while i < _max: ... # Remove unnecessary new line ... line = data[i].strip() ... # If the pattern does not match, we need to strip the carriage return ... if not pattern.search(line): – dan1197 Sep 18 '19 at 22:18
  • @Scott Hunter that is the main challenge I am facing. The csv file delimited with '|'. So, I read the CSV using panda and provided delimiter and give column names. But if I print the column names, it splits the records with CR|LF to 2 records! So, I need to have logic to work between delimiters and find CR|LF to remove – dan1197 Sep 18 '19 at 22:41
  • Save in a file and run as `python file.py` – alessiosavi Sep 18 '19 at 22:42
  • @alessiosavi Thanks File "pandas\_libs\parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows File "pandas\_libs\parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error pandas.errors.ParserError: Error tokenizing data. C error: Expected 22 fields in line 3, saw 28 – dan1197 Sep 19 '19 at 01:34
  • @alessiosavi CR = carriage Return = x0d LF = Line Feed = x0a So, your code stumbling as it sees them as separate lines because of line feed. – dan1197 Sep 20 '19 at 05:11

Run this script (e.g. name it fix_csv.py) on your file to sanitize it:

#!/usr/bin/env python3

import sys

if len(sys.argv) < 3:
    sys.stderr.write('Please give the input filename and an output filename.\n')
    sys.exit(1)

# set the correct number of fields
nf = 3
# set the delimiter
delim = ','

inpf = sys.argv[1]
outf = sys.argv[2]

# Write a plain '\n'; text mode converts it to the platform's line ending.
# (Writing os.linesep in text mode would produce '\r\r\n' on Windows.)
newline = '\n'

with open(inpf, 'r') as inf, open(outf, 'w') as of:
    cache = []
    for line in inf:
        line = line.strip()
        ls = line.split(delim)
        if len(ls) < nf or cache:
            if not cache:
                cache = ls
            else:
                # glue the two halves of the broken field back together
                cache[-1] += ls[0]
                cache.extend(ls[1:])
            if len(cache) == nf:
                of.write(delim.join(cache) + newline)
                cache = []
        else:
            of.write(line + newline)

Call it like

python fix_csv.py input.dat output.dat

Output:

record1, 111. texta, textb
record2, 111. texta, textb
record3, 111. texta, textb
Jan Christoph Terasa