I need to automate importing a few large text files into my Python script for some data wrangling and analysis. I need to remove or remedy specific instances of non-printable characters so that pandas' read_csv() stops splitting strings into new rows. I'm using Jupyter Notebook 5.0.0, and this is the latest version I will have available.
I have multiple text files with 800,000+ rows of strings over 500 characters in length. These strings contain a wide range of characters, and some rows are handled in an undesirable way because a '\n' has been inserted mid-string.
I have got around this problem as a short-term manual fix by running the following PowerShell against each file:
#Encoding Unicode
(Get-Content file.txt -Raw).replace("`nabc", 'abc') | Set-Content Newfile.txt
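For reference, the rough Python equivalent of that PowerShell fix would be something like the sketch below; the filenames mirror the ones above, 'abc' is just a placeholder for the token that follows the stray newline, and I'm assuming the files really are UTF-16:

with open('file.txt', 'r', encoding='utf-16') as src:
    text = src.read()

# drop the stray newline in front of the placeholder token
text = text.replace('\nabc', 'abc')

with open('Newfile.txt', 'w', encoding='utf-16') as dst:
    dst.write(text)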
I've tried using the open('filename.txt', 'rb').read() method with line-by-line output, but the output is too large. I have incrementally increased the data rate limit in the jupyter_notebook_config.py file, but that just makes Jupyter unresponsive and it crashes without producing the goodies.
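What I'm really after is something along these lines: stream the file and stitch the broken rows back together instead of printing them, so the notebook's output limit never comes into play. This is only a sketch; the filenames are placeholders and it assumes '~' really does mark the end of each record, as in the read_csv attempt below:

buffer = ''
with open('filename.txt', 'r', encoding='utf-16') as src, \
        open('filename_clean.txt', 'w', encoding='utf-16') as dst:
    for line in src:
        buffer += line.rstrip('\n')          # drop the stray mid-record newline
        if buffer.endswith('~'):             # real end of record reached
            dst.write(buffer[:-1] + '\n')    # swap the '~' for a real newline
            buffer = ''
    if buffer:                               # flush any trailing partial record
        dst.write(buffer + '\n')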
I have tried:
import pandas as pd
pd.read_csv('filename.txt', sep='\t', header=None, lineterminator='~', encoding='utf-16')
Basically the text file has already split the row at the '\n' point, and pandas won't override that formatting or read the row in as raw text.
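Ideally I'd like to do the repair in memory and hand the result straight to pandas, something like the sketch below (again assuming UTF-16 files and '~' record terminators; the filename is a placeholder):

import io
import pandas as pd

with open('filename.txt', 'r', encoding='utf-16') as f:
    raw = f.read()

# remove every stray '\n', then restore real line breaks at the '~' record ends
repaired = raw.replace('\n', '').replace('~', '\n')

df = pd.read_csv(io.StringIO(repaired), sep='\t', header=None)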