Pandas: How to find line breaks in a DF?

Question

I need to find all representations of line breaks to circumvent a problem created by AzureML's designers, which is as follows:

By default (support_multi_line=False), all line breaks,
including those in quoted field values,
will be interpreted as a record break.

Consequently, this design choice is breaking my DF by inflating its records and creating errors in my pipeline.

I have attempted this:

df.replace(to_replace=[r"\\t|\\n|\\r", "\t|\n|\r"], value=["", ""], regex=True, inplace=True)

But it is not working -- line breaks are still being found in my DF -- what else should I be looking for?

score 2 · Answer 1 · answered Nov 09 '21 at 02:01

By default, df.replace() searches for whole values in all rows and columns of the dataframe and replaces those values with the specified values. It doesn't replace parts of strings unless regex=True is set.
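A quick way to check how df.replace() treats whole values versus substrings is to run it on a toy frame. This is only a sketch; the column name and the sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: the first cell is exactly "\n",
# the second cell only contains it.
df = pd.DataFrame({"text": ["\n", "first\nsecond"]})

# Without regex=True, only cells whose whole value equals "\n" are replaced.
whole = df.replace("\n", "")

# With regex=True, the pattern is substituted wherever it matches in a string.
partial = df.replace(r"\n", "", regex=True)

print(whole["text"].tolist())    # ['', 'first\nsecond']
print(partial["text"].tolist())  # ['', 'firstsecond']
```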

You're looking for df[column].str.replace:

# regex=True is needed so the pattern is treated as a regular expression, not a literal string
df[column] = df[column].str.replace('[\n\r\t]|\\\\[nrt]', '', regex=True)
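If line breaks still show up after cleaning, it may help to locate the offending cells first. The sketch below rests on an assumption: that the leftovers are other characters Python's str.splitlines() treats as line boundaries (vertical tab, form feed, \x1c to \x1e, \x85, \u2028, \u2029). Whether AzureML treats those as record breaks is not confirmed anywhere in this thread. The names LINE_BREAKS, find_line_breaks and strip_line_breaks are hypothetical helpers, not pandas API.

```python
import re

import pandas as pd

# Assumption: besides \n and \r, also count the other characters that
# str.splitlines() recognises as line boundaries.
LINE_BREAKS = re.compile(r"[\n\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029]")


def find_line_breaks(df: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean mask marking string cells that contain a line break."""
    def has_break(value):
        return isinstance(value, str) and LINE_BREAKS.search(value) is not None
    return df.apply(lambda col: col.map(has_break))


def strip_line_breaks(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with line breaks removed from string cells only."""
    def strip_cell(value):
        return LINE_BREAKS.sub("", value) if isinstance(value, str) else value
    return df.apply(lambda col: col.map(strip_cell))


# Made-up sample data for illustration.
df = pd.DataFrame({"id": [1, 2, 3], "text": ["plain", "a\nb", "c\u2028d"]})

mask = find_line_breaks(df)
print(df[mask.any(axis=1)])  # rows that contain at least one line break

clean = strip_line_breaks(df)
```

Checking each cell with isinstance leaves non-string values (numbers, NaN) untouched, at the cost of being slower than the vectorised .str methods on large frames.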


Apologies for the delay -- just got around to testing. This strategy takes care of fewer line breaks than the above `[r"\\t|\\n|\\r", "\t|\n|\r"]` (i.e., the problem is slightly worse). – John Stud Nov 09 '21 at 03:23
Oh really? I'm surprised to hear that. Will you please send a sample of your data in the question, particularly the rows that contain the bad data? I'd be able to help better if I could see what data you're actually dealing with – Nov 09 '21 at 05:04
