Spark reader splits records when it encounters control characters

Question

I am trying to read a file using the Spark reader. The reader splits records in the file when it encounters control characters such as ^M, ^H, ^O, and ^P.

To debug the issue, I am trying to manually remove the control characters from the file and test the record length in the Spark shell.

I tried to remove all control characters and check the record length:

sed -i 's/^[:print:]/ /g' <filename>

I found that it also replaces punctuation characters such as ? with a space. Please suggest a command that replaces all control characters with a space.

Have you tried `tr '[:cntrl:]' ' '`? – oguz ismail, Dec 27 '20 at 11:49

score 1 · Answer 1 · answered Dec 27 '20 at 13:01

Outside of a bracket expression, ^ means the start of the string. Likewise, [:print:] is only a POSIX character class when it appears inside a bracket expression. On its own it does not match printable chars; it is an ordinary bracket expression that matches any one of the chars :, p, r, i, n, t.

You can use

sed -i 's/[^[:print:]]/ /g' <filename>

It will replace every non-printable char with a literal space char.
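
To check the behaviour before editing a file in place, the same substitution can be tried on a small sample. This is only a sketch: the function name strip_nonprint and the sample string are made up for illustration, and LC_ALL=C is an assumption that makes sed work on single bytes, which means any non-ASCII (for example UTF-8) bytes would also be turned into spaces.

```shell
# Hypothetical helper: replace every non-printable character on stdin
# with a space. Newlines are kept, because sed treats them as line
# terminators and never sees them in the pattern space.
strip_nonprint() {
    # LC_ALL=C is an assumption: it restricts [:print:] to printable ASCII.
    LC_ALL=C sed 's/[^[:print:]]/ /g'
}

# Sample line containing ^M (\r) and ^H (\b); the ? must survive.
printf 'ab\rcd\bef?\n' | strip_nonprint
# prints: ab cd ef?
```

If the output looks right, the same expression can be run with sed -i against the real file.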
