
I am trying to read a file using the Spark reader. The Spark reader splits the records in the file when it encounters control characters like ^M, ^H, ^O, ^P.

To debug the issue, I am trying to manually remove the control characters from the file and test the record length with the Spark shell.

I tried to remove all control characters and check the record length:

sed -i 's/^[:print:]/ /g' <filename>

I found that it is also replacing punctuation characters like ? with a space. Please suggest a command that will replace all control characters with spaces.

Wiktor Stribiżew
Learner

1 Answer


The ^, when used outside of a bracket expression, means the start of a string. The [:print:] POSIX character class only works inside a bracket expression. Outside of one, it does not match any printable char; it matches a single :, p, r, i, n or t char.

You can use

sed -i 's/[^[:print:]]/ /g' <filename>

It will replace every non-printable char with a literal space char.
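To check the behaviour before editing a file in place, here is a minimal sketch that runs the same substitution on a made-up line. The `sample` helper and its contents are hypothetical, and `LC_ALL=C` is an assumption added only to make the result byte-oriented and repeatable. The second command uses the POSIX `[:cntrl:]` class, which may be closer to what the question asks for, since `[^[:print:]]` in the C locale also matches bytes above 0x7F.

```shell
# Hypothetical sample line containing ^M (\r), ^H (\b) and ^O (\017).
sample() { printf 'ab\rcd\bef\017gh?\n'; }

# Negated [:print:] inside a bracket expression:
# every non-printable character becomes a space, "?" is left alone.
fixed=$(sample | LC_ALL=C sed 's/[^[:print:]]/ /g')
printf '%s\n' "$fixed"    # prints: ab cd ef gh?

# Alternative (assumption: only control characters should be replaced).
cntrl=$(sample | LC_ALL=C sed 's/[[:cntrl:]]/ /g')
printf '%s\n' "$cntrl"    # prints: ab cd ef gh?
```

Once the output looks right, the same expression can be used with `-i` on the real file.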

Wiktor Stribiżew