
Following up on my earlier question here: Row limit in read.table.ffdf?

I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that AWK, as well as several R packages (ff, data.table), interpret as EOF bytes. It appears that the characters were originally entered as degree signs, but they show up in text editors as boxes (see example here). When I try to read the file with any of these tools, the read simply stops at the first such character, with no error message, as if the file had ended.

For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!
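
For reference, the awk filter I tried was roughly along these lines (`bigfile.txt` stands in for the actual file); as noted, the read still stopped at the first offending byte:

# drop carriage returns and any other non-printing bytes from each record
awk '{ gsub(/\r/, ""); gsub(/[^[:print:][:blank:]]/, "") } 1' bigfile.txt > bigfile_clean.txt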

  • figure out a way to grab that offending line, and then pipe that output through `xxd` so we can see what the actual data is (just the line with non-ASCIIs; a sketch of this is below the comments). Add that data to your Q using the `{}` tool at the top left of the edit tool. As is, we're left to guess and play 20 questions. Good luck. – shellter May 24 '16 at 02:01
  • Does `gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'` remove them? – Ed Morton May 24 '16 at 03:18
  • Thanks @Ed Morton, that seems to have done the trick! – Michel May 25 '16 at 18:24
  • OK, I posted it as an answer so you can accept it. – Ed Morton May 25 '16 at 19:15
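
A minimal sketch of shellter's suggestion, assuming the bad bytes sit on line 190000000 of `bigfile.txt` (both the line number and the file name are placeholders). On Windows, `BINMODE=3` makes gawk read and write in binary mode, so a stray EOF-like byte doesn't cut the input short:

# print only the suspect record and stop, then dump its raw bytes
gawk -v BINMODE=3 'NR == 190000000 { print; exit }' bigfile.txt | xxd | head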

1 Answer

gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'

will remove them.
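
For the whole file, one way to apply it (file names are placeholders) is to write a cleaned copy that `read.table.ffdf` or `fread` can then ingest:

# strip every control character (including the EOF-like bytes and CTRL-M),
# reading and writing in binary mode
gawk -v BINMODE=3 '{ gsub(/[[:cntrl:]]/, "") } 1' bigfile.txt > bigfile_clean.txt

Note that `[[:cntrl:]]` also matches tabs, so a tab-delimited file would need a narrower character class.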

Ed Morton