
Following up on my earlier question here: Row limit in read.table.ffdf?

I have a text file with >285 million records, but about two-thirds of the way through there are several non-ASCII characters that AWK, as well as several R packages (ff, data.table), interpret as EOF bytes. It appears that the characters were originally entered as degree signs, but they show up in text editors as boxes (see example here). When I try to read the file with any of these tools, the read simply stops at the first such character, with no error message, as if the file had ended.

For now I was able to open the file in a text editor to remove these characters. But this is not a long-term solution for this dataset given its size; I need to be able to remove or bypass them without having to open the whole file. I've tried using the quote option in R, and tried replacing all non-ASCII and 'CTRL-M' characters specifically during an awk import, but the read process always stops at the first character. Any solutions? I'm using R and awk now, but am open to other options (python?). Thanks!
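
For reference, the awk filter I tried was roughly along these lines (`bigfile.txt` stands in for the actual file); as noted, the read still stopped at the first offending byte:

# drop carriage returns and any other non-printing bytes from each record
awk '{ gsub(/\r/, ""); gsub(/[^[:print:][:blank:]]/, "") } 1' bigfile.txt > bigfile_clean.txt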

  • figure out a way to grab that offending line, and then pipe that output through `xxd` so we can see what the actual data is (just the line with non-ASCIIs; a sketch of this is below the comments). Add that data to your Q using the `{}` tool at the top left of the edit tool. As is, we're left to guess and play 20 questions. Good luck. – shellter May 24 '16 at 02:01
  • Does `gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'` remove them? – Ed Morton May 24 '16 at 03:18
  • Thanks @Ed Morton, that seems to have done the trick! – Michel May 25 '16 at 18:24
  • OK, I posted it as an answer so you can accept it. – Ed Morton May 25 '16 at 19:15
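
A minimal sketch of shellter's suggestion, assuming the bad bytes sit on line 190000000 of `bigfile.txt` (both the line number and the file name are placeholders). On Windows, `BINMODE=3` makes gawk read and write in binary mode, so a stray EOF-like byte doesn't cut the input short:

# print only the suspect record and stop, then dump its raw bytes
gawk -v BINMODE=3 'NR == 190000000 { print; exit }' bigfile.txt | xxd | head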

1 Answer

gawk -v BINMODE=3 '{gsub(/[[:cntrl:]]/,"")}1'

will remove them.
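
For the whole file, one way to apply it (file names are placeholders) is to write a cleaned copy that `read.table.ffdf` or `fread` can then ingest:

# strip every control character (including the EOF-like bytes and CTRL-M),
# reading and writing in binary mode
gawk -v BINMODE=3 '{ gsub(/[[:cntrl:]]/, "") } 1' bigfile.txt > bigfile_clean.txt

Note that `[[:cntrl:]]` also matches tabs, so a tab-delimited file would need a narrower character class.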

Ed Morton