Following up on my earlier question here: Row limit in read.table.ffdf?
I have a text file with more than 285 million records, but about two-thirds of the way through it there are several non-ASCII characters that AWK, as well as several R packages (ff, data.table), interpret as EOF bytes. The characters appear to have originally been entered as degree signs, but they show up in text editors as boxes (see example here). When I try to read the file with any of these tools, the read simply stops at the first such character, with no error message, as if the file were complete.
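For reference, the read looks roughly like the following (the file name, separator, and header settings are simplified placeholders for my actual call); it returns silently well before the end of the file:

```r
library(ff)

# Simplified sketch of the read that stops early; "records.txt" and the
# sep/header settings stand in for the real ones.
dat <- read.table.ffdf(file = "records.txt", header = TRUE, sep = "\t")
nrow(dat)  # far fewer rows than expected, with no warning or error
```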
For now, I worked around it by opening the file in a text editor and removing the characters by hand. But given the dataset's size, that isn't a long-term solution; I need to be able to remove or bypass them without opening the whole file. I've tried the quote option in R, and I've tried stripping all non-ASCII characters (and CTRL-M specifically) during an awk import, but the read still stops at the first offending character. Any solutions? I'm using R and awk now, but am open to other options (Python?). Thanks!
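For what it's worth, the kind of pre-processing pass I'm imagining is one that streams the file in blocks and drops the offending bytes before any reader sees them. A rough R sketch of that idea (file names and the block size are placeholders, and it assumes the data are plain ASCII apart from the bad bytes):

```r
# Rough sketch: stream the file in binary blocks and keep only printable
# ASCII plus tab/newline/carriage return, writing the result to a new file.
infile  <- file("records.txt", "rb")
outfile <- file("records_clean.txt", "wb")
block   <- 10 * 1024^2  # bytes per chunk; adjust for available memory

repeat {
  bytes <- readBin(infile, what = "raw", n = block)
  if (length(bytes) == 0L) break
  vals <- as.integer(bytes)
  keep <- (vals >= 0x20 & vals <= 0x7E) | vals %in% c(0x09, 0x0A, 0x0D)
  writeBin(bytes[keep], outfile)
}

close(infile)
close(outfile)
```

If there's a cleaner way to do this, or a way to make the readers skip these bytes directly, I'd prefer that.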