R - Cannot Read File with Control Character [SUB]

Question

I've had this issue before, but my previous solution doesn't fix it.

In my text data, when I show all characters in Notepad++, a character listed as [SUB] appears.

PREVIOUSLY, I deleted these by doing this...

## Read the file in as Binary
r = readBin( curFile, raw(), file.info(curFile)$size)

## Convert the pesky characters
if ((r[1]==as.raw(0x1a)))
{
    ## Find it
    spot = which(r == as.raw(0x1a) )
    r[r == as.raw(0x1a)] = as.raw(0x20)
}

However, this isn't working. It seems like every time I manage to escape an invisible character, within a week, another one causes me a problem. Is there a way to just "clean" a file effectively of all invisible control characters other than the new-lines separating my data entries?

Please let me know. This is maddening already.

Thanks!

I can make a limited CSV file for you all to try. It's the second line, 4th column that causes the crash.

http://www.megafileupload.com/6ead/stackOverflow.csv

The entire code I was using to do this is below....

library(stringr)
############# DO THIS FIRST 
folder = "C:\\Twitter_TimeSeries\\Bernie_Practice\\"

## Get the file name of every file in the directory 
file.names = dir(folder, pattern=".csv")

## Figure out how many files there are
numFiles = length(file.names)

## Loop through every file 
for( i in 1:length(file.names))
{
    ## Which file are we on?
    curFile = paste( folder, file.names[i], sep="" )

    ## Read the file in as Binary
    r = readBin( curFile, raw(), file.info(curFile)$size)

    ## Convert the pesky characters
    if ((r[1]==as.raw(0x1a)))
    {
        ## Find it
        spot = which(r == as.raw(0x1a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } 
    if ((r[1]==as.raw(0x0a))) {
        ## Find it
        spot = which(r == as.raw(0x0a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } ## If 
    ## Re-write the file
    writeBin(r, curFile)
} ## For

curFile = "stackOverflow.csv"
rawData = read.csv(curFile, stringsAsFactors=FALSE)

Possible duplicate of [reading in a text file with a SUB (1a) (Control-Z) character in R on Windows](http://stackoverflow.com/questions/15874619/reading-in-a-text-file-with-a-sub-1a-control-z-character-in-r-on-windows) — crazybilly, Jan 30 '17 at 22:34

score 0 · Answer 1 · answered Mar 06 '16 at 17:00


Try using a regular expression to limit your data to only the allowable characters.

x = read.csv("foo.csv", colClasses="character")
## Keep only digits and '.' in every column, then convert to numeric
## (assumes the file really represents numeric data)
x[] = lapply(x, function(col) as.numeric(gsub("[^0-9.]", "", col)))


Pete Haverty


Thanks for the suggestion. The problem is that I can't even read in the data to do this. R treats the invisible [SUB] character as end-of-file, so the rest of the file is never read in and the regex never gets a chance to run. – Jibril Mar 07 '16 at 04:28
Hmm, sounds like a job for perl. Maybe do perl -pi -e 's/[^0-9.eE\t\n-]//g;' file.tsv first? That should clean up the file, in place, and then you can read it in R. The regex keeps any valid number characters, including scientific notation, plus tabs and newlines, but you may want to experiment a bit with that. – Pete Haverty Mar 08 '16 at 21:10
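
For what the question actually asks (stripping every invisible control character except the line breaks), the same cleanup can probably be done in R itself, without perl, by filtering the raw bytes before parsing. The following is only a sketch under some assumptions: the helper names strip_control_bytes and clean_file are made up for illustration, the file is assumed to be ASCII or UTF-8 (bytes of 0x80 and above are left alone), and tab, LF and CR are assumed to be the only control bytes worth keeping. Unlike the code in the question, it checks every byte rather than only r[1], and it drops the offending bytes instead of replacing them with spaces.

```r
## Hypothetical helper: remove control bytes from a raw vector,
## keeping tab (0x09), line feed (0x0a) and carriage return (0x0d).
strip_control_bytes <- function(bytes) {
  code <- as.integer(bytes)
  keep_anyway <- code %in% c(9L, 10L, 13L)
  ## Control bytes are 0x00-0x1f plus DEL (0x7f); this includes SUB (0x1a)
  is_control <- (code < 32L | code == 127L) & !keep_anyway
  bytes[!is_control]
}

## Hypothetical helper: clean one file in place.
clean_file <- function(path) {
  ## Reading as raw means SUB is never interpreted as end-of-file
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  writeBin(strip_control_bytes(bytes), path)
  invisible(path)
}

## Small demonstration on a temporary file with a SUB inside a field
tmp <- tempfile(fileext = ".csv")
writeBin(charToRaw("a,b\n1,x\032y\n"), tmp)
clean_file(tmp)
rawData <- read.csv(tmp, stringsAsFactors = FALSE)
```

In the loop from the question, clean_file(curFile) would stand in for the readBin / if / writeBin section. Since it overwrites the file, it is worth trying on a copy first.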
