
I can't get this to work. I want to replace every two-character occurrence in the first field of a CSV file with that occurrence plus an appended X, with whitespace removed. For example, SA and SA followed by whitespace should both map to SAX in the new file. Below is what I tried with sed (with help from an earlier question):

system( paste("sed ","'" ,'  s/^GG/GGX/g; s/^GG\\s/GGX/g;  s/^GP/GPX/g;
 s/^GP\\s/GPX/g; s/^FG/FGX/g; s/^FG\\s/FGX/g; s/^SA/SAX/g; s/^SA\\s/SAX/g; 
 s/^TP/TPX/g; s/^TP\\s/TPX/g   ',"'",' ./data/concat_csv.2 >     
./data/concatenated_csv.2 ',sep=''))

I tried using the sQuote() function, but that doesn't help either. read.csv has trouble with the file because some lines have too many separators and others have too few.

I could try reading in and editing the file in pieces, but I don't know how to do that as a streaming process.

I really just want to edit the first field of the file using a system() call. The file is about 30GB.

Yoda
  • Please define "too large" and "too complicated". R has packages to deal with large files, and there are tons of filters for `read.table` or `scan`. – Carl Witthoft Jan 24 '13 at 14:06

1 Answer


Try the following on a sample line like this:

echo "fi,second,third" | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}' 

You would then call it from R by passing the following as the input to the paste() call:

cat fileNameToBeRead | awk '{len = split($0,array,","); str = ""; for (i = 1; i <= len; ++i) if (i == 1) { m = split(array[i],array2,""); if (m == 2) {str = array[i]"X";} else {str = array[i]};} else str = str","array[i]; print str;}' > newFile

This code won't handle your whitespace requirement, though. Could you provide examples that demonstrate the sort of behavior you're looking for?
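If the whitespace in question turns out to be just spaces or tabs inside the first field, a variant along these lines might cover it. This is only a sketch under that assumption: `append_x` is a hypothetical helper name, and it treats every comma as a field separator, which is safe here because only the first field is changed and the line is rebuilt with the same commas.

```shell
# Hypothetical helper: reads CSV lines on stdin, writes edited lines to stdout.
# Assumption: "whitespace" means spaces or tabs inside the first field.
append_x() {
  awk 'BEGIN { FS = OFS = "," }            # split and rejoin on commas
       {
         gsub(/[ \t]+/, "", $1)             # strip whitespace from field 1 only
         if (length($1) == 2) $1 = $1 "X"   # append X to two-character codes
         print                              # other fields pass through unchanged
       }'
}
```

It streams line by line, so the 30GB file is never held in memory. With the paths from the question it would be run as `append_x < ./data/concat_csv.2 > ./data/concatenated_csv.2`, and the same awk command can be handed to system() from R as a single string.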

Aditya Sihag