
I have a data frame and I want to remove all the rows starting with #. Can anybody tell me how to do it? Thanks in advance.

#ID_REF = The name of the probe set, blank for control probes           
    #VALUE = The signal value calculated by MAS5, normalized            
    #ABS_CALL = The detection value calculated by the MAS5          
    #DETECTION P-VALUE = The detection p-value calculated by the MAS5           
    *ID_REF**   VALUE** ABS_CALL**  DETECTION P-VALUE*
    AFFX-BioB-5_at  757.7   P   0.00039
    AFFX-BioB-M_at  933.7   P   0.000095
    AFFX-BioB-3_at  525.6   P   0.000095
    AFFX-BioC-5_at  1999.5  P   0.000044
    AFFX-BioC-3_at  2339.5  P   0.000044
    AFFX-BioDn-5_at 4321.3  P   0.000044
    AFFX-BioDn-3_at 9229.4  P   0.00007
    AFFX-CreX-5_at  21949.9 P   0.000044
    AFFX-CreX-3_at  26022.8 P   0.000044
    AFFX-DapX-5_at  1171.1  P   0.00006
AwaitedOne

1 Answer


In some lines the comment character (#) is not the first character. One way is to drop every line that contains # using grepl (giving "lines2") and then read the remaining lines with read.csv:

lines <- readLines('awaited.csv')
lines1 <- gsub('^ +| +$', '', lines)
lines2 <- lines1[!grepl('^#|^.*#', lines1)]
d1 <- read.csv(text=lines2, check.names=FALSE, stringsAsFactors=FALSE)
str(d1)
#'data.frame':  54682 obs. of  4 variables:
# $ *ID_REF**         : chr  "AFFX-BioB-5_at" "AFFX-BioB-M_at" "AFFX-BioB-3_at" "AFFX-BioC-5_at" ...
# $ VALUE**           : num  758 934 526 2000 2340 ...
# $ ABS_CALL**        : chr  "P" "P" "P" "P" ...
# $ DETECTION P-VALUE*: num  3.9e-04 9.5e-05 9.5e-05 4.4e-05 4.4e-05 4.4e-05 7.0e-05 4.4e-05 4.4e-05 6.0e-05 ...
head(d1,3)
#       *ID_REF** VALUE** ABS_CALL** DETECTION P-VALUE*
#1 AFFX-BioB-5_at   757.7          P            3.9e-04
#2 AFFX-BioB-M_at   933.7          P            9.5e-05
#3 AFFX-BioB-3_at   525.6          P            9.5e-05
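For a self-contained check of the same idea, here is a minimal sketch on a few made-up lines. The sample lines are hypothetical and comma-separated (which is what read.csv expects), not the real file. It uses trimws() and startsWith(), which need R >= 3.3.0. Unlike the grepl() call above, it drops only lines whose first non-blank character is #, so a # appearing later in a data line would be kept.

```r
# Hypothetical stand-in for the file in the question (comma-separated)
lines <- c(
  "#ID_REF = The name of the probe set",
  "    #VALUE = The signal value",        # comment with leading spaces
  "ID_REF,VALUE,ABS_CALL",
  "AFFX-BioB-5_at,757.7,P",
  "AFFX-BioB-M_at,933.7,P"
)

# Strip leading/trailing whitespace so an indented # comes first
trimmed <- trimws(lines)

# Keep only lines that do not start with the comment character
kept <- trimmed[!startsWith(trimmed, "#")]

# Parse the remaining lines; the first kept line becomes the header
d <- read.csv(text = kept, stringsAsFactors = FALSE)
```

With this sample, d has two rows and the columns ID_REF, VALUE and ABS_CALL.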

Alternatively, use the comment.char='#' argument of read.csv. First strip whatever precedes # on the lines that contain it (with sub), so that each comment line starts with #:

d2 <- read.csv(text=sub('.*(#.*)', '\\1', lines),
   check.names=FALSE, stringsAsFactors=FALSE, comment.char='#')
dim(d2)
#[1] 54682     4
head(d2,3)
#       *ID_REF** VALUE** ABS_CALL** DETECTION P-VALUE*
#1 AFFX-BioB-5_at   757.7          P            3.9e-04
#2 AFFX-BioB-M_at   933.7          P            9.5e-05
#3 AFFX-BioB-3_at   525.6          P            9.5e-05
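To see what the sub() step does on its own, here is a small sketch on hypothetical lines (again comma-separated stand-ins, not the real file). The pattern removes everything before the last # on a line and leaves lines without # untouched, so read.csv can then skip them via comment.char.

```r
# Hypothetical sample lines; two comments, one of them indented
lines <- c(
  "#ID_REF = The name of the probe set",
  "    #VALUE = The signal value",
  "ID_REF,VALUE,ABS_CALL",
  "AFFX-BioB-5_at,757.7,P"
)

# Drop whatever precedes '#'; lines without '#' are returned unchanged
cleaned <- sub(".*(#.*)", "\\1", lines)

# Lines that now start with '#' are treated as comments and skipped
d <- read.csv(text = cleaned, comment.char = "#", stringsAsFactors = FALSE)
```

After the substitution the second element of cleaned begins with # instead of spaces, and the data lines are unchanged.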
akrun
  • I get the same error: `Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 2 did not have 2 elements` – AwaitedOne Feb 10 '15 at 17:16
  • @AwaitedOne Try with `fill=TRUE` – akrun Feb 10 '15 at 17:24
  • @AwaitedOne I was copy/pasting your dataset and it worked for me – akrun Feb 10 '15 at 17:26
  • Maybe there is some mystery behind it. The data contains 54000 rows, and I am not sure whether something is going wrong there. I always get the same error when loading the data with `read.table`. I am comfortable loading the data with read.csv. – AwaitedOne Feb 10 '15 at 17:30
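
An error of the form "line N did not have M elements" usually means the lines do not all split into the same number of fields. One way to look into it is count.fields(), which reports how many fields each line has for a given separator. The sketch below uses hypothetical tab-separated lines; the separator of the real file is an assumption, so adjust sep to match it.

```r
# Hypothetical tab-separated sample with one comment line
lines <- c(
  "#ID_REF = The name of the probe set",
  "ID_REF\tVALUE",
  "AFFX-BioB-5_at\t757.7"
)

# Count fields per line; comment.char = "" makes the comment line count too
con <- textConnection(lines)
n <- count.fields(con, sep = "\t", comment.char = "")
close(con)

# Lines whose field count differs from the most common one are suspects
expected <- as.integer(names(which.max(table(n))))
suspects <- which(n != expected)
```

Here the comment line has one field while the others have two, so it is the one flagged in suspects.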