I have a CSV file that mixes quoted and unquoted fields, which gives R problems when trying to read it in. The issue arises with commas inside the quotes: R splits on them, but I want them ignored. Excel handles the file perfectly and knows where to break. Is there a way to view those settings or translate them to R?

Here is the link to download the file in question. It's a set of gene ontologies and their associated terms, and whether or not the gene is part of it (0 or 1). It should be 4 columns of text, 1 column of pValues, and 50 columns of 0/1.

I've tried reading it into R with read.table(file, quote="\"", sep=",", row.names=NULL), but the values from Category, Name, and Verbose ID spill into the pValue column and then affect the count data. Entire rows of data may then end up in one cell until another misinterpreted delimiter arises.

Here's an example problem line, with some of the last columns of 0/1 redacted for length.

"Pubmed","Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis.","22327010","pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i...",0.005837270080633278,0,0,0,0,0,1,0,...
TomNash
  • I followed the link, but it wasn't immediately obvious (for those not used to the site) how to get just the CSV you describe here ... – Ben Bolker Jul 14 '16 at 14:55

3 Answers

It turns out that data.table::fread does exactly what I want. I found it through this SO post.
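For anyone who wants to try this, here is a minimal sketch. It assumes the data.table package is installed, and the two-row file, its column names (ID, geneA, geneB) and its values are made up for illustration, not taken from the real download. It only shows that fread keeps commas inside quoted fields as part of the field when left on its default settings.

```r
library(data.table)

# Hypothetical stand-in for the real file: quoted text fields containing commas,
# followed by unquoted numeric fields. Names and values are illustrative only.
csv_lines <- c(
  'Category,Name,ID,Verbose ID,pValue,geneA,geneB',
  '"Pubmed","Growth factors, erbBs, in the nasal mucosa.","22327010","pubmed_22327010_Growth_factors,_erbBs",0.0058,0,1',
  '"GO","plain term without commas","GO:0000001","go_plain_term",0.0412,1,0'
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# fread detects the separator and treats double-quoted fields as single values
dt <- fread(tmp)

print(dim(dt))      # number of rows and columns that were parsed
print(dt$Name[1])   # the embedded commas should still be inside this one field
```

If the real file still misparses, comparing dim(dt) against the expected 55 columns is a quick way to see whether a row has been split in the wrong place.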

TomNash

Hmm, I can't replicate. Using quote="\"", sep="," seems to give what you're asking for ...

 example_line <- '"Pubmed","Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis.","22327010","pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i...",0.005837270080633278,0,0,0,0,0,1,0'
 r <- read.table(header=FALSE,quote="\"",sep=",",text=example_line,stringsAsFactors=FALSE)
 str(r)
## 'data.frame':    1 obs. of  12 variables:
##  $ V1 : chr "Pubmed"
##  $ V2 : chr "Expression of epidermal growth factors, erbBs, in the nasal mucosa of patients with chronic hypertrophic rhinitis."
##  $ V3 : int 22327010
##  $ V4 : chr "pubmed_22327010_Expression_of_epidermal_growth_factors,_erbBs,_i..."
##  $ V5 : num 0.00584
##  $ V6 : int 0
##  $ V7 : int 0
##  $ V8 : int 0
##  $ V9 : int 0
##  $ V10: int 0
##  $ V11: int 1
##  $ V12: int 0
Ben Bolker

read_csv from the readr package is also a possibility. It can apparently deal with all sorts of oddities.
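A minimal sketch of what that might look like, assuming the readr package is installed. The small file below is invented for illustration (the column names ID, geneA and geneB and all values are hypothetical), so treat it as a check of the quoting behaviour rather than a test on the real data.

```r
library(readr)

# Hypothetical two-row file mimicking the mixed quoted/unquoted layout
example_lines <- c(
  'Category,Name,ID,Verbose ID,pValue,geneA,geneB',
  '"Pubmed","Growth factors, erbBs, in the nasal mucosa.","22327010","pubmed_22327010_Growth_factors,_erbBs",0.0058,0,1',
  '"GO","plain term without commas","GO:0000001","go_plain_term",0.0412,1,0'
)
example_file <- tempfile(fileext = ".csv")
writeLines(example_lines, example_file)

# col_types = cols() lets readr guess the column types without printing the guess
df <- read_csv(example_file, col_types = cols())

print(dim(df))      # number of rows and columns that were parsed
print(df$Name[1])   # the quoted field should come back whole, commas included
```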

Ulrik