Importing data associated with a given CAS number from the NIST webbook web site into R

Question

I would like to retrieve information associated with a given CAS registry number (Chemical Abstracts Service nr) from the NIST webbook web site in R, using the provided API.

For example, for CAS number "19431-79-9" (Caryophylladienol II), http://webbook.nist.gov/cgi/cbook.cgi?ID=19431-79-9&Units=SI&Mask=2000#Gas-Chrom, I got as far as:

casno = "19431-79-9"
casno2 = gsub("-", "", casno)
raw=readLines(paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep=""))

# mass spec, empty here, but not e.g. for casno2="630035" 
casno2="630035"
jcampfile = readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C",casno2,"&Index=0&Type=Mass",sep=""))
if (jcampfile[[1]]=="##TITLE=Spectrum not found.") jcampfile=NA              

casno2 = gsub("-", "", casno)
# molecular structure
molfile2d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str2File=C",casno2,sep=""))
# test the length: comparing against character(0) yields logical(0), which makes if() fail
if (length(molfile2d)==0) molfile2d=NA
molfile3d=readLines(paste("http://webbook.nist.gov/cgi/cbook.cgi?Str3File=C",casno2,sep=""))
if (length(molfile3d)==0) molfile3d=NA

From the following bits of the raw output I would then like to extract the following variables & lists:

"name=\" Top \">Caryophylladienol II</a></h1>" 
-> name="Caryophylladienol II"

"Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>\n \n \n<li><strong>" 
-> formula="C15H24O"

"Molecular weight</a>:</strong> 220.3505</li>\n \n \n<li>" 
-> MW=220.3505

"IUPAC Standard InChI:</strong>\n \n<br /><table>\n<tr><td>\n<ul style=\" list-style-type: circle;\">\n<li><tt>InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1</tt></li>\n" 
-> InChI="InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1" 

"IUPAC Standard InChIKey:</strong>\n<tt>CIIYOYPOMGIECX-JXQTWKCFSA-N</tt>" 
-> InChiKey="CIIYOYPOMGIECX-JXQTWKCFSA-N"

"Stereoisomers:....<strong>
-> stereoisomers=XXX (list of stereoisomers)

"Other names:...\n"
-> synonyms=XXX (list of synonyms)

"Normal alkane RI..."
-> list of measured RIs plus on which column they were measured
e.g. here RIs=c(1637,1631,1627,1656,1615,1638,1628,1602,1611,1635,1622,1622,1627); columns=c("HP-5 MS","DB-5","RTX-1","Col-Elite 5MS","DB-5","DB-5","DB-5","DB-1","DB-5","CP Sil 5 CB","BP-1","RTX-1","DB-5")

Any thoughts on how I would best do the latter type of parsing? Ideally this should then all be wrapped into a function that takes a list of CAS nrs as input, annotates them using info from the NIST webbook, and writes them to a text file. But no need to have it so polished - anything to get me started would help really!

Edit: I have been trying to parse the html file using htmlTreeParse in package XML, but I am not quite succeeding. Would anyone with a bit more experience with that function be able to help me out a bit by any chance?

Edit: I have figured out a solution to import the data in Mathematica, see https://mathematica.stackexchange.com/questions/37091/look-up-info-associated-with-a-given-cas-chemical-identifier-from-the-nist-webbo. If anyone would have the skill to port that code to R please let me know!

As far as processing the raw strings to get just your variables of interest, it looks like you need a `grep` approach using `gsub` to find everything between < and > (since these are HTML formatting codes) and then replace it with an empty string. Once that is done, you can go at it on a case-by-case basis to get each variable formatted just the way you want it. — Bryan Hanson, Nov 15 '13 at 16:26
Yes I thought so - thanks - but I am just not so good with these pattern matching things. How would I find a string that lies between the substrings " Top \">" and "</a>", say? — Tom Wenseleers, Nov 15 '13 at 16:45
There are those on this list who are quite fluent, but I'm not. Download yourself a regex cheat sheet and have a go at it. If you get stuck, that'd be a topic for a new question. Having done a lot of this, you'll be much better off to clean out all html and then look at the result, since probably 95% of the crap is html. Two separate stages, you won't get it all in one pass. — Bryan Hanson, Nov 15 '13 at 17:54
In Mathematica there is a nice option to import a web page as Data apparently, see http://mathematica.stackexchange.com/questions/37091/importing-data-associated-with-a-given-cas-number-from-the-nist-webbook-web-site, so the task seems relatively easier there. There is no corresponding R package by any chance that does something similar, and gets rid of all the HTML code etc? — Tom Wenseleers, Nov 15 '13 at 18:02
All things are possible with `R`! This [answer](http://stackoverflow.com/questions/10225690/removing-data-with-tags-from-a-vector) explains how to process your data manually or with a bit more automation. — Bryan Hanson, Nov 15 '13 at 20:34
Ha many thanks - htmlTreeParse seems exactly what I need!! Thanks a lot for the pointer! I'll post back here when my function is finished, as I thought it would be useful for many people... — Tom Wenseleers, Nov 15 '13 at 20:40
I've seen the entire workflow that you are after in blog posts before, but html is a worthless keyword, so those posts are hard to find. Anyway, you are making headway now. Have fun. — Bryan Hanson, Nov 15 '13 at 20:42
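
The two-stage approach suggested in the comments above (strip the tags first, then tidy each field) can be sketched in base R roughly as follows. This is only an illustration on the fragments quoted in the question, not a general HTML parser. The helper name `strip_tags` and the `raw_*` variables are made up for this example.

```r
# Fragments copied from the raw page source quoted in the question
raw_name    <- "name=\" Top \">Caryophylladienol II</a></h1>"
raw_formula <- "Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>"
raw_mw      <- "Molecular weight</a>:</strong> 220.3505</li>"

# Stage 1 (hypothetical helper): delete everything that looks like <...>
strip_tags <- function(x) gsub("<[^>]+>", "", x)

# Text between two known substrings: capture it with a lazy group
name <- sub(".*Top \">(.*?)</a>.*", "\\1", raw_name, perl = TRUE)

# Stage 2: case-by-case cleanup of the tag-free strings
formula <- trimws(sub("^Formula:", "", strip_tags(raw_formula)))
mw      <- as.numeric(sub("^Molecular weight:", "", strip_tags(raw_mw)))
```

Regular expressions like these are brittle if the page layout changes, which is one reason a real HTML parser (such as the XML package used in the answer below) tends to be the safer route.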

score 2 · Accepted Answer · answered Nov 18 '13 at 17:11

For the first URL string in your question, try

library(XML)  # provides htmlParse, xpathSApply, xmlValue and readHTMLTable
casno = "19431-79-9"
url <- paste('http://webbook.nist.gov/cgi/cbook.cgi?ID=',casno,'&Units=SI&Mask=2000#Gas-Chrom', sep="")
doc <- htmlParse(url)

name <- xpathSApply(doc, "//a[@id='Top']", xmlValue)
name
[1] "Caryophylladienol II"

Grab all lists with a bold title (some output truncated for display)

x <- xpathSApply(doc, "//li/strong/..", xmlValue)
x

[1] "Formula: C15H24O" 
[2] "Molecular weight: 220.3505" 
[3] "IUPAC Standard InChI:\n\n\nInChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2, ...
[4] "IUPAC Standard InChIKey:\nCIIYOYPOMGIECX-JXQTWKCFSA-N" 
[5] "CAS Registry Number: 19431-79-9"  
[6] "Chemical structure: \nThis structure is also available as a 2d Mol file\n
[7] "Species with the same structure:\nCaryophylla-4(14), 8(15)-dien-5-ol\n\n"
[8] "Stereoisomers:\nCaryophylladienol I\nCaryophylla-3(15),7(14)-dien-6-ol\n«alpha»-Caryophylladienol\nExo methylene ...
[9] "Other names:\nCaryophylla-4(14),8(15)-dien-5«alpha»-ol;\nCaryophylla-2(12),6(13)-dien-5-«alpha»-ol;\nCaryophylla ...
[10] "Information on this page:\nGas Chromatography\nReferences\nNotes / Error Report\n\n"
[11] "Options:\nSwitch to calorie-based units\n\n"

If you are only writing to a file, then you could fix the delimited list in element 8 (replace newlines with semicolon) and remove the remaining newlines.

x <- gsub(":\n", ": ", x) 
x[8] <- gsub("\n+", ";", x[8])
x <- gsub("\n", "", x)
x <- gsub("Download the identifier in a file.", "", x, fixed=TRUE)
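
Hard-coding `x[8]` only works while the page lists its fields in the same order. A possibly more robust variant, sketched here on a shortened stand-in for `x` (three elements abbreviated from the output above), looks the element up by its label and then splits every entry into a named field. The `fields` vector and the trailing-semicolon cleanup are additions for illustration, not part of the code above.

```r
# Shortened stand-in for the vector returned by xpathSApply above
x <- c(
  "Formula: C15H24O",
  "IUPAC Standard InChIKey:\nCIIYOYPOMGIECX-JXQTWKCFSA-N",
  "Stereoisomers:\nCaryophylladienol I\nCaryophylla-3(15),7(14)-dien-6-ol\n\n"
)

x <- gsub(":\n", ": ", x)            # put label and value on one line
i <- grep("^Stereoisomers:", x)      # find the list by label, not by position
x[i] <- gsub("\n+", ";", x[i])       # newline-delimited list -> semicolons
x[i] <- sub(";$", "", x[i])          # drop the trailing delimiter
x <- gsub("\n", "", x)               # remove any remaining newlines

# Named character vector: label -> value
fields <- setNames(sub("^[^:]+: *", "", x), sub(":.*$", "", x))
```

With this, `fields[["Formula"]]` or `fields[["Stereoisomers"]]` can be picked out by name when writing the output file.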

Use readHTMLTable for tables

y <-readHTMLTable(doc, stringsAsFactors=FALSE)

then count rows to find the correct table and get values

sapply(y, nrow)
NULL NULL NULL NULL NULL NULL 
   1    1    5   13    6    1 

y[[4]][,2:3]
    Active phase     I
1        HP-5 MS 1637.
2        DB-5 MS 1631.
3          RTX-1 1627.
4  Col-Elite 5MS 1656.
5           DB-5 1615.
...

ri <- paste0(gsub(".", "", y[[4]][,3], fixed=TRUE), "=", y[[4]][,2], collapse=";")
ri
[1] "1637=HP-5 MS;1631=DB-5 MS;1627=RTX-1;1656=Col-Elite 5MS;1615=DB-5;1638=DB-5;1628=DB-5;1602=DB-1;1611=DB-5;1635=CP Sil 5 CB;1622=BP-1;1622=RTX-1;1627=DB-5"
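
If you would rather keep the values as separate vectors (the `RIs` and `columns` asked for in the question) than as one pasted string, something along these lines might do. The small data frame stands in for `y[[4]]`. Only the "Active phase" and "I" headers and the three rows come from the output above, while the "Column type" header and its values are placeholders.

```r
# Stand-in for y[[4]]; the first column is a placeholder, not taken from the page
tab <- data.frame(
  `Column type`  = c("Capillary", "Capillary", "Capillary"),
  `Active phase` = c("HP-5 MS", "DB-5 MS", "RTX-1"),
  I              = c("1637.", "1631.", "1627."),
  check.names = FALSE, stringsAsFactors = FALSE
)

# as.numeric() accepts the trailing "." so no string surgery is needed
RIs     <- as.numeric(tab[["I"]])
columns <- tab[["Active phase"]]

# Keep both together for later filtering or summarising
ri_table <- data.frame(RI = RIs, column = columns, stringsAsFactors = FALSE)
```

Selecting the columns by name rather than by position (`2:3`) also keeps working if a table has an extra column in front of "I".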

Finally, combine and write to a file

cas <- c(paste("Name:", name), x[c(1:5,7:9)], paste("RI:", ri) )
write( cas, file="cas.out")

There are other ways to grab the values in unordered lists, for example, to get all stereoisomers as a vector...

stereo <- xpathSApply(doc, "//li/strong[text()='Stereoisomers:']/../ul/li/a", xmlValue)
 [1] "Caryophylladienol I"                       "Caryophylla-3(15),7(14)-dien-6-ol"         "«alpha»-Caryophylladienol"                
 [4] "Exo methylene isomer of Caryophyllenol I"  "«beta»-Caryophylla-4(14),8(15)-dien-5-ol"  "Caryophylla-4(12),8(13)-dien-5-«beta»-ol" 
 [7] "Caryophylla-4,8-dien-5-ol"                 "Caryophylla-4(12),8(13) diene 5 «beta»-ol" "Caryophyla-4(14),8(15)-dien-5-ol"         
[10] "Caryophylla-4(12).8(13)-diene-5«beta»-ol"  "2(12),6(13)-Caryophylladien-5-ol"

and then write multiple lines to a file instead.

paste("Stereoisomer:", stereo)

Oh yes and one further question - once you have found the nrs of rows in each of the RI tables using nrows=sapply(y, nrow), how do I automatically select only those tables for which nrows is not NULL, and combine them using rbind? (tables which have 5 columns would first have to have a column with "NA"s inserted before column "I" though (this is because some RI tables have 5 and others 6 columns) — Tom Wenseleers, Nov 19 '13 at 09:34
Add some checks and then rbind. n <-sapply(y, function(z) nrow(z)>1 & names(z)[3]=="I") and then data.frame(do.call("rbind", y[n])[,2:3], row.names=NULL) — Chris S., Nov 19 '13 at 18:49
Hey there - if I try this n <-sapply(y, function(z) nrow(z)>1 & names(z)[3]=="I") and then data.frame(do.call("rbind", y[n])[,2:3], row.names=NULL) I get the error Error in y[n] : invalid subscript type 'list' - what am I doing wrong? — Tom Wenseleers, Nov 20 '13 at 16:18
Not sure. Make sure n is a logical vector and check length(y[n]) and class(y[n]). You may need to check column names as well before "rbind" to make sure they are the same (but that should throw a different error about match.names) — Chris S., Nov 20 '13 at 18:39
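
One likely cause of the "invalid subscript type 'list'" error: `sapply` only simplifies to a vector when every result has length one. If an element of `y` is NULL, the test returns `logical(0)` and `sapply` hands back a list. A sketch of a stricter version using `vapply`, which also selects the columns by name so that 5- and 6-column tables can be combined without padding, is below. The list `y` here is a made-up stand-in, and all column names other than "Active phase" and "I" are hypothetical.

```r
# Made-up stand-in for the list returned by readHTMLTable
y <- list(
  NULL,
  data.frame(a = "x", stringsAsFactors = FALSE),
  data.frame(`Column type` = c("Capillary", "Capillary"),
             `Active phase` = c("HP-5 MS", "DB-5"),
             I = c("1637.", "1615."),
             Reference = c("r1", "r2"), Comment = c("c1", "c2"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(`Column type` = "Capillary", `Active phase` = "DB-1",
             `Temperature (C)` = "100.", I = "1602.",
             Reference = "r3", Comment = "c3",
             check.names = FALSE, stringsAsFactors = FALSE)
)

# vapply guarantees exactly one TRUE/FALSE per element, even for NULL entries
is_ri <- vapply(
  y,
  function(z) is.data.frame(z) && nrow(z) > 0 && "I" %in% names(z),
  logical(1)
)

# Pick the two columns of interest by name, then stack the tables
ri_all <- do.call(rbind, lapply(unname(y[is_ri]),
                                function(z) z[, c("Active phase", "I")]))
rownames(ri_all) <- NULL
```

This assumes at least one table matches; if none does, `do.call(rbind, ...)` returns NULL and the last line would need a guard.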
