Selecting xml_nodes for patent data using library(rvest) and library(XML) in R

Question

Given the following (quasi-)XML structure from the EPO's patent server REPO:

<ep-patent-document id="EP79301547B1" file="EP79301547NWB1.xml" lang="en" country="EP" doc-number="0007815" kind="B1" date-publ="19871021" status="n" dtd-version="ep-patent-document-v1-1">
<SDOBI lang="en">
<B000>...</B000>
<B100>...</B100>
<B200>
<B210>79301547.0</B210>
<B220>
<date>19790801</date>
</B220>
<B240/>
<B250>en</B250>
<B251EP>en</B251EP>
<B260>en</B260>
</B200>
<B300>...</B300>
<B400>...</B400>
<B500>...</B500>
<B700>...</B700>
<B800>...</B800>
</SDOBI>
<!--  EPO <DP n="1">  -->
<!--  EPO <DP n="2">  -->
<description id="desc" lang="en">...</description>
<claims id="claims01" lang="en">...</claims>
<claims id="claims02" lang="de">...</claims>
<claims id="claims03" lang="fr">...</claims>
</ep-patent-document>

I would like to select the number in node "B210" and the text in "description".

Using

library(httr)
library(rvest)
library(XML)
library(magrittr)

files1993 <- list.files("~/Downloads", full.names=TRUE, recursive=TRUE)
y <- files1993[1]
parse1993 <- htmlParse(y) 

parse1993 %>% xml_nodes("description")
parse1993 %>% xml_nodes("SDOBI") %>% xml_nodes("B210")

I do get the description text but nothing for B210. In fact, the command won't work for any information given in <SDOBI>. Do I have to convert the information given in SDOBI into text? I am a little lost here. Any help is highly appreciated.

why not use `xmlParse`? `parse1993 %>% xml_nodes("SDOBI") %>% xml_nodes("B210")` works fine then — hrbrmstr, Feb 23 '15 at 14:29
Try `xml2` from Hadley: `library(xml2);xml <- xml('...your.example...');xml_text(xml_find(xml, "//B210 | //description"));# [1] "79301547.0" "..." `. — lukeA, Feb 23 '15 at 14:32
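
A minimal sketch of what the two comments suggest, written against the current xml2 API (`read_xml`, `xml_find_first`, `xml_find_all`, `xml_text`) rather than the early development functions quoted above. The inline `patent_xml` string is a trimmed stand-in for one EPO file, and its description text is a placeholder, not real patent content; with real data you would pass the file path to `read_xml()` instead. The likely cause of the original problem is that the HTML parser lower-cases element names, so an upper-case selector such as `B210` no longer matches, while the already lower-case `description` still does.

```r
library(xml2)

# Trimmed stand-in for one EPO document (assumption: the real files share this layout).
# The description text is a placeholder.
patent_xml <- '<ep-patent-document id="EP79301547B1" lang="en">
  <SDOBI lang="en">
    <B200>
      <B210>79301547.0</B210>
      <B220><date>19790801</date></B220>
    </B200>
  </SDOBI>
  <description id="desc" lang="en">Placeholder description text.</description>
</ep-patent-document>'

# Parse as XML so that element names keep their original case.
doc <- read_xml(patent_xml)

# XPath is case-sensitive, so the names must match the file exactly.
application_number <- xml_text(xml_find_first(doc, "//SDOBI//B210"))
description_text   <- xml_text(xml_find_first(doc, "//description"))

print(application_number)
print(description_text)

# For comparison, parse the same text as HTML and count matches for both spellings.
# libxml2's HTML parser is expected to lower-case tag names, so only the
# lower-case query should find the node.
html_doc <- read_html(patent_xml)
print(length(xml_find_all(html_doc, "//B210")))
print(length(xml_find_all(html_doc, "//b210")))
```

To process many files, the same two `xml_find_first` calls can be wrapped in a function and applied over the vector returned by `list.files()`.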

score 0 · Answer 1 · answered Feb 20 '16 at 00:13

I am sorry that this response is a bit late, but I wanted to respond anyway in case someone else needs help with the same topic.

First of all, working with the EPO API is a huge pain in the butt. Their XML is a bear, and the data can be quite dirty and inconsistent.

PatentData.io seems to be a better option. They have the EPO data sets, cleaned and piped out through a modern RESTful JSON API. rjson is much easier to work with. They also provide some cool advanced search and analytics functions if you are looking to get fancy.

They are still in beta now but I think they are actively taking new beta users. Check it out.
