How can I extract multiple items from 1 html using RCrawler's ExtractXpathPat?

Question

I'm trying to get both the label and data of items of a museum collection using Rcrawler. I think I made a mistake using the ExtractXpathPat variable, but I can't figure out how to fix it.

I expect an output like this:

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"Schilderij"
1;"Objectnummer";"SK-A-2931"

However, the output file repeats the title in the third position of every row:

1;"Titel(s)";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objecttype";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"
1;"Objectnummer";"De StaalmeestersDe waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

The HTML looks like this:

<div class="item">
      <h3 class="item-label h4-like">Objectnummer</h3>
      <p class="item-data">SK-A-2931</p>
</div>

My method looks like this:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/", 
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c('//*[@class="item-label h4-like"]', '//*[@class="item-data"]'), 
         PatternsNames = c('label','data'),
         ManyPerPattern = TRUE)

Clarification of goal

The HTML page doesn't always have the same labels and sometimes it has labels without the corresponding data. Sometimes the data is in a paragraph and sometimes in an unordered list.

My end goal is to create a csv that has all the labels of the site with the corresponding data in each row.

This question is to get to the first step of collecting the labels and data, which I will then use to create the above mentioned csv.

E.Wiest · Accepted Answer · 2020-03-04T04:56:20.677

I don't use Rcrawler for scraping, but I think your XPaths need to be fixed. Each pattern has to select the data that belongs to one specific label, so I rewrote them for you:

Rcrawler(Website = "https://www.rijksmuseum.nl/nl/", 
         no_cores = 4, no_conn = 4,
         dataUrlfilter = '.*/collectie/.*',
         ExtractXpathPat = c("//h3[@class='item-label h4-like'][.='Titel(s)']/following-sibling::p/text()",
                             "//h3[@class='item-label h4-like'][.='Objecttype']/following::a[1]/text()",
                             "//h3[@class='item-label h4-like'][.='Objectnummer']/following-sibling::p/text()"),
         PatternsNames = c("Titel(s)", "Objecttype","Objectnummer"),
         ManyPerPattern = TRUE)

I ran it for a few minutes and it seems to work:

DATA[[1]]
$`PageID`
[1] 1

$`Titel(s)`
[1] "De Staalmeesters"                                                                   
[2] "De waardijns van het Amsterdamse lakenbereidersgilde, bekend als ‘De Staalmeesters’"

$Objecttype
[1] "schilderij"

$Objectnummer
[1] "SK-C-6"

More options:

Brute force. Since you don't yet know all the label names, and if you don't want to write specific XPaths, you can try something like this in Rcrawler's ExtractXpathPat:

c("string((//h3[@class='item-label h4-like'])[1]/parent::*)","string((//h3[@class='item-label h4-like'])[2]/parent::*)",...,"string((//h3[@class='item-label h4-like'])[30]/parent::*)")

Here, we just increment the position from 1 to 30. You could try 40 or 50; it's up to you.

PatternsNames = c("Item1", "Item2",...,"Item30")
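
Rather than typing the thirty expressions by hand, the two vectors could be generated. The following is a minimal sketch in base R; the variable names `n_items`, `xpaths` and `pattern_names` are only illustrative and are not part of Rcrawler.

```r
# Number of item positions to try; raise it if pages have more items.
n_items <- 30

# One XPath per position, following the pattern shown above.
xpaths <- sprintf(
  "string((//h3[@class='item-label h4-like'])[%d]/parent::*)",
  seq_len(n_items)
)

# Matching names: "Item1", "Item2", ..., "Item30".
pattern_names <- paste0("Item", seq_len(n_items))
```

The vectors would then be passed as `ExtractXpathPat = xpaths` and `PatternsNames = pattern_names`.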

Example of the result:

Item1:Title(s) The Seven Works of MercyPolyptych with the Seven Works of Charity 
Item2:Object type painting 
Item3:Object number SK-A-2815
...
Item17:Parts The Seven Works of Mercy (SK-A-2815-1) The Seven Works of Mercy (SK-A-2815-2) The Seven Works of Mercy (SK-A-2815-3) The Seven Works of Mercy (SK-A-2815-4) The Seven Works of Mercy (SK-A-2815-5) The Seven Works of Mercy (SK-A-2815-6) The Seven Works of Mercy (SK-A-2815-7)
...
Item29:
Item30:

You then need to tidy the data (split, trim, reorganize...) with appropriate tools (dplyr, stringr) to generate a proper csv.
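
As a rough illustration of that tidying step, here is a base R sketch that separates the label from the data in strings shaped like the result above. It assumes the possible label names are already known; the `raw` values, `known_labels` and the helper `split_item` are hypothetical and only serve the example.

```r
# Hypothetical raw strings, shaped like the brute-force result.
raw <- c(Item1 = "Object type painting ",
         Item2 = "Object number SK-A-2815",
         Item3 = "")

# Assumption: the set of possible labels is known in advance.
known_labels <- c("Object type", "Object number")

# Split one string into its label and the remaining data.
split_item <- function(x, labels) {
  x <- trimws(x)
  hit <- labels[startsWith(x, labels)]
  if (length(hit) == 0) {
    return(c(label = NA_character_, data = NA_character_))
  }
  label <- hit[which.max(nchar(hit))]  # prefer the longest matching label
  c(label = label, data = trimws(substring(x, nchar(label) + 1)))
}

parts <- vapply(raw, split_item, c(label = "", data = ""), labels = known_labels)
tidy <- as.data.frame(t(parts), stringsAsFactors = FALSE)

# Drop empty positions (items that did not exist on the page).
tidy <- tidy[!is.na(tidy$label), ]
```

The resulting data frame has one row per item and could be written out with `write.csv()`.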

If this option doesn't work, you could determine all the label names you could possibly have: get all the //h3[@class='item-label h4-like']/text() values from the webpages and remove duplicates to keep unique values only. Then write the XPaths accordingly. This way the .csv would be easier to generate.
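
Once the label texts have been collected, the XPaths could be built from them. This sketch assumes the data element is always a following sibling of the h3, which may not hold for every label (the Objecttype value above sits in a link, for instance). `all_labels` is a made-up sample, and labels containing a single quote would need extra escaping.

```r
# Hypothetical label texts gathered from several pages, duplicates included.
all_labels <- c("Titel(s)", "Objecttype", "Objectnummer", "Objecttype", "Titel(s)")

# Keep each label once.
labels <- unique(all_labels)

# One XPath per label. Assumption: the data follows the h3 as a sibling element.
xpaths <- sprintf(
  "//h3[@class='item-label h4-like'][.='%s']/following-sibling::*",
  labels
)
```

`labels` can then double as PatternsNames, so every output column is named after its label.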

You could also work outside RCrawler (with other tools) and write some functions to scrape the data properly (with apply functions or for loops).
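
For example, a sketch with the xml2 package could pair each label with its data per item block, whatever the labels are. It assumes every label/data pair sits in its own div with class "item" and that the data element (p or ul) carries the class "item-data"; the second and third blocks of the sample HTML, including the label "Afmetingen", are invented to show the list and missing-data cases.

```r
library(xml2)

# Sample HTML: the first block is from the question, the others are made up.
html <- '<div class="item">
  <h3 class="item-label h4-like">Objectnummer</h3>
  <p class="item-data">SK-A-2931</p>
</div>
<div class="item">
  <h3 class="item-label h4-like">Objecttype</h3>
  <ul class="item-data"><li>schilderij</li></ul>
</div>
<div class="item">
  <h3 class="item-label h4-like">Afmetingen</h3>
</div>'

# Return one row per item block; data is NA when the block has no data element.
extract_items <- function(page) {
  items <- xml_find_all(page, "//div[@class='item']")
  label_nodes <- xml_find_first(items, "./h3[contains(@class, 'item-label')]")
  data_nodes <- xml_find_first(items, "./*[contains(@class, 'item-data')]")
  data.frame(label = xml_text(label_nodes, trim = TRUE),
             data = xml_text(data_nodes, trim = TRUE),
             stringsAsFactors = FALSE)
}

result <- extract_items(read_html(html))
```

Because the lookup is relative to each item block, a label without data yields NA instead of shifting the remaining values.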

I need to get all the item-data with its labels, and not every page has the same data. Is there a way to create a generic XPath that does what your solution does? — Friso, Mar 03 '20 at 17:29
I have updated my answer. ~675 000 works to scrape. That's a long way. Unless you're only interested in the paintings. — E.Wiest, Mar 04 '20 at 04:59
this seems to work, although I haven't gone through all 675000 yet — Friso, Mar 04 '20 at 20:48
