2

I wanna use Sheets to query pages from wikidata and scrape a specific section but I couldn't find anything focused specifically on this, and since I'm a beginner in this I don't know really where to start from. So, I have a list of Q identifiers and I'd like to use them to query the page, and then check if there's a specific section there (or scrape the data from it if possible) otherwise return false. I started with what I found here

=ImportXml(concat("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&
ppprop=wikibase_item&redirects=1&format=xml&titles=",G1),"//@wikibase_item")

but used instead the link for wiki data (it generates a working link, but I'm not sure if this works as the equivalent of an API and if I can further query it to get data) and the wiki Property code that I wanna get (date of death, /wiki/Property:P570), but I get the "Imported content is empty." error, for this link. Ideally, I would've like to get the date of death value (20 November 2014), or at least a TRUE , meaning that the section exists, and the person is dead.

=IMPORTXML(CONCAT("https://www.wikidata.org/wiki/",A2),"/wiki/Property:P570")

So I might have some Q-links that don't have this section/property at all, and for which I should get an error, but I don't know why it's not working for this one either, do I have to make Xpath to div, or can I use the wiki Property?

I hope this makes sense, I'll put the sample sheet here. Thanks

player0
  • 124,011
  • 12
  • 67
  • 124
Debs
  • 133
  • 7

1 Answers1

4

try:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")

enter image description here

or:

=QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null")<>""

and for no match:

=IFERROR(QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A1, "//*"), 
 "select Col2 where Col1 = 'date of death' and Col2 is not null"), FALSE)

enter image description here


=IFERROR(REGEXEXTRACT(QUERY(IMPORTXML("wikidata.org/wiki/"&A2, "//*"), 
 "select Col2 where Col1 = 'date of birth' and Col2 is not null"), 
 "(.*) \d.*reference.*"), FALSE)
player0
  • 124,011
  • 12
  • 67
  • 124
  • It's an awesome solution. Do you have any idea which version of xpath is compatible with IMPORTXML? – SIM Oct 15 '21 at 22:20
  • 2
    @SIM gs uses the 1.0 version tho not all quirks are fully supported so lets say its 0.85 ;) – player0 Oct 15 '21 at 22:35
  • Thanks, I ended up using a combination between the first one and last one – Debs Oct 16 '21 at 07:11
  • and adding a Regex to extract just the year (without the reference part). =IFERROR(REGEXEXTRACT(QUERY(IMPORTXML("https://www.wikidata.org/wiki/"&A2, "//*"), "select Col2 where Col1 = 'date of birth' and Col2 is not null"),"(.*)\d.*reference.*"),FALSE) – Debs Oct 16 '21 at 07:38