I'm trying to extract interlanguage-related articles from a Wikidata dump. After searching on the internet, I found a tool named Wikidata Toolkit that helps work with this type of data, but there is no information about how to find related articles in different languages. For example, the article "Dresden" in English is related to the article "Dresda" in Italian; the second one is the translated version of the first. I tried to use the toolkit, but I couldn't find a solution. Please write an example of how to find these related articles.
- Ideas: https://stackoverflow.com/questions/48332827/how-to-get-associated-english-wikipedia-page-from-wikidata-page-q-number-usi#comment83696903_48332827 – Stanislav Kralin Jan 23 '18 at 06:42
- Thank you Stanislav. I need to investigate the full versions of English Wikipedia articles (with their content) and their Spanish translated versions. Do you know how to extract these articles and their translated versions using Wikidata Toolkit? Could you please point me to the methods of Wikidata Toolkit that are related to extracting these interlingual related articles? – SahelSoft Jan 23 '18 at 07:00
- See the example file [SitelinksExample.java](https://github.com/Wikidata/Wikidata-Toolkit/blob/master/wdtk-examples/src/main/java/org/wikidata/wdtk/examples/SitelinksExample.java). – Tgr Jan 30 '18 at 06:41
- Thanks @Tgr, but this example doesn't extract interlanguage articles :( – SahelSoft Jan 30 '18 at 13:31
- Well, no, it's the *Wikidata* toolkit. Wikidata does not contain those articles. But the toolkit tells you what the articles are. – Tgr Jan 31 '18 at 01:19
1 Answer
You can use the Wikidata dump [1] to get a mapping of articles among Wikipedias in multiple languages.

For example, if you look at the Wikidata entry for Respiratory system [2], at the bottom you see all the articles referring to the same topic in other languages.

That mapping is available in the Wikidata dump. Just download the dump, extract the mapping, and then get the corresponding text from the Wikipedia dump. You might encounter some other issues along the way, such as resolving Wikipedia redirects.

[1] https://dumps.wikimedia.org/wikidatawiki/entities/ [2] https://www.wikidata.org/wiki/Q7891
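The "get the mapping" step above can be sketched as follows. This is a minimal Python sketch, assuming the layout of the Wikidata JSON dump, where each line is one entity whose `sitelinks` object maps wiki IDs (e.g. `enwiki`, `itwiki`) to page titles; the sample entity below is abbreviated illustrative data, not real dump content.

```python
import json

def sitelink_mapping(entity_json, wikis=("enwiki", "itwiki")):
    """Extract {wiki_id: page_title} for the requested wikis
    from one line of the Wikidata JSON entity dump."""
    entity = json.loads(entity_json)
    links = entity.get("sitelinks", {})
    return {w: links[w]["title"] for w in wikis if w in links}

# Abbreviated entity in the dump's JSON shape (illustrative only).
line = '''{"id": "Q1731",
           "sitelinks": {"enwiki": {"site": "enwiki", "title": "Dresden"},
                         "itwiki": {"site": "itwiki", "title": "Dresda"}}}'''

print(sitelink_mapping(line))  # {'enwiki': 'Dresden', 'itwiki': 'Dresda'}
```

In practice you would stream the (very large) dump file line by line and collect such pairs only for entities that have sitelinks in all the languages you care about.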

David Przybilla
- Thanks @David. Does Wikidata Toolkit give me the content (text) of each related article, or should I write code myself to extract it? The dump file is huge, and it's difficult for me to download and analyze it. – SahelSoft Feb 05 '18 at 08:51
- Could you please give me the address of the Wikipedia dump? I can't find any Wikipedia dump. Apparently it is combined with the Wikimedia project, and I don't know which file I should download. Thank you. – SahelSoft Feb 05 '18 at 10:06
- I think Wikidata might contain the English abstract, but definitely not the text for all languages. – David Przybilla Feb 05 '18 at 10:53
- @SahelSoft What you can do is use a project like https://github.com/idio/json-wikipedia to generate a JSON Wikipedia for the languages that you need. – David Przybilla Feb 05 '18 at 10:54
- @SahelSoft With respect to the dumps: you can find them at https://dumps.wikimedia.org/backup-index.html. For example, enwiki-20180201-pages-articles-multistream.xml.bz2 is the English Wikipedia; the eswiki pages-articles file would be the Spanish Wikipedia, and so on. – David Przybilla Feb 05 '18 at 10:55
- Thank you @David. The GitHub link might be useful. Another question: what is the difference between a standard article and a redirect one? – SahelSoft Feb 05 '18 at 20:43
- @SahelSoft A Wikipedia article can have many aliases, for example https://en.wikipedia.org/wiki/New_York_City and https://en.wikipedia.org/wiki/NYC. A redirect takes aliases to their canonical name. The canonical name can change between Wikipedia dumps; that is one of the motivations for Wikidata to use more abstract names for topics, such as Q123 (for example). – David Przybilla Feb 06 '18 at 00:34
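The redirect resolution mentioned in the answer and the comment above can be sketched like this. It assumes you have already parsed the dump's redirect entries into a dict of alias title to target title (the names and sample data here are hypothetical, for illustration); following the chain with a `seen` set guards against redirect loops.

```python
def resolve_redirect(title, redirects):
    """Follow redirect chains (alias -> target) to a canonical title.
    `redirects` is a dict built from the dump's redirect entries."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title

# Hypothetical redirect table parsed from a dump.
redirects = {"NYC": "New York City", "New York, New York": "New York City"}

print(resolve_redirect("NYC", redirects))            # New York City
print(resolve_redirect("New York City", redirects))  # New York City
```

Resolving aliases to one canonical title per topic makes it much easier to join article text from the Wikipedia dump against the sitelink titles found in the Wikidata dump.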