I want to get the out-links of Wikipedia articles. By out-links I mean the links listed in the "What links here" section of a Wikipedia article.

For instance, consider the Wikipedia article on data mining. Its "What links here" section is at: https://en.wikipedia.org/wiki/Special:WhatLinksHere/Data_mining

I tried to use pywikibot as follows.

import pywikibot as pw

site = pw.Site('en', 'wikipedia')
print([
    cat.title()
    for cat in pw.Page(site, 'data mining').categories()
    if 'hidden' not in cat.categoryinfo
])

However, it seems that categories in pywikibot are different from the out-links of Wikipedia articles. Therefore, I am wondering how to do this in Python.

Note: I am not limited to pywikibot and am happy to explore other libraries such as mediawiki.

I am happy to provide more details if needed.

EmJ

1 Answer

Try the Page.embeddedin() and Page.backlinks() methods. You could also directly use the equivalent modules of MediaWiki's API: list=embeddedin and list=backlinks.
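
As a rough sketch of the API route, the snippet below queries the `backlinks` and `embeddedin` list modules using only the standard library. The helper names `build_params` and `fetch_titles`, the User-Agent string, and the page size of 50 are illustrative assumptions, not something from pywikibot or MediaWiki itself.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

# Each list module prefixes its own parameters: "bl" for backlinks, "ei" for embeddedin.
PREFIXES = {"backlinks": "bl", "embeddedin": "ei"}


def build_params(module, title, limit=50):
    """Build the query parameters for one of the two list modules."""
    prefix = PREFIXES[module]
    return {
        "action": "query",
        "format": "json",
        "list": module,
        f"{prefix}title": title,
        f"{prefix}limit": str(limit),  # results per request, not overall
    }


def fetch_titles(module, title, limit=50):
    """Yield titles of pages that link to (or transclude) `title`."""
    params = build_params(module, title, limit)
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(
            url, headers={"User-Agent": "whatlinkshere-sketch/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        for page in data["query"][module]:
            yield page["title"]
        # The API returns a "continue" block while more results remain.
        if "continue" not in data:
            break
        params.update(data["continue"])
```

Since a popular article can have thousands of backlinks, it may be worth capping the generator, for example with `itertools.islice(fetch_titles("backlinks", "Data mining"), 20)`.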

AXO
  • Thank you very much for the answer. It would be really great if you could tell me what the difference between `embeddedin` and `backlinks` is. Do they return similar results? Looking forward to hearing from you :) – EmJ Feb 06 '20 at 07:06
  • @EmJ At the top of the [WhatLinksHere page](https://en.wikipedia.org/wiki/Special:WhatLinksHere/Data_mining), you'll notice a few filters: `Hide transclusions | Hide links | Hide redirects`. WhatLinksHere, by default, retrieves a combination of [transclusions](https://en.wikipedia.org/wiki/Wikipedia:Transclusion) (embeddings) and [backlinks](https://en.wikipedia.org/wiki/Backlink). Using the above methods/modules you should be able to retrieve any of them. – AXO Feb 06 '20 at 07:15
  • Thank you very much. The details you provided were very useful for understanding the difference :) – EmJ Feb 06 '20 at 23:25
  • Hi, I did some testing of the methods you suggested. It looks like pywikibot returns backlinks such as `[[en:User talk:202.58.134.131]]`, which I am not interested in. Is it possible to filter such irrelevant backlinks from the list? Looking forward to hearing from you. Thank you :) – EmJ Feb 08 '20 at 11:22
  • Use the `namespaces` argument. – AXO Feb 08 '20 at 12:48
  • Thank you very much for the comment. However, I still could not figure out how `namespaces` is used. Would you be able to show an example to follow? Thank you very much. I look forward to hearing from you :) – EmJ Mar 02 '20 at 12:12
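
A minimal sketch of how the `namespaces` argument might be used to keep only backlinks from articles. On Wikipedia, namespace 0 is the main (article) namespace, so restricting to it leaves out pages in namespaces such as `User talk`. The helper name `article_backlinks` and the `total=20` cap are illustrative choices, and `main()` assumes pywikibot is installed and configured.

```python
ARTICLE_NAMESPACE = 0  # the main (article) namespace on Wikipedia


def article_backlinks(page, total=20):
    """Return titles of article-namespace pages that link to `page`."""
    links = page.backlinks(namespaces=[ARTICLE_NAMESPACE], total=total)
    return [link.title() for link in links]


def main():
    # Imported here so the helper above can be used without pywikibot installed.
    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')
    page = pywikibot.Page(site, 'Data mining')
    for title in article_backlinks(page):
        print(title)
```

Calling `main()` prints up to 20 article titles. `Page.embeddedin()` accepts the same `namespaces` and `total` arguments, so transclusions can be filtered the same way.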