
I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of ~30,000 hand-graded articles and their corresponding quality grades). As part of this, I am trying to figure out a way to algorithmically count the number of citations that appear on a page.

As a quick example, here is an excerpt from a raw Wiki page:

'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>

So far, I've concluded that I can find the number of images by counting the occurrences of [[Image:. I was hoping I could do something similar for references. After comparing raw Wiki pages with their corresponding live pages, I believe that </ref> marks the end of a reference in the raw markup. For example, in the excerpt above, the author makes a statement at the end of the paragraph and cites Hammond, 58–9 within <ref> {text} </ref>.
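Here is a minimal sketch of the counting idea described above, using only regular expressions on the raw markup. The function name `count_markup` and the sample string are made up for illustration. Note that the excerpt also contains a `[[File:` link, so the sketch assumes both prefixes should count as images. It counts opening `<ref>` tags (including self-closing ones such as `<ref name="h" />`) rather than `</ref>`, and it does not try to detect reused references.

```python
import re

# Image links may start with either "Image:" or "File:" (both appear in the excerpt).
IMAGE_LINK = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)

# Opening or self-closing <ref ...> tags. The \b keeps <references /> from matching,
# and closing </ref> tags are skipped because they start with "</".
REF_TAG = re.compile(r"<ref\b[^>]*>", re.IGNORECASE)


def count_markup(wikitext):
    """Return rough counts of image links and ref tags in raw wiki markup."""
    return {
        "images": len(IMAGE_LINK.findall(wikitext)),
        "ref_tags": len(REF_TAG.findall(wikitext)),
    }


# Hypothetical sample text, loosely modelled on the excerpt above.
sample = (
    "[[Image:A.jpg|thumb|Caption]]\n[[File:B.jpg|thumb]]\n"
    'Text.<ref>Hammond, 58–9</ref> More.<ref name="h" /> <references />'
)
print(count_markup(sample))
```

This is only a first approximation: it treats every ref tag as one citation, whether or not it points to a source.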

If somebody is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know a better way to do this, please tell me that, too!

Many thanks in advance!


Austin

2 Answers

  1. A ref does not always contain a link to a source. Sometimes it contains an explanatory note instead.
  2. You must count not only <ref>...</ref> tags but also footnote templates.
  3. If you need the number of unique refs, you must exclude repeated refs (refs that reuse the same name="xxx" parameter, or footnote templates with identical content, which are grouped automatically).
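The points above could be sketched roughly as follows. This is an illustration under assumptions, not a complete parser: the function name `count_references` is made up, the template list (`sfn`, `sfnp`) is only an example of footnote templates and would need extending, nested templates and the `group` attribute are ignored, and it assumes that template calls differing only in whitespace are merged.

```python
import re

# A <ref> element: either self-closing (<ref name="x" />) or with a body.
REF_RE = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?:/>|>(?P<body>.*?)</ref\s*>)",
    re.IGNORECASE | re.DOTALL,
)
# The name attribute, quoted or unquoted.
NAME_RE = re.compile(
    r"\bname\s*=\s*(?:\"([^\"]*)\"|'([^']*)'|([^\s/>]+))", re.IGNORECASE
)
# Illustrative footnote templates only (assumption); no nested templates.
TEMPLATE_RE = re.compile(
    r"\{\{\s*(?P<tpl>sfnp?)\s*\|(?P<args>[^{}]*)\}\}", re.IGNORECASE
)


def count_references(wikitext):
    """Return (total citation markers, estimated unique references)."""
    total = 0
    named = set()      # refs sharing a name collapse into one entry
    unnamed = 0        # each unnamed ref with a body is its own entry
    templates = set()  # identical template calls collapse into one entry
    for m in REF_RE.finditer(wikitext):
        total += 1
        name = NAME_RE.search(m.group("attrs"))
        if name:
            named.add(next(g for g in name.groups() if g is not None))
        elif m.group("body") is not None:
            unnamed += 1
    for m in TEMPLATE_RE.finditer(wikitext):
        total += 1
        args = "|".join(p.strip() for p in m.group("args").split("|"))
        templates.add((m.group("tpl").lower(), args))
    return total, len(named) + unnamed + len(templates)


# Hypothetical sample markup.
text = (
    'A.<ref name="hammond">Hammond, 58–9</ref> B.<ref name="hammond" /> '
    "C.<ref>Smith 2001</ref> D.<ref>Smith 2001</ref> "
    "E.{{sfn|Jones|1999|p=4}} F.{{sfn| Jones | 1999 | p=4 }} <references />"
)
print(count_references(text))
```

This does not address point 1: telling a real source apart from an explanatory note would need inspection of each ref's content.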

Sorry for my English.

Siarhei

Counting reference tags in wiki markup isn't necessarily accurate, as references can be reused, so two </ref> tags may show up as only one reference in the list at the end. There is an API that should give a list of the articles, but for some reason it's deactivated. BeautifulSoup makes this pretty simple, though. I haven't tested this to check that it counts all articles correctly, but it works:

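from bs4 import BeautifulSoup
import requests

# Fetch the rendered article HTML rather than the raw wiki markup.
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the rendered reference list has a span with class "reference-text".
count = len(soup.find_all('span', class_='reference-text'))
print(count)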
smartse