I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of ~30,000 hand-graded articles and their corresponding quality grades). As part of this, I am trying to figure out a way to algorithmically count the number of citations that appear on a page.
As a quick example, here is an excerpt from a raw Wiki page:
'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>
So far, I've concluded that I can find the number of images by counting the occurrences of [[Image: in the raw text. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages with their corresponding live pages, I think I was able to determine that </ref> marks the end of a reference on a Wiki page. For example, in the excerpt above, the author makes a statement at the end of the paragraph and cites Hammond, 58–9 within <ref>{text}</ref>.
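To make the idea concrete, here is a minimal Python sketch of the counting approach described above. The function names (count_images, count_references) are hypothetical and just for illustration. Two things in it are assumptions on my part rather than anything I have confirmed: it counts [[File: as well as [[Image:, since the excerpt uses both prefixes, and it optionally counts self-closing tags such as <ref name="x" />, which reuse an earlier named reference.

```python
import re

# Opening of an image link. Assumption: both "Image:" and "File:" prefixes
# should count, because the excerpt above contains both.
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)

# Closing tag of a reference that has a body: <ref>...</ref>
REF_CLOSE_RE = re.compile(r"</ref\s*>", re.IGNORECASE)

# Self-closing tag that reuses a named reference: <ref name="x" />
REF_REUSE_RE = re.compile(r"<ref\b[^>]*/\s*>", re.IGNORECASE)


def count_images(wikitext):
    """Count image links by their opening [[Image: or [[File: marker."""
    return len(IMAGE_RE.findall(wikitext))


def count_references(wikitext, include_reuses=True):
    """Count references by their closing </ref> tag.

    If include_reuses is True, self-closing <ref ... /> tags are counted too
    (an assumption about what should count as a citation).
    """
    total = len(REF_CLOSE_RE.findall(wikitext))
    if include_reuses:
        total += len(REF_REUSE_RE.findall(wikitext))
    return total


if __name__ == "__main__":
    # Small made-up sample, not taken from a real article.
    sample = (
        "[[Image:A.jpg|thumb|x]]\n[[File:B.jpg|thumb]]\n"
        "Text.<ref>Hammond, 58–9</ref> "
        'More.<ref name="h">Hammond, 60</ref> '
        'Again.<ref name="h" />'
    )
    print(count_images(sample))                            # 2
    print(count_references(sample))                        # 3
    print(count_references(sample, include_reuses=False))  # 2
```

This only does pattern matching on the raw text, so it would also count tags that appear inside comments or nowiki sections. I don't know how common that is in the graded dataset.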
If somebody is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know a better way to do this, please tell me that, too!
Many thanks in advance!