I am currently using the code below to count the number of text elements in the XML file.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('wiki.xml'), 'lxml')
count = 0
for text in soup.find_all('text', recursive=False):
    count += 1
print(count)
I am unable to show the full XML file because of its size, but here is a short snippet of it...
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>simplewiki</dbname>
<base>https://simple.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.30.0-wmf.14</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2" case="first-letter">Media</namespace>
<namespace key="-1" case="first-letter">Special</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Talk</namespace>
<namespace key="2" case="first-letter">User</namespace>
<namespace key="3" case="first-letter">User talk</namespace>
<namespace key="4" case="first-letter">Wikipedia</namespace>
<namespace key="5" case="first-letter">Wikipedia talk</namespace>
<namespace key="6" case="first-letter">File</namespace>
<namespace key="7" case="first-letter">File talk</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">MediaWiki talk</namespace>
<namespace key="10" case="first-letter">Template</namespace>
<namespace key="11" case="first-letter">Template talk</namespace>
<namespace key="12" case="first-letter">Help</namespace>
<namespace key="13" case="first-letter">Help talk</namespace>
<namespace key="14" case="first-letter">Category</namespace>
<namespace key="15" case="first-letter">Category talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Module talk</namespace>
<namespace key="2300" case="first-letter">Gadget</namespace>
<namespace key="2301" case="first-letter">Gadget talk</namespace>
<namespace key="2302" case="case-sensitive">Gadget definition</namespace>
<namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
<namespace key="2600" case="first-letter">Topic</namespace>
</namespaces>
</siteinfo>
<page>
<title>April</title>
<ns>0</ns>
<id>1</id>
<revision>
<id>5753795</id>
<parentid>5732421</parentid>
<timestamp>2017-08-11T21:06:32Z</timestamp>
<contributor>
<ip>2602:306:3433:C7F0:188F:FDE3:9FBE:D0B0</ip>
</contributor>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">{{monththisyear|4}}
'''April''' is the fourth [[month]] of the [[year]], and comes between [[March]] and [[May]]. It is one of four months to have 30 [[day]]s.
April always begins on the same day of week as [[July]], and additionally, [[January]] in leap years. April always ends on the same day of the week as [[December]].
April's [[flower]]s are the [[Sweet Pea]] and [[Asteraceae|Daisy]]. Its [[birthstone]] is the [[diamond]]. The meaning of the diamond is innocence.
In short, for the final product I would like to search through the page elements for a title that matches a specific phrase I have entered, and then return the text element inside that page. If no match is found, it should return the three most similar titles instead. Is this possible, and can anyone help with it? I am flexible about the library used, so it doesn't have to be bs4. Thank you.
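To make the goal concrete, here is a minimal sketch of the lookup behaviour described above. It assumes the titles and texts have already been collected into a dictionary; the function name lookup, the sample dictionary, and the choice of difflib for similarity are all illustrative assumptions, not part of my existing code.

```python
import difflib


def lookup(pages, phrase, n=3):
    """Return {title: text} for an exact title match, else the n closest titles.

    `pages` is assumed to be a dict mapping page titles to page text.
    """
    if phrase in pages:
        return {phrase: pages[phrase]}
    # cutoff=0.0 keeps even weak matches so that n results come back
    # whenever the dictionary holds at least n titles.
    close = difflib.get_close_matches(phrase, list(pages), n=n, cutoff=0.0)
    return {title: pages[title] for title in close}


# Hypothetical stand-in for the real data extracted from wiki.xml.
pages = {
    "April": "April is the fourth month of the year.",
    "August": "August is the eighth month of the year.",
    "Art": "Art is a creative activity.",
    "March": "March is the third month of the year.",
}

exact = lookup(pages, "April")
fuzzy = lookup(pages, "Apirl")
print(exact)
print(list(fuzzy))
```

The similarity here is plain character-level matching on titles, which may or may not be good enough for real search phrases.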
EDIT:
I've just found out that if I remove recursive=False from the above code, it returns 1 rather than 0. I have no idea why.
EDIT:
I have also tried the code below, but it too returns 0. It also shows what I would like for the final product: everything in a dictionary.
import xml.etree.ElementTree as ET
def get_data():
    tree = ET.parse(open("wiki.xml"))
    root = tree.getroot()
    results = {}
    for title in root.findall('./page/title') and text in root.findall('./page/revision/text'):
        results[title] = text
    return results
r = get_data()
print(len(r))
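One thing that may be relevant here: the mediawiki root element declares a default namespace, and ElementTree only matches tags whose names are qualified with that namespace. The sketch below illustrates this on a small inline sample instead of wiki.xml; the SAMPLE string and the MW constant are assumptions made for the example, and get_data is reworked to take the root as a parameter.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the xmlns attribute of the <mediawiki> element,
# written in the {uri}tag form that ElementTree uses internally.
MW = "{http://www.mediawiki.org/xml/export-0.10/}"

# Small inline stand-in for wiki.xml (hypothetical content).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>April</title>
    <revision><text xml:space="preserve">Fourth month.</text></revision>
  </page>
  <page>
    <title>May</title>
    <revision><text xml:space="preserve">Fifth month.</text></revision>
  </page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)

# Unqualified names do not match elements that live in a default namespace.
unqualified = root.findall("./page/title")
print(len(unqualified))  # 0


def get_data(root):
    """Build {title: text} by walking each page once, so pairs stay aligned."""
    results = {}
    for page in root.findall(f"{MW}page"):
        title = page.findtext(f"{MW}title")
        text = page.findtext(f"{MW}revision/{MW}text")
        results[title] = text
    return results


data = get_data(root)
print(len(data))  # 2
```

Looking up the title and text from the same page element avoids having to iterate over two separate result lists at once.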
EDIT:
I have just tried some code on the XML file below...
<vehicles>
<car name="BMW">
<model>850 CSI</model>
<speed>1000</speed>
</car>
<car name="Mercedes">
<model>SL65</model>
<speed>900</speed>
</car>
<car name="Jaguar">
<model>EV400</model>
<speed>850</speed>
</car>
<car name="Ferrari">
<model>Enzo</model>
<speed>2</speed>
</car>
</vehicles>
This is the code I used...
from bs4 import BeautifulSoup
def get_data():
    soup = BeautifulSoup(open('test.xml'), 'lxml')
    count = 0
    for text in soup.select("vehicles car model"):
        count += 1
    return count
r = get_data()
print(r)
This script returned 4, which is the correct number. However, when I change vehicles car model to page revision text and try it on the wiki.xml file, it does not work and still returns 1. Note: the wiki file contains more text elements than I have time to count myself, so 1 is definitely incorrect.
EDIT:
This is the code I have been trying to use for parsing the file...
def parser(file_name="wiki.xml",save_to="weboffline.csv",url='http://www.mediawiki.org/xml/export-0.10/'):
    doc = tree.parse(file_name)
    titles = []
    texts = []
    for title in doc.findall('.//mediawiki{'+url+'}//page//title'):
        titles.append(title)
    for text in doc.findall('.//mediawiki{'+url+'}//page//revision//text'):
        texts.append(text)
    with open(save_to, mode='w') as file:
        writer = csv.writer(file)
        writer.writerow(['TITLES', 'TEXT'])
        for items in zip(titles,texts):
            writer.writerow(items)
However, the CSV file this produces contains only the header row TITLES,TEXT. Does anyone have a solution?
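For reference, this is a rough sketch of the kind of result I am after, written with ElementTree's iterparse so that a large file does not have to be loaded in one go. It is an assumption on my part that streaming is appropriate here; the pages_to_csv name, the MW constant, and the inline SAMPLE data are made up for the example, and each tag is written in the {namespace}tag form rather than with the path strings used above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace URI from the <mediawiki> root, in ElementTree's {uri}tag form.
MW = "{http://www.mediawiki.org/xml/export-0.10/}"


def pages_to_csv(source, out):
    """Stream (title, text) rows from a MediaWiki export into a CSV file object.

    `source` is a filename or binary file object; `out` is a text file object.
    Returns the number of pages written.
    """
    writer = csv.writer(out)
    writer.writerow(["TITLES", "TEXT"])
    rows = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == f"{MW}page":
            title = elem.findtext(f"{MW}title", default="")
            text = elem.findtext(f"{MW}revision/{MW}text", default="")
            writer.writerow([title, text])
            rows += 1
            elem.clear()  # drop the finished page's children to save memory
    return rows


# Small inline stand-in for wiki.xml (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>April</title><revision><text>Fourth month.</text></revision></page>
  <page><title>May</title><revision><text>Fifth month.</text></revision></page>
</mediawiki>"""

out = io.StringIO()
count = pages_to_csv(io.BytesIO(SAMPLE), out)
lines = out.getvalue().splitlines()
print(count)
print(lines)
```

With a real file, the output would presumably be opened as open(save_to, "w", newline="", encoding="utf-8") and passed in place of the StringIO object.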