How do I use pywikibot.Page(site, title).text when the title has an unescaped apostrophe (')?

Question

I have a list of strings called cities, where each string is a city name that is also the title of a Wikivoyage page. For each city, I'm getting the page and then looking at its text content:

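import pywikibot

# graph is the existing graph object from my project (not shown here)
cities = [n["name"] for n in graph.nodes.match("City")]
site = pywikibot.Site(code="en", fam="wikivoyage")  # create the site once, outside the loop
for city in cities:
    page = pywikibot.Page(site, city)
    text = page.text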

One of the cities in my list is L'Aquila, and it was not returning anything for text (whereas other entries were). I figured that was because of the ' in the name, so I used re.sub to escape the ' and passed in that result instead. This gives me what I expected:

cities = [(n["name"]) for n in graph.nodes.match("City")]
city = "L'Aquila"
altered_city = re.sub("'",  "\'", city)
print(altered_city)
site = pywikibot.Site(code="en", fam="wikivoyage")
page = pywikibot.Page(site, altered_city)
print(page)
print(page.text)

Result:

[[wikivoyage:en:L'Aquila]]
{{pagebanner|Pagebanner default.jpg}}
'''L'Aquila''' is the capital of the province of the same name in the region of [[Abruzzo]] in [[Italy]] and is located in the northern part of the..

But the issue is I don't want to hard-code the city name, I want to use the strings from my list. And when I pass this in, it does not give me any results for page.text:

cities = [(n["name"]) for n in graph.nodes.match("City")]
city_from_list = cities[0]
print(city_from_list)
print(type(city_from_list))
altered_city = re.sub("'",  "\'", city_from_list)
site = pywikibot.Site(code="en", fam="wikivoyage")
page = pywikibot.Page(site, altered_city)
print(page)
print(page.text)

Result:

L'Aquila
<class 'str'>
[[wikivoyage:en:L'Aquila]]

I printed out the value and type for the city element I'm getting from the list and it is a String, so I have no idea why it worked above but not here. How are these different?

score 1 · Answer 1 · answered Feb 12 '21 at 17:17

Pywikibot works for L'Aquila as expected, e.g.:

>>> import pywikibot
>>> site = pywikibot.Site('wikivoyage:en')
>>> page = pywikibot.Page(site, "L'Aquila")
>>> print(page.text[:100])
{{pagebanner|Pagebanner default.jpg}}
'''L'Aquila''' is the capital of the province of the same name

It seems your cities[0] is different from "L'Aquila". Note that page.text always gives a str and never returns None. You can check whether a page exists with the exists() method:

>>> page = pywikibot.Page(site, "L'Aquila")
>>> page.exists()
True
>>>
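
Since page.text is always a str, an empty result and a missing page can look the same. Below is a minimal sketch of a guard that tells them apart. The helper read_text and the FakePage stand-in are hypothetical names used only for illustration; FakePage merely mimics the text attribute and exists() method so the sketch runs without network access.

```python
def read_text(page):
    """Return the page's wikitext, or None if the page does not exist."""
    # page.text is always a str, so check exists() first to tell a
    # missing page apart from a page whose text is empty.
    if not page.exists():
        return None
    return page.text


class FakePage:
    """Hypothetical stand-in for pywikibot.Page (illustration only)."""

    def __init__(self, text, exists):
        self.text = text
        self._exists = exists

    def exists(self):
        return self._exists


print(read_text(FakePage("", False)))            # None
print(read_text(FakePage("some wikitext", True)))  # some wikitext
```

With a real pywikibot.Page the same function should apply unchanged, assuming the page object exposes exists() and text as shown in the session above.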

AXO · Accepted Answer · 2021-02-11T06:29:42.593

re.sub("'", "\'", city) does not do anything:

>>> city = "L'Aquila"
>>> re.sub("'",  "\'", city)
"L'Aquila"
>>> city == re.sub("'",  "\'", city)
True

Python treats "\'" as "'". See the table at Lexical analysis # String and Bytes literals of the documentation.
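
As a small illustration of that rule (the variable names here are only for the example), a normal literal collapses the backslash-quote pair into a single character, while a raw literal keeps the backslash:

```python
plain = "\'"   # normal literal: the escape yields just an apostrophe
raw = r"\'"    # raw literal: backslash and apostrophe are both kept

print(plain == "'")           # True
print(len(plain), len(raw))   # 1 2
```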

I don't know why the second snippet is not working for you; it should. Maybe you just did not execute the last line. Even if page.text had returned None, print would have printed None. Try print(type(page.text)).

edited Feb 11 '21 at 06:29

answered Feb 11 '21 at 06:14

Oh ok, thanks! I definitely executed the last line, and printing the type shows me that it's `<class 'str'>`, so I guess it's returning an empty string. – sam_ur_ai Feb 12 '21 at 04:35
The hard-coded value and the list value aren't the same for some reason. I printed out their values and types, and they both show L'Aquila for the value and both are `<class 'str'>`. But when I compare them using `==` it returns False. – sam_ur_ai Feb 12 '21 at 04:38
@sam_ur_ai so the strings are not equal but look the same? I'd suggest calling the `.encode('unicode-escape')` method on them and comparing their Unicode sequences. – AXO Feb 12 '21 at 05:08
That solved the mystery, thank you so much! When I compared the unicode sequences I was able to see that the hard-coded string was `b"L'Aquila"` whereas the string I was pulling from the list was `b'L\\u2019Aquila'`. I was able to fix this by calling `city = city.replace(u"\u2019", "'")` on the city. – sam_ur_ai Feb 12 '21 at 18:55
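
The diagnosis and fix from the comments above can be sketched roughly as follows. The helper names (show_code_points, normalize_title) are hypothetical, and mapping U+2018 in addition to U+2019 is an assumption beyond what the thread reported; only U+2019 was observed in the asker's data.

```python
# Typographic quotes that render like an ASCII apostrophe.
# U+2019 is the character found in the thread; U+2018 is an assumed extra.
APOSTROPHE_LOOKALIKES = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
})


def show_code_points(s):
    """Expose non-ASCII characters as escape sequences for comparison."""
    return s.encode("unicode-escape")


def normalize_title(title):
    """Replace look-alike quotes with the ASCII apostrophe."""
    return title.translate(APOSTROPHE_LOOKALIKES)


from_list = "L\u2019Aquila"   # what the list contained
hard_coded = "L'Aquila"       # what was typed by hand

print(from_list == hard_coded)                   # False
print(show_code_points(from_list))               # b'L\\u2019Aquila'
print(show_code_points(hard_coded))              # b"L'Aquila"
print(normalize_title(from_list) == hard_coded)  # True
```

Normalizing each title before passing it to pywikibot.Page keeps the list as the source of names, so nothing has to be hard-coded.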
