
I'm trying to retrieve pageview info for a page that isn't retrieved, while other pages work fine. I get this error:

File "<unknown>", line 1
    article =='L'amica_geniale_ (serie_di_romanzi )'
                 ^
SyntaxError: invalid syntax

But there is no whitespace in the text. The page is: https://it.wikipedia.org/wiki/L%27amica_geniale_(serie_di_romanzi)
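(That File "<unknown>", line 1 traceback is the shape pandas produces when it parses an expression string, for example in df.query() or pd.eval(): the apostrophe in L'amica closes the string literal early. A minimal sketch of that idea, with a hypothetical frame that is not part of my real code; passing the title through a local variable with @ avoids embedding the quote in the expression at all:)

import pandas as pd

# Hypothetical frame with an 'article' column, filtered with query()
df = pd.DataFrame({'article': ["L'amica_geniale_(serie_di_romanzi)", "Random"],
                   'views': [499, 30]})

name = "L'amica_geniale_(serie_di_romanzi)"
# df.query("article == '{}'".format(name)) would raise the SyntaxError above,
# because the apostrophe terminates the quoted string early.
subset = df.query("article == @name")  # the @ reference sidesteps the quoting problem
print(subset)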

The code is:

import requests
import pandas as pd
import datetime

start_date = "2005/01/01"
headers = {
    'User-Agent': 'Mozilla/5.0'
}


def wikimedia_request(page_name, start_date, end_date = None):

    sdate = start_date.split("/")
    sdate = ''.join(sdate)

    if end_date == None:
        end_date = datetime.datetime.now()
        edate = end_date.strftime("%Y%m%d")

    r = requests.get(
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/all-agents/{}/daily/{}/{}".format(page_name,sdate, edate),
        headers=headers
    )
    r.raise_for_status()  # raises exception when not a 2xx response
    result = r.json()
    df = pd.DataFrame(result['items'])
    df['timestamp'] = [i[:-2] for i in df.timestamp]
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.set_index('timestamp', inplace = True)


    return df[['article', 'views']]


df = wikimedia_request("Random", start_date)

names = ["L'amica geniale"]

dfs = pd.concat([wikimedia_request(x, start_date) for x in names])

And the code works for every other page I tried. I'm thinking it might be something to do with the apostrophe.
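(If the apostrophe in the URL really were the problem, one way to rule it out is to percent-encode the title before formatting it into the REST path; urllib.parse.quote with safe='' also encodes parentheses and any slashes. This is only a sketch to test the hypothesis, not my actual code:)

from urllib.parse import quote

title = "L'amica_geniale_(serie_di_romanzi)"
# Encode everything, including the apostrophe and parentheses
encoded = quote(title, safe='')
url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "it.wikipedia.org/all-access/all-agents/{}/daily/20050101/20220311".format(encoded))
print(url)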

1 Answer


Pay attention to which URL you are using: there's a difference between 'it.wikipedia.org' and 'en.wikipedia.org'.

But it works just fine when using the correct URL. You could do something like this to account for it:

import requests
import pandas as pd
import datetime

start_date = "2005/01/01"
headers = {
    'User-Agent': 'Mozilla/5.0'
}


def wikimedia_request(page_name, start_date, end_date=None):
    # Build the date strings the API expects (YYYYMMDD)
    sdate = ''.join(start_date.split("/"))

    # Default the end date to today; if an end date (datetime) is given,
    # format it the same way
    if end_date is None:
        end_date = datetime.datetime.now()
    edate = end_date.strftime("%Y%m%d")

    try:
        # Try the English Wikipedia project first
        lang = 'en'
        url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/{}.wikipedia.org/all-access/all-agents/{}/daily/{}/{}".format(lang, page_name, sdate, edate)
        r = requests.get(url, headers=headers)
        r.raise_for_status()  # raises exception when not a 2xx response
    except requests.exceptions.HTTPError:
        # Fall back to Italian Wikipedia if the article isn't found on en
        lang = 'it'
        url = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/{}.wikipedia.org/all-access/all-agents/{}/daily/{}/{}".format(lang, page_name, sdate, edate)
        r = requests.get(url, headers=headers)
        r.raise_for_status()  # raises exception when not a 2xx response

    result = r.json()
    df = pd.DataFrame(result['items'])
    # Drop the trailing "00" hour from the timestamp and index by date
    df['timestamp'] = [i[:-2] for i in df.timestamp]
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.set_index('timestamp', inplace=True)

    return df[['article', 'views']]


#df = wikimedia_request(name="Random", start_date)

names = ["L'amica geniale_(serie_di_romanzi)", "L'amica geniale"]

dfs = pd.concat([wikimedia_request(x, start_date) for x in names])

Output:

print(dfs)
                                       article  views
timestamp                                            
2018-11-21  L'amica_geniale_(serie_di_romanzi)    499
2018-11-22  L'amica_geniale_(serie_di_romanzi)    909
2018-11-23  L'amica_geniale_(serie_di_romanzi)    739
2018-11-24  L'amica_geniale_(serie_di_romanzi)    696
2018-11-25  L'amica_geniale_(serie_di_romanzi)   1449
                                       ...    ...
2022-03-06                     L'amica_geniale     30
2022-03-07                     L'amica_geniale     24
2022-03-08                     L'amica_geniale     15
2022-03-09                     L'amica_geniale     28
2022-03-10                     L'amica_geniale     18

[3499 rows x 2 columns]
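A note on the fallback: catching the 404 and retrying on it.wikipedia.org is convenient, but it hides genuine errors behind the retry and makes two requests for every Italian-only title. If you already know which project each title lives on, you can pass it explicitly; this variant (the pageviews helper name and the lang argument are mine, not from the code above) is just a sketch of that design:

import requests
import pandas as pd
import datetime

headers = {'User-Agent': 'Mozilla/5.0'}

def pageviews(page_name, lang, sdate, edate):
    # Caller names the project explicitly instead of relying on an HTTPError fallback
    url = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
           "{}.wikipedia.org/all-access/all-agents/{}/daily/{}/{}".format(lang, page_name, sdate, edate))
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    df = pd.DataFrame(r.json()['items'])
    df['timestamp'] = pd.to_datetime(df['timestamp'].str[:-2])
    return df.set_index('timestamp')[['article', 'views']]

edate = datetime.datetime.now().strftime("%Y%m%d")
dfs = pd.concat([
    pageviews("L'amica_geniale_(serie_di_romanzi)", 'it', "20050101", edate),
    pageviews("L'amica_geniale", 'it', "20050101", edate),
])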
chitown88
  • Already done, it doesn't work unfortunately. Says requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/it.wikipedia.org/all-access/all-agents/L'amica_geniale_(serie_di_romanzi%20)/daily/20050101/20220311 – Idkwhatnomeis Mar 11 '22 at 14:02
  • Post a [MRCE](https://stackoverflow.com/help/minimal-reproducible-example) – Sören Mar 11 '22 at 14:04
  • the url is wrong: should be https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/it.wikipedia.org/all-access/all-agents/L'amica_geniale_(serie_di_romanzi)/daily/20050101/20220311 – chitown88 Mar 11 '22 at 14:06
  • Hi, thanks. I shared the full code now if you want to have a look. The url is right for every other page I tried that doesn't contain any apostrophe. I tried your way and it still doesn't work – Idkwhatnomeis Mar 11 '22 at 15:56
  • it works just fine. – chitown88 Mar 11 '22 at 16:25