I want to search Wikipedia for a list of keywords and extract the text of the matching pages, so that I can compute tf-idf features for a text classification program. I am looping through a pandas DataFrame that holds the keywords and looking up each one with the wikipedia package, but I get an error saying the page does not exist. Here is my code:

import wikipedia
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df_wiki_pages=pd.read_csv(r'C:\\Users\\jason\\Downloads\\Categories.csv',usecols=[0])

df_wiki_pages = df_wiki_pages.dropna()
print(df_wiki_pages)
wikipages = []
tokenized_texts = []
for index, row in df_wiki_pages.iterrows():
   currentRow = row['Categories']
   print("Now testing: "+ currentRow)
   wiki = wikipedia.page('currentRow')

Once the loop reaches the keyword "Customer_advocacy", it raises this error: PageError: Page id "customer advocate" does not match any pages. Try another id!

I do not understand why it searches for 'customer advocate' when my query is 'Customer_advocacy'. The page for 'Customer_advocacy' exists, while the page for 'customer advocate' does not, so why does the library change the query on its own? Am I doing something wrong in my query?

1 Answer

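Try setting the auto_suggest flag to False. By default, wikipedia.page first runs the title through Wikipedia's search suggestions and may load the suggested title instead of the one you passed, which is how 'Customer_advocacy' turned into 'customer advocate':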

wiki = wikipedia.page(currentRow, auto_suggest=False)

If we try this on the problematic title, "Customer advocacy", it works:

import wikipedia

wiki = wikipedia.page("Customer advocacy", auto_suggest=False)
print(wiki) # <WikipediaPage 'Customer advocacy'>

Also, your code currently passes the literal string 'currentRow' to wikipedia.page rather than the variable currentRow. I assume that is a typo.
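
If some keywords in your CSV still fail (a missing page, or a title that leads to a disambiguation page), you may not want the whole loop to stop. Here is a minimal sketch of one way to handle that. The function name fetch_texts and its fetch_page parameter are my own, not part of the wikipedia package; the only library calls it relies on are wikipedia.page(title, auto_suggest=False) and the page's .content attribute. The broad except is a simplification; the specific errors you would most likely see are wikipedia.exceptions.PageError and wikipedia.exceptions.DisambiguationError.

```python
def fetch_texts(titles, fetch_page=None):
    """Fetch page text for each title; return (texts, failed).

    texts  : dict mapping title -> page text
    failed : list of (title, exception class name) for titles that errored

    fetch_page is a hypothetical hook (not part of the wikipedia package)
    so the lookup can be swapped out, e.g. for offline testing.
    """
    if fetch_page is None:
        # Third-party package: pip install wikipedia
        import wikipedia

        def fetch_page(title):
            # auto_suggest=False keeps the title exactly as given.
            return wikipedia.page(title, auto_suggest=False)

    texts, failed = {}, []
    for title in titles:
        try:
            texts[title] = fetch_page(title).content
        except Exception as exc:
            # Simplification: record any failure and move on. In practice
            # this is usually PageError or DisambiguationError.
            failed.append((title, type(exc).__name__))
    return texts, failed
```

With your DataFrame, something like texts, failed = fetch_texts(df_wiki_pages['Categories']) should give you the page texts to feed into your vectorizer, plus a list of keywords to check by hand.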

BrownieInMotion