I've just started using BeautifulSoup and ran into an obstacle at the very beginning. I looked through similar posts but didn't find a solution to my specific problem, or perhaps there is something fundamental I’m not understanding. My goal is to extract Japanese words with their English translations and examples from this page:
https://iknow.jp/courses/566921
and save them in a DataFrame or a CSV file.
I am able to see the parsed output and the content of some tags, but whenever I try requesting something with a class I'm interested in, I get no results. First I’d like to get a list of the Japanese words, and I thought I should be able to do it with:
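import urllib.request  # a bare "import urllib" does not load the request submodule
from bs4 import BeautifulSoup

url = ["https://iknow.jp/courses/566921"]
data = []
for pg in url:
    # Download the page and parse the HTML that the server returns
    r = urllib.request.urlopen(pg)
    soup = BeautifulSoup(r, "html.parser")

# Look for the links that should hold the Japanese words
cues = soup.find_all("a", {"class": "cue"})
print(cues)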
But I get an empty result. The same happens when I search for the response field:
responseList = soup.find_all("p", attrs={"class": "response"})
for word in responseList:
    print(word)
I tried moving down the tree by finding children but couldn’t get to the text I want. I will be grateful for your help. Here are the fields I'm trying to extract:
After great help from jxpython, I've now stumbled upon a new challenge (perhaps this should be a new thread, but it's closely related, so maybe it's OK here). My goal is to create a dataframe or a CSV file in which each row contains a Japanese word, its translation, and examples with transliterations. With the lists created using:
driver.find_elements_by_class_name()
driver.find_elements_by_xpath()
I get lists with different numbers of elements, so it's not possible to easily create a dataframe.
# len(cues)             100
# len(responses)        100
# len(transliterations) 279  (a strange number, because some words don't have transliterations)
# len(texts)            200
# len(translations)     200
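One way to read these counts is sketched below. It assumes that every example sentence has exactly one transliteration, which is an assumption made for illustration and not something confirmed from the page; the variable names are hypothetical.

```python
# Counts reported above.
n_words = 100
n_sentences = 200
n_transliterations = 279

# Assumption: each of the 200 sentences contributes exactly one transliteration,
# so whatever is left over must belong to single words.
word_level = n_transliterations - n_sentences
words_without = n_words - word_level

print(word_level, words_without)
```

Under that assumption the flat list cannot be matched to the words by position, because nothing in it records which words were skipped.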
The transliterations list contains a mix of transliterations for single words and for sentences. I think that, to get the content for the first row of my dataframe, I would need to loop through the
<li class="item">
content (XPath: /html/body/div[2]/div/div/section/div/section/div/div/ul/li[1]?) and, for each one, extract the word with its translation, the sentences and the transliterations. I'm not sure whether this is the best approach, though.
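The per-item loop described above could look roughly like this. It is only a sketch: it uses the standard-library `xml.etree.ElementTree` on a small hand-written snippet so that it runs without a browser, rather than Selenium on the live page. The markup, the class names `sentence`, `transliteration`, `text` and `translation`, the second sample item, and the helpers `text_of` and `parse_item` are all assumptions for illustration; only `item`, `cue` and `response` are taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page markup (an assumption, not the real HTML).
SAMPLE = """
<ul>
  <li class="item">
    <a class="cue">行く</a>
    <span class="transliteration">いく</span>
    <p class="response">go</p>
    <div class="sentence">
      <p class="text">日曜日は図書館に行きます。</p>
      <p class="transliteration">にちようび は としょかん に いきます。</p>
      <p class="translation">I go to the library on Sundays.</p>
    </div>
  </li>
  <li class="item">
    <a class="cue">テレビ</a>
    <p class="response">television</p>
  </li>
</ul>
"""


def text_of(parent, path):
    """Return the stripped text of the first match under parent, or '' if absent."""
    node = parent.find(path)
    return (node.text or "").strip() if node is not None else ""


def parse_item(item):
    """Collect every field of one <li class="item"> into a single dict."""
    row = {
        "cue": text_of(item, "a[@class='cue']"),
        "transliteration": text_of(item, "span[@class='transliteration']"),
        "response": text_of(item, "p[@class='response']"),
    }
    # Sentence fields are looked up relative to each sentence block,
    # so they can never be confused with the word-level transliteration.
    for n, sentence in enumerate(item.findall("div[@class='sentence']"), start=1):
        row[f"text_{n}"] = text_of(sentence, "p[@class='text']")
        row[f"text_{n}_transliteration"] = text_of(sentence, "p[@class='transliteration']")
        row[f"text_{n}_translation"] = text_of(sentence, "p[@class='translation']")
    return row


root = ET.fromstring(SAMPLE.strip())
rows = [parse_item(li) for li in root.findall("li[@class='item']")]
for row in rows:
    print(row)
```

Because every lookup starts from the current item, a word with no transliteration simply gets an empty string and the rows stay aligned. With Selenium the same shape should apply by running the lookups on each `li` element instead of on `driver`.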
As an example, the information I would like to have in the first row of my dataframe (from the box highlighted in screenshot) is:
行く, いく, go, 日曜日は図書館に行きます。, にちようび は としょかん に いきます。, I go to the library on Sundays., 私は夏休みにプールに行った。, わたし は なつやすみ に プール に いった。, I went to the pool during summer vacation.
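A row like this could be assembled and written out roughly as follows, a minimal sketch using only the standard-library `csv` module. The record layout (a dict with a `sentences` list), the column names, the limit of two sentences per word and the helper `to_row` are assumptions for illustration; the sample values are the ones quoted above.

```python
import csv
import io

MAX_SENTENCES = 2  # assumption: at most two example sentences per word

# Fixed column layout: three word-level fields, then three fields per sentence.
COLUMNS = ["word", "transliteration", "translation"]
for n in range(1, MAX_SENTENCES + 1):
    COLUMNS += [f"sentence_{n}", f"sentence_{n}_transliteration", f"sentence_{n}_translation"]


def to_row(record):
    """Flatten one per-word record into a fixed-width list, padding gaps with ''."""
    row = [record["word"], record.get("transliteration", ""), record["translation"]]
    for text, translit, translation in record.get("sentences", [])[:MAX_SENTENCES]:
        row += [text, translit, translation]
    row += [""] * (len(COLUMNS) - len(row))
    return row


records = [
    {
        "word": "行く",
        "transliteration": "いく",
        "translation": "go",
        "sentences": [
            ("日曜日は図書館に行きます。", "にちようび は としょかん に いきます。",
             "I go to the library on Sundays."),
            ("私は夏休みにプールに行った。", "わたし は なつやすみ に プール に いった。",
             "I went to the pool during summer vacation."),
        ],
    },
]

# Written to an in-memory buffer here; a real file opened with
# newline="" and encoding="utf-8" would work the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(to_row(r) for r in records)
csv_text = buffer.getvalue()
print(csv_text)
```

Padding every row to the same width means that words with a missing transliteration or fewer sentences still line up under the right column headers.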