
I am writing a webpage scraper to collate sentences in Japanese. My source uses furigana, which appears as characters above kanji to indicate their pronunciation. I do not want this furigana to appear in my scraped sentences.

The website's source HTML looks something like this (source: https://www3.nhk.or.jp/news/easy/k10014010651000/k10014010651000.html):

<article class = "article-main">
<p>
<span class="colorB">16</span><span class="color4"><ruby>日<rt>にち</rt></ruby></span>
<span class="colorB">、</span><span class="colorL"><ruby>韓国<rt>かんこく</rt></ruby>
</span>
</p>
</article>

This renders the characters between the <rt> tags above the kanji: にち above 日 and かんこく above 韓国.

I currently scrape the article-main element and use get_attribute("innerText") to extract the article text, as follows:

element = browser.find_element(By.CLASS_NAME, "article-main")
article = element.get_attribute("innerText")
print(article)

However, this outputs the furigana after the kanji within the sentences, so I end up with output that looks like 16日にち、韓国かんこく instead of 16日、韓国. How can I remove the contents between the <rt> tags?

I have tried finding the elements with the tag name "rt" and replacing their text with "", as below:

element = browser.find_element(By.CLASS_NAME, "article-main")
html = element.get_attribute("innerHTML")
furigana = element.find_elements(By.TAG_NAME, "rt")
print(element.innerText.replace(furigana.innerText, ''))

But the WebElement object has no innerText attribute. What approach can I take to isolate and remove the rt elements using Python?

hpbristol
  • Try `browser.execute_script("arguments[0].remove()", furigana)`, then read `element.get_attribute("innerText")` – Unmitigated Mar 19 '23 at 21:05
  • Assuming I understood you correctly: `element = browser.find_element(By.CLASS_NAME, "article-main") html = element.get_attribute("innerHTML") print(html) furigana = element.find_elements(By.TAG_NAME, "rt") browser.execute_script("arguments[0].remove()", furigana) print(element.get_attribute("innerText")) `returns an error "Javascript error: arguments[0].remove is not a function" – hpbristol Mar 19 '23 at 21:13
  • Can you print out what `furigana` is before that? – Unmitigated Mar 19 '23 at 21:16
  • The result from the console looks like `[, – hpbristol Mar 19 '23 at 21:19
  • Ok, try `browser.execute_script("for(const el of arguments[0]) el.remove();", furigana)` – Unmitigated Mar 19 '23 at 21:20
  • `furigana = element.find_elements(By.TAG_NAME, "rt") browser.execute_script("for(const el of arguments) el.remove();", furigana) print(element.get_attribute("innerText"))` now returns the error "javascript error: el.remove is not a function" – hpbristol Mar 19 '23 at 21:22
  • Replace `arguments` with `arguments[0]`. I edited my comment above. – Unmitigated Mar 19 '23 at 21:22
  • That worked - thank you! May I ask for a brief explanation of the code please? – hpbristol Mar 19 '23 at 21:26
  • I've posted that as an answer now. Essentially, `furigana` is passed as the first argument to the script (`arguments[0]`). Then, we loop over each element in that array with `for...of`, using `.remove()` to delete each one. – Unmitigated Mar 19 '23 at 21:28

1 Answer


You can use JavaScript to remove each of the <rt> elements.

furigana = element.find_elements(By.TAG_NAME, "rt")
browser.execute_script("for (const el of arguments[0]) el.remove();", furigana)

After that, you can read the innerText of the element.

article = element.get_attribute("innerText")
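
If you would rather not modify the live page, a different approach (not the JavaScript method above) is to take the innerHTML string and drop the <rt> contents in Python. Below is a minimal sketch using only the standard library's html.parser. The names RubyTextStripper and strip_furigana are made up for illustration, and it assumes every <rt> has an explicit closing tag, as in the sample markup from the question. It also skips <rp>, the optional fallback parentheses some pages put around furigana.

```python
from html.parser import HTMLParser


class RubyTextStripper(HTMLParser):
    """Collect text content while skipping anything inside <rt> or <rp>."""

    # Hypothetical helper, not part of Selenium or the standard library.
    SKIPPED_TAGS = {"rt", "rp"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <rt> or <rp>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_furigana(html):
    """Return the text of an HTML fragment without its furigana."""
    parser = RubyTextStripper()
    parser.feed(html)
    parser.close()
    return parser.text()


# Sample markup taken from the question, with the tags closed properly.
sample = (
    '<span class="colorB">16</span>'
    '<span class="color4"><ruby>日<rt>にち</rt></ruby></span>'
    '<span class="colorB">、</span>'
    '<span class="colorL"><ruby>韓国<rt>かんこく</rt></ruby></span>'
)
print(strip_furigana(sample))  # prints 16日、韓国
```

With Selenium this would be called as strip_furigana(element.get_attribute("innerHTML")). Unlike innerText, this keeps whitespace exactly as it appears in the markup and does not insert line breaks between paragraphs, so the result may need some tidying.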
Unmitigated