How to parse raw html element in R or Python?

Question

For instance, on this page: https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/

I right-click, choose "Inspect", and find the element that I'm interested in:

<span id="productTitle" class="a-size-large product-title-word-break">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>

Here's the deal: I want to copy the entire element, tag and attributes included, not just the product title text "Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W". Can someone tell me how to do this with BeautifulSoup or rvest?

I am learning Python and R, and I tried to dig into it but couldn't get the raw result.

What have you tried? This is straightforward in both Python and R, and in fact it requires (slightly) *more* effort to obtain just the text than the entire tag, so I am confused as to what exactly the issue is. — Konrad Rudolph, Nov 02 '22 at 10:04

score 0 · Answer 1 · answered Nov 02 '22 at 06:57

Amazon will likely throw a CAPTCHA at you, but if you get past it, you can get what you want with:

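import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/Lexani-LXUHP-207-All-Season-Radial-Tire-245/dp/B07FFH8F9V/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# find() returns a Tag (or None if the id is missing); str(the_entire_thing) gives the raw HTML
the_entire_thing = soup.find(id='productTitle')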
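To try this without hitting Amazon at all, here is a minimal sketch that parses the snippet from the question as a static string. It assumes the built-in 'html.parser' backend instead of 'lxml' (so nothing extra needs installing), and the names raw_html and title_text are just illustrative:

```python
from bs4 import BeautifulSoup

# The element copied from the question, used as static input (no network request).
html = (
    '<span id="productTitle" class="a-size-large product-title-word-break">'
    '        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>'
)

# Assumption: the stdlib-backed "html.parser" is good enough for this snippet.
soup = BeautifulSoup(html, "html.parser")
tag = soup.find(id="productTitle")

# str() serializes the Tag back to markup: opening tag, attributes, text, closing tag.
raw_html = str(tag)

# get_text() is the extra step needed when only the text is wanted.
title_text = tag.get_text(strip=True)

print(raw_html)
print(title_text)
```

So the whole element is just str(tag); it is the text-only version that needs the additional get_text() call.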

score 0 · Answer 2 · answered Nov 02 '22 at 10:01

In R you can just convert the node to a character vector:

library(rvest)
html <- minimal_html('<span id="productTitle" class="a-size-large product-title-word-break">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>')
html_node <- html_element(html, "#productTitle") 
as.character(html_node)
#> [1] "<span id=\"productTitle\" class=\"a-size-large product-title-word-break\">        Lexani LXUHP-207 Performance Radial Tire - 245/45R18 100W       </span>"

Created on 2022-11-02 with reprex v2.0.2
