0

I am scraping reviews from a website (something I've done before without problems), but now, when a review includes an image, an additional span tag containing the reviewer's name is scraped as well. So, if the review includes an image, I get the reviewer's name twice. None of the other content (review title, review content, etc.) is scraped twice.

I see the additional tags in the output of: soup = bs(driver.page_source, 'html.parser')

I was expecting one reviewer name per review because there is only one review by that user. As a result, I am not able to create a file of reviews, because the columns (reviewer name, review title, review content, etc.) are not the same length.

What is the best way to ignore the additional tags?

<div class="a-profile-avatar"><img class="a-lazy-loaded" data-src="https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/2dd427f5-f954-4dba-8b20-98be896df084._CR83,0,333,333_SX48_.jpg" src="https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/grey-pixel.gif"/><noscript><img src="https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/2dd427f5-f954-4dba-8b20-98be896df084._CR83,0,333,333_SX48_.jpg"/></noscript></div></div><div class="a-profile-content"><span class="a-profile-name">Just Jo</span></div></a></div><div class="a-row"><a class="a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold" data-hook="review-title" href="/gp/customer-reviews/R15O5KMSCX5H4H/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&amp;ASIN=B08GJK3KW9"><i class="a-icon a-icon-star a-star-4 review-rating" data-hook="review-star-rating"><span class="a-icon-alt">4.0 out of 5 stars</span></i><span class="a-letter-space"></span>
<span>Seems like a good case</span>
</a></div><span class="a-size-base a-color-secondary review-date" data-hook="review-date">Reviewed in the United States  on July 23, 2021</span><div class="a-row a-spacing-mini review-data review-format-strip"><a class="a-link-normal" href="/gp/help/customer/display.html/ref=cm_cr_dp_d_rvw_avp?nodeId=G75XTB7MBMBTXP6W" rel="noopener" target="_blank"><span class="a-size-mini a-color-state a-text-bold" data-hook="avp-badge">Verified Purchase</span></a></div><div class="a-row a-spacing-small review-data"><span class="a-size-base review-text review-text-content" data-hook="review-body">
<span>I got this so my daughter could have a place to keep a toothbrush, bands, and her aligners during lunch at school. Yes, it's much larger than her normal carry case meant just for the aligners, it has to be if you want it to fit a full-sized toothbrush. The magnetic closers seem sturdy and I feel confident it will stay closed in her lunch tote. Will try to update after she's had a chance to use it for a bit.</span>
</span></div><div class="a-popover-preload" id="a-popover-R15O5KMSCX5H4H_gallerySection_main">
<div class="a-section cr-lightbox-popover-container" data-hook="image-popover" id="R15O5KMSCX5H4H_image_popover">
<div class="cr-lightbox-image-viewer">
<div class="cr-lightbox-main-image-container">
<img alt="Customer image" class="cr-lightbox-main-image" src="https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif"/>
</div>
<div class="cr-lightbox-navigator-container cr-lightbox-navigator-container__back">
<div class="cr-lightbox-navigator-button cr-lightbox-navigator-button__back">
</div>
</div>
<div class="cr-lightbox-navigator-container cr-lightbox-navigator-container__next">
<div class="cr-lightbox-navigator-button cr-lightbox-navigator-button__next">
</div>
</div>
</div>
<div class="a-section cr-lightbox-review-information">
<div class="a-section a-spacing-mini cr-review-stars-and-title">
<div class="a-row a-spacing-mini">
<a class="a-profile cr-lightbox-customer-profile" data-a-size="small" href="/gp/profile/amzn1.account.AHQHUGP5H3BDJ4SBWS6T34N7F2XA/ref=cm_cr_arp_d_gw_pop?ie=UTF8"><div aria-hidden="true" class="a-profile-avatar-wrapper"><div class="a-profile-avatar"><img class="" data-src="https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/2dd427f5-f954-4dba-8b20-98be896df084._CR83,0,333,333_SX48_.jpg" src="https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/2dd427f5-f954-4dba-8b20-98be896df084._CR83,0,333,333_SX48_.jpg" style=""/><noscript><img src="https://images-na.ssl-images-amazon.com/images/S/amazon-avatars-global/2dd427f5-f954-4dba-8b20-98be896df084._CR83,0,333,333_SX48_.jpg"/></noscript></div></div><div class="a-profile-content"><span class="a-profile-name">Just Jo</span></div></a>
</div>
<i class="a-icon a-icon-star a-star-4 cr-lightbox-review-rating"><span class="a-icon-alt">4.0 out of 5 stars</span></i>
<span class="a-size-base cr-lightbox-review-title a-text-bold">
                Seems like a good case
            </span>
<br/>
<span class="a-size-small a-color-secondary cr-lightbox-review-origin">
Ajeet Verma
Jon
  • Can you provide an example of HTML code for a review that contains an additional span with a duplicated reviewer's name? – ands Jul 09 '23 at 16:45
  • the site is saying that i'm not allowed to post images :( – Jon Jul 09 '23 at 20:18
  • You don't have to post images, just HTML code as text. – ands Jul 09 '23 at 20:41
  • i've added some code where you can see the span class a-profile-name "Just Jo" twice. thx! – Jon Jul 10 '23 at 00:01
  • Are you sure you copied the code correctly because it has a lot of unmatched (unopened and unclosed) HTML tags? (You can check with the [W3C Markup Validator](https://validator.w3.org/#validate_by_input) by [filtering error messages](https://i.imgur.com/xWwFjkN.png), so it shows only errors about unmatched tags.) I understand that you probably have no control over the bad code in the page that you are scraping, but because there are so many errors, I am just checking that you copied it correctly. If you are scraping a publicly available webpage, could you provide the URL? – ands Jul 10 '23 at 16:28
  • oh sorry, this was the soup object output. not sure if it's the same as html code. – Jon Jul 10 '23 at 20:55
  • That should have been OK, never mind, does my answer solve your problem? – ands Jul 10 '23 at 21:45

2 Answers

1

I figured out which page you are scraping. It is not clear from your question how you are scraping it, but the problem is that Amazon duplicates the content of each review in a div element with class a-popover-preload, which I assume is used before the main review information loads. The whole review (name, rating, text, place and date) is actually duplicated, but you only get the reviewer's name twice because both spans that contain the name (the original and the duplicate) have the same class, whereas the other duplicated elements have slightly different classes. The solution is to select only one span (the first one) inside each review. Since you haven't added any code to your question, here is how I scraped the review information without duplication:

from selenium import webdriver
import bs4

url = 'https://www.amazon.com/OrthoKey-OrthoPod-Aligners-Toothbrush-Toothpaste/dp/B08GJK3KW9'

# Load the page in a real browser and parse the rendered HTML.
driver = webdriver.Chrome()
driver.get(url)
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Each review has its own container; pick the one written by 'Just Jo'.
reviews = soup.select('.review')
jo_review = [r for r in reviews if 'Just Jo' in r.get_text()][0]

review_container = jo_review
# select_one() returns only the first match, so the duplicated name span
# inside the a-popover-preload div is ignored.
reviewer_name_span = review_container.select_one('span.a-profile-name')
reviewer_name = reviewer_name_span.get_text(strip=True)
review_rating_span = review_container.select_one('i.review-rating > span')
review_rating = review_rating_span.get_text(strip=True)
# The title link contains several spans; keep the first non-empty one.
review_title_spans = review_container.select('a.review-title > span')
review_title_span = [span for span in review_title_spans if
                     span.get_text(strip=True) != ''][0]
review_title = review_title_span.get_text(strip=True)
review_location_and_date_span = review_container.select_one('span.review-date')
review_location_and_date = review_location_and_date_span.get_text(strip=True)
review_text_span = review_container.select_one('div.review-text-content > span')
review_text = review_text_span.get_text(strip=True)

print(reviewer_name)
print(review_rating)
print(review_title)
print(review_location_and_date)
print(review_text)
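
The same idea can be extended to every review on the page, which also addresses the mismatched column lengths mentioned in the question. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical, heavily trimmed fragment modelled on the snippet in the question (the reviewer "Alice" and her review are invented for illustration), and the rows list is an assumed way of collecting the results, not something taken from the original code.

```python
import bs4

# Hypothetical markup: the second review has an image, so its reviewer
# name is repeated inside the a-popover-preload div.
SAMPLE_HTML = """
<div class="review">
  <span class="a-profile-name">Alice</span>
  <a class="review-title"><span>Works well</span></a>
</div>
<div class="review">
  <span class="a-profile-name">Just Jo</span>
  <a class="review-title"><span>Seems like a good case</span></a>
  <div class="a-popover-preload">
    <span class="a-profile-name">Just Jo</span>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# A page-wide select() returns every match, including the popover copy.
all_names = [s.get_text(strip=True) for s in soup.select("span.a-profile-name")]

# Building one record per review container keeps all fields aligned;
# select_one() takes only the first match inside each container.
rows = []
for review in soup.select("div.review"):
    name = review.select_one("span.a-profile-name")
    title = review.select_one("a.review-title > span")
    rows.append({
        "name": name.get_text(strip=True) if name else None,
        "title": title.get_text(strip=True) if title else None,
    })

print(len(all_names), len(rows))
```

Because every record is built from a single container, each review contributes exactly one name and one title, so the resulting columns should always have the same length even when a field is missing.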

P.S. When copying HTML code, it is best to use the Copy element option in Developer Tools. If you are using Beautiful Soup, you can also get the code of an element (review_container in the code above) with print(element.prettify()).
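
As a small illustration of the prettify() tip, here is a sketch that parses a short fragment taken from the question and prints one element with indentation. The variable names html, element and pretty are assumptions made for this example.

```python
import bs4

# A short fragment copied from the HTML in the question.
html = '<div class="a-profile-content"><span class="a-profile-name">Just Jo</span></div>'

element = bs4.BeautifulSoup(html, "html.parser").select_one("div.a-profile-content")

# prettify() returns the element's markup as an indented string,
# which is easier to read and to paste into a question.
pretty = element.prettify()
print(pretty)
```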

ands
  • so originally I was scraping the "a-profile-name" span class but I think only the reviews with images have a duplicate span. i'm not the best at coding so i'll have to check out what you provided. – Jon Jul 10 '23 at 23:49
0

How about using a CSS selector?

reviewer_names = [tag.text for tag in soup.select('span.reviewer-name')]
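
Note that the markup in the question uses the class a-profile-name rather than reviewer-name, and that a page-wide selector still matches the copy inside the a-popover-preload div. A possible variant, sketched here on a hypothetical fragment (SAMPLE_HTML is an assumption modelled on the question's snippet), filters out any span that has such a div as an ancestor:

```python
import bs4

# Hypothetical fragment: one review whose name is repeated in the popover.
SAMPLE_HTML = """
<div class="review">
  <span class="a-profile-name">Just Jo</span>
  <div class="a-popover-preload">
    <span class="a-profile-name">Just Jo</span>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only name spans that are not nested inside a popover div.
reviewer_names = [
    tag.get_text(strip=True)
    for tag in soup.select("span.a-profile-name")
    if tag.find_parent("div", class_="a-popover-preload") is None
]

print(reviewer_names)
```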
Norhther