How to scrape HTML from TXT and store all items to CSV?

Question

I am trying to export tag items from HTML stored in a TXT file. For some reason my code only takes the last item and exports it to the CSV; it won't scrape the other listed items. I'm not sure why. I have tried multiple solutions, but nothing has worked.

Here is my code...

import pandas as pd
from bs4 import BeautifulSoup
import schedule
import time
#import urllib.parse
import requests


baseurl = 'https://www.soxboxmtl.com'

dataset = []

with open(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.txt', "r") as f:

        
        soup = BeautifulSoup(f.read(), "html.parser")
        for imgurl in soup.find_all('img', class_='grid-item-image'):(imgurl['data-src'])
        for name in soup.find_all('div', class_='grid-title'):(name.text)    
        for link in soup.find_all('a', class_='grid-item-link'):(link['href'])  
        for price in soup.find_all('div', class_='product-price'):(price.text)
       
        dataset.append({'Field_01':(imgurl['data-src']),'Field_02':name.text,'Field_03':(baseurl + link['href']),'Field_04':price.text})
        
        print(dataset)

        df = pd.DataFrame(dataset).to_csv(r'/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/soxboxmtl2.csv', index = False)

Here is a sample of HTML data

<div class="grid-item hentry tag-paddle tag-brush tag-bristle tag-wide tag-detangle tag-kitsch tag-anti-frizz tag-black author-jill-kessner post-type-store-item article-index-45 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="625ef30d651884142d5a2dc2" id="thumb-kitsch-paddle-hair-brush">
    <a aria-label="Kitsch Paddle Hair Brush" class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush">
    </a>
    <figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
    <div class="grid-image-wrapper has-hover-img">
    <img alt="Screenshot 2022-04-19 at 1.31.04 PM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png" data-image-dimensions="1341x1335" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png"/>
    <img alt="Screenshot 2022-04-19 at 1.31.24 PM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png" data-image-dimensions="1338x1338" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390381627-ZJU6GL0JVR2AZG3FKM84/Screenshot+2022-04-19+at+1.31.24+PM.png"/>
    <div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
    <span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="625ef30d651884142d5a2dc2" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section class="grid-meta-wrapper" data-animation-role="content">
    <div class="grid-main-meta">
    <div class="grid-title" data-test="plp-grid-title">
            Kitsch Paddle Hair Brush
          </div>
    <div class="grid-prices" data-test="plp-grid-prices">
    <div class="product-price">
    CA$24.00
    </div>
    </div>
    </div>
    <div class="grid-meta-status" data-test="plp-grid-status">
    <div class="product-scarcity">
        Only 2 left in stock
      </div>
    </div>
    </section>
    </div>
    <div class="grid-item hentry tag-blanket tag-plush tag-cozy-plush tag-pj-salvage tag-embroidered tag-blush tag-pink tag-luxe-plush tag-luxe author-jill-kessner post-type-store-item article-index-46 sqs-product-quick-view-button-hover-area" data-controller="ProductListImageLoader" data-item-id="635031c65ac9872b4ba44f5a" id="thumb-pj-salvage-luxe-plush-embroidered-blanket-blush">
    <a aria-label="PJ Salvage Luxe Plush Embroidered Blanket - Blush" class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush">
    </a>
    <figure class="grid-image" data-animation-role="image" data-test="plp-grid-image">
    <div class="grid-image-wrapper has-hover-img">
    <img alt="Screenshot 2022-10-17 at 12.03.06 AM.png" class="grid-item-image grid-image-cover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png" data-image-dimensions="891x1340" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png"/>
    <img alt="Screenshot 2022-10-17 at 12.02.56 AM.png" class="grid-item-image grid-image-hover" data-image="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png" data-image-dimensions="890x1339" data-image-focal-point="0.5,0.5" data-load="false" data-src="https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200171128-41WP2X90CW820GH07IPH/Screenshot+2022-10-17+at+12.02.56+AM.png"/>
    <div class="list-quick-view-wrapper sqs-product-quick-view-button-wrapper">
    <span class="sqs-product-quick-view-button" data-group="5ec69b56a188e3129c377b33" data-id="635031c65ac9872b4ba44f5a" role="button" tabindex="0">Quick View</span>
    </div>
    </div>
    </figure>
    <section class="grid-meta-wrapper" data-animation-role="content">
    <div class="grid-main-meta">
    <div class="grid-title" data-test="plp-grid-title">
            PJ Salvage Luxe Plush Embroidered Blanket - Blush
          </div>
    <div class="grid-prices" data-test="plp-grid-prices">
    <div class="product-price">
    CA$118.00
    </div>
    </div>
    </div>
    <div class="grid-meta-status" data-test="plp-grid-status">
    <div class="product-scarcity">
        Only 1 left in stock
      </div>
    </div>

Could you provide a sample of the HTML data you're trying to parse? — tetraphobia, Jan 06 '23 at 02:02

tetraphobia · Accepted Answer · 2023-01-06T08:03:40.970

There are two problems with your current implementation:

Problem 1

Your loops do not actually do anything with the data that bs4 finds. The only thing adding data to your data set is the single call to dataset.append(), which results in the single line of data you experienced.
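
To make this concrete, here is a minimal sketch that uses plain Python lists as stand-ins for the bs4 results (the values are illustrative, not taken from the real page). A loop body that only evaluates an expression discards it, and the loop variable still holds the last item once the loop has finished:

```python
# Stand-ins for the results of soup.find_all(...); illustrative values only.
names = ["Brush", "Blanket", "Socks"]
prices = ["CA$24.00", "CA$118.00", "CA$12.00"]

dataset = []

# Each loop body only evaluates an expression and throws the result away.
for name in names: (name)
for price in prices: (price)

# After the loops, `name` and `price` still refer to the LAST items,
# so this single append produces exactly one row.
dataset.append({"Field_02": name, "Field_04": price})

print(dataset)  # [{'Field_02': 'Socks', 'Field_04': 'CA$12.00'}]
```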

Problem 2

Even if the loops were functional, the script would likely fail because pandas requires every column to have the same length when a DataFrame is built from lists. For example, there are more images than there are titles, so you will end up with columns of varying length.
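
A small sketch of the length mismatch, assuming one product with two images and one title (the file names are made up for illustration). Passing unequal-length lists straight to the DataFrame constructor raises a ValueError, while building the frame row-wise and transposing it pads the shorter column with missing values:

```python
import pandas as pd

# Illustrative data: two images but only one title, as in a single grid-item.
columns = {
    "Field_01": ["cover.png", "hover.png"],
    "Field_02": ["Kitsch Paddle Hair Brush"],
}

# Unequal-length lists cannot be turned into columns directly.
try:
    pd.DataFrame(columns)
    raised = False
except ValueError:
    raised = True

# Building row-wise (orient='index') and transposing pads the short column.
padded = pd.DataFrame.from_dict(columns, orient="index").transpose()

print(raised)
print(padded)
```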

Solution

Besides making sure that we're actually appending data correctly, we need to ensure that all columns are formatted correctly and consistently. Rather than searching for any and all information with no relation to each other, we instead search for all parent elements that contain the information relating to our needs.

We then iterate over the list of parent elements. Inside each iteration, we search only that parent element for usable data, then format it for use in a DataFrame. This DataFrame is appended to our list of DataFrames, which is concatenated into a single DataFrame once the iterations are done, and finally exported.

# Find all the grid-items first.
sections = soup.find_all('div', {'class': 'grid-item'}, recursive=True)

# We will append our formatted data to this list, then
# provide it to the DataFrame on creation
df_items = []

# Format and add the data from each grid-item to the DataFrame.
for section in sections:
    title = section.find('a', {'class': 'grid-item-link'})
    imgs = section.find_all('img')
    price = section.find('div', {'class': 'product-price'})

    data = {
        'Field_01': [img['data-src'] for img in imgs],
        'Field_02': [title['aria-label']],
        'Field_03': [baseurl + title['href']],
        'Field_04': [''.join(price.text.split())],
    }

    # DataFrames require all arrays to be the same length.
    # This automatically fills in any missing cells.
    df = pd.DataFrame.from_dict(data, orient='index')
    df = df.transpose()

    # Append the DataFrame to our list of DataFrames.
    df_items.append(df)

# Concatenate all dataframes.
result = pd.concat(df_items)

# Export
result.to_csv('data.csv', index=False)

Ok but I'm a little confused. What would the 2nd field (Field_02) look like? — sammyb62, Jan 06 '23 at 04:01
You can substitute the index of df with 'Field_02' and then assign it the list of items you want in it. I will update the post with a more complete solution. — tetraphobia, Jan 06 '23 at 04:57
Hello tetraphobia, thanks for your help. Your code works as well. I appreciate it. — sammyb62, Jan 17 '23 at 00:33

HedgeHog · Answer 2 · 2023-01-06T07:26:42.317

This happens because each for loop runs to completion but overwrites its variable on every iteration, so only the last value remains, and that value is what gets added to the dataset.

Recommendation: simplify by working from the container element with class grid-item, which holds all the information for one product. Iterate over these containers and add the data from each one to your dataset. This way you only need a single for loop, which is easier to control.

The following example uses CSS selectors, as I prefer to work with these:

...
soup = BeautifulSoup(f.read(), "html.parser")
for e in soup.select('.grid-item'):
    dataset.append({
        'Field_01':e.img.get('data-src'),
        'Field_02':e.select_one('.grid-title').get_text(strip=True),
        'Field_03':baseurl + e.a.get('href'),
        'Field_04':e.select_one('.product-price').get_text(strip=True)
    })

You can also use find_all() and find() instead. See get_text() and its parameters for stripping line breaks and surrounding whitespace.

for e in soup.find_all('div', class_='grid-item'):
    dataset.append({
        'Field_01':e.find('img', class_='grid-item-image').get('data-src'),
        'Field_02':e.find('div', class_='grid-title').get_text(strip=True),
        'Field_03':baseurl + e.find('a', class_='grid-item-link').get('href'),
        'Field_04':e.find('div', class_='product-price').get_text(strip=True)
    })

This will lead to:

Field_01	Field_02	Field_03	Field_04
https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1650390361257-FA4PYOB3KLXRT69ME502/Screenshot+2022-04-19+at+1.31.04+PM.png	Kitsch Paddle Hair Brush	https://www.soxboxmtl.com/home-bath-body/p/kitsch-paddle-hair-brush	CA$24.00
https://images.squarespace-cdn.com/content/v1/5eb9807914392e1510a400ed/1666200149369-QJ9BN6T3KE45I2H11K9Z/Screenshot+2022-10-17+at+12.03.06+AM.png	PJ Salvage Luxe Plush Embroidered Blanket - Blush	https://www.soxboxmtl.com/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush	CA$118.00
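
For a self-contained check of the single-loop approach, here is a minimal sketch that parses an inline HTML string instead of the saved TXT file and writes the CSV to an in-memory buffer. The trimmed markup and the example.com image URLs are stand-ins chosen for illustration, not the real page:

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

baseurl = "https://www.soxboxmtl.com"

# Trimmed-down, hypothetical stand-in for the saved page.
html = """
<div class="grid-item">
  <a class="grid-item-link" href="/home-bath-body/p/kitsch-paddle-hair-brush"></a>
  <img class="grid-item-image" data-src="https://example.com/brush.png"/>
  <div class="grid-title"> Kitsch Paddle Hair Brush </div>
  <div class="product-price"> CA$24.00 </div>
</div>
<div class="grid-item">
  <a class="grid-item-link" href="/home-bath-body/p/pj-salvage-luxe-plush-embroidered-blanket-blush"></a>
  <img class="grid-item-image" data-src="https://example.com/blanket.png"/>
  <div class="grid-title"> PJ Salvage Luxe Plush Embroidered Blanket - Blush </div>
  <div class="product-price"> CA$118.00 </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One dict per product container keeps every row the same shape.
dataset = []
for e in soup.select(".grid-item"):
    dataset.append({
        "Field_01": e.img.get("data-src"),
        "Field_02": e.select_one(".grid-title").get_text(strip=True),
        "Field_03": baseurl + e.a.get("href"),
        "Field_04": e.select_one(".product-price").get_text(strip=True),
    })

# Write to a buffer here; pass a file path to to_csv() to write to disk.
buffer = io.StringIO()
pd.DataFrame(dataset).to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```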

Hello HedgeHog, thanks for your help. Your code works well. I appreciate it. — sammyb62, Jan 17 '23 at 00:33
Hello HedgeHog, just a note. I do notice that for some reason the second piece of code you suggested only returns a fraction of the results. In one case I have 1600 products, but the returned result is 29. Any idea why this happens? — sammyb62, Jan 17 '23 at 10:28
I get an error message, **Traceback (most recent call last): File "/run/user/759001103/gvfs/smb-share:server=192.168.0.112,share=corporate%20share/Corporate Share/Systems and Infrastructure/Engineering/jbot tests/testing.py", line 125, in 'Field_01':protocol + e.find('img', class_='grid-product__image lazyloaded').get('src'), AttributeError: 'NoneType' object has no attribute 'get'** even though it returned x amount of Field_01, in this case, 29 — sammyb62, Jan 17 '23 at 10:37
Simply check whether the element exists before reading from it: `e.find('img', class_='grid-product__image lazyloaded').get('src') if e.find('img', class_='grid-product__image lazyloaded') else None` However, this would be better suited to [asking a new question](https://stackoverflow.com/questions/ask) with exactly this focus. — HedgeHog, Jan 17 '23 at 10:50
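
The existence check from the comments can be wrapped in a small helper so the lookup is written only once. This is a sketch: attr_or_none is a hypothetical name, and the inline HTML is made up for illustration. It may also explain the partial result reported above, since an unhandled AttributeError stops the loop at the first container that lacks the element:

```python
from bs4 import BeautifulSoup


def attr_or_none(parent, tag, class_name, attr):
    """Hypothetical helper: return the attribute value, or None if the tag is missing."""
    element = parent.find(tag, class_=class_name)
    return element.get(attr) if element is not None else None


# Made-up markup: the second container has no matching <img>.
html = """
<div class="grid-item"><img class="grid-item-image" data-src="a.png"/></div>
<div class="grid-item"><p>no image here</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every container yields a value, so the loop never aborts part-way through.
images = [
    attr_or_none(e, "img", "grid-item-image", "data-src")
    for e in soup.find_all("div", class_="grid-item")
]

print(images)  # ['a.png', None]
```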
