
I'm trying to scrape an e-commerce store, but I'm getting `AttributeError: 'NoneType' object has no attribute 'get_text'`. This happens whenever I try to iterate through each product via the product link. I'm not sure whether I'm running into JavaScript rendering, a CAPTCHA, or something else. Here's my code:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.jumia.com'

headers = {
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

productlinks = []

for x in range(1,51):
    r = requests.get(f'https://www.jumia.com.ng/ios-phones/?page={x}#catalog-listing/')
    soup = BeautifulSoup(r.content, 'lxml')

    productlist = soup.find_all('article', class_='prd _fb col c-prd')

    for product in productlist:
        for link in product.find_all('a', href=True):
            productlinks.append(baseurl + link['href'])
           
for link in productlinks:
    r = requests.get(link, headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    
    name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
    amount = soup.find('span', class_='-b -ltr -tal -fs24').get_text(strip=True)
    review = soup.find('div', class_='stars _s _al').get_text(strip=True)
    rating = soup.find('a', class_='-plxs _more').get_text(strip=True)
    features = soup.find_all('li', attrs={'style': 'box-sizing: border-box; padding: 0px; margin: 0px;'})
    a = features[0].get_text(strip=True)
    b = features[1].get_text(strip=True)
    c = features[2].get_text(strip=True)
    d = features[3].get_text(strip=True)
    e = features[4].get_text(strip=True)
    f = features[5].get_text(strip=True)

    
    print(f"Name: {name}")
    print(f"Amount: {amount}")
    print(f"Review: {review}")
    print(f"Rating: {rating}")

    print('Key Features')
    print(f"a: {a}")
    print(f"b: {b}")
    print(f"c: {c}")
    print(f"d: {d}")
    print(f"e: {e}")
    print(f"f: {f}")
               
    print('') 

Here's the error message:

Traceback (most recent call last):
  File "c:\Users\LP\Documents\jumia\jumia.py", line 32, in <module>       
    name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'get_text'
PS C:\Users\LP\Documents\jumia>
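For reference, `soup.find()` returns `None` when nothing matches, so calling `.get_text()` on its result raises exactly this error. A minimal sketch of guarding against that, using stand-in HTML rather than the real Jumia page:

```python
from bs4 import BeautifulSoup

# Stand-in page with no matching <h1>, like the page that caused the error.
html = '<html><body><p>no product heading here</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('h1', class_='-fs20 -pts -pbxs')  # no match, so tag is None
name = tag.get_text(strip=True) if tag is not None else None
print(name)  # prints None instead of raising AttributeError
```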
Miracle
    Welcome to StackOverflow. Please post the full traceback of the error. Also, please fix your formatting; some of your code is outside of the code block. The full error should point you at which line of code caused the problem. The problem is either that you are trying to call get_text on a variable which is None, or on the result of soup.find which might return None. – Nathan Mills Oct 24 '22 at 04:21
  • @NathanMills I've edited it. Now you can see the full traceback error message. pls help me get this right. Thanks –  Miracle Oct 24 '22 at 04:51
  • There's no `h1` (heading) element with a class of "-fs20 -pts -pbxs" then. – Nathan Mills Oct 24 '22 at 05:00
  • But when I inspect the Jumia website, I can find this `h1` with its class attribute "-fs20 -pts -pbxs". It's there for sure. @Nathan Mills – Miracle Oct 24 '22 at 05:10
  • Maybe they changed the class attribute. When I search for `h1` on that page in Edge Devtools, the only result is `<h1 class="-fs20 -m -elli -phs">iOS Phones</h1>`. I tried searching for the same tag in Firefox Devtools but its search is bad. Does your script still give an error if you change line 32 to `name = soup.find('h1', class_='-fs20 -m -elli -phs').get_text(strip=True)`? – Nathan Mills Oct 24 '22 at 05:30
  • I just inspected the page again and noticed that '== $0' appears right after the line of code I'm trying to pull the 'name' information from. Here it is: `<h1 class="-fs20 -pts -pbxs">IPhone X 3GB RAM+64GB(Renewed) -Black</h1> == $0`. Any idea what it means? @Nathan Mills – Miracle Oct 24 '22 at 08:04
  • Sorry, I was looking at the wrong page. I see the `h1` with the `-fs20 -pts -pbxs` class now. Are you sure the `soup` variable contains the HTML from the right page at line 32? Perhaps the indentation of your code is incorrect, which can cause `soup` to be a different variable than you expect, since you seem to have multiple `soup` variables. About the `==$0`, that's just something Chrome adds to the element you select; see https://stackoverflow.com/questions/36999739/what-does-0-double-equals-dollar-zero-mean-in-chrome-developer-tools – Nathan Mills Oct 24 '22 at 22:53
  • Hello @Nathan Mills, I really appreciate your patience. Unfortunately I just tried different indentation and I'm still getting the AttributeError. Maybe you could check whether it works for you. I'm so stuck right now. – Miracle Oct 25 '22 at 06:09
  • I ran your code under the Python debugger, `pdb`, and Python gives me the same error you're getting. I printed out the `soup` variable and it looks like the page it's getting is the "select your country" page instead of the product page. Try changing the variable `baseurl` to `https://www.jumia.com.ng` or one of the other country-specific Jumia domains (`jumia.com.foo` where `foo` is the country code). – Nathan Mills Oct 25 '22 at 23:51
  • Hi @Nathan Mills, you're a genius. The AttributeError is solved, but unfortunately now I'm getting an IndexError: Traceback (most recent call last): File "c:\Users\LP\Documents\jumia\jumia.py", line 36, in a = features[0].get_text(strip=True) IndexError: list index out of range – Miracle Oct 26 '22 at 05:39
  • Also, just in case, so I'd know the right way to debug next time: how do I run code under the Python debugger? Which of the `soup` variables did you print out? I did `print(soup)` for both `soup` variables but didn't get the "select your country" page. If you could demonstrate with a code block it might help me understand more. Thanks – Miracle Oct 26 '22 at 06:05
  • To run your script under the Python debugger, do `python -m pdb jumia.py` from the command-line (not the Python prompt) when you're in the same directory as the script or add the line `import pdb;pdb.set_trace()` at the top of your script. – Nathan Mills Oct 27 '22 at 04:56
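The debugging in the comments above found that the request was returning Jumia's country-selection page instead of a product page. A quick non-interactive way to confirm which page `soup` actually holds is to print its `<title>`; this sketch uses a stand-in HTML string rather than a live request:

```python
from bs4 import BeautifulSoup

# Stand-in HTML for the page the scraper unexpectedly received.
html = '<html><head><title>Jumia - Select your country</title></head><body></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Printing the <title> quickly reveals which page soup actually holds.
title = soup.title.get_text(strip=True) if soup.title else None
print(title)
```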

1 Answer

Change the variable baseurl to https://www.jumia.com.ng and change the features variable to features = soup.find('article', class_='col8 -pvs').find_all('li'). After fixing those two issues, you'll probably get an IndexError because not every page has six features listed. You can use something like the following code to iterate through the features and print them:

for i, feature in enumerate(features):
    print(chr(ord("a")+i) + ":", feature.get_text(strip=True))

With this for loop, you don't need the a to f variables. The `chr(ord("a")+i)` part gets the letter corresponding to index i. However, if there are more than 26 features, this will print punctuation characters or garbage; that can be trivially fixed by breaking out of the loop when i > 25. Note this trick only works on ASCII systems, not EBCDIC ones.
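An alternative sketch of the same labeling, using `string.ascii_lowercase` with `zip()` (stand-in feature strings instead of real scraped tags); `zip` stops at the shorter sequence, so it never runs past 26 labels:

```python
import string

features = ['6.1-inch display', '64GB storage', '3GB RAM']  # stand-in data
labels = [f"{letter}: {feature}"
          for letter, feature in zip(string.ascii_lowercase, features)]
for line in labels:
    print(line)
```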

Even after making these three changes, there was an AttributeError when the script tried to scrape a link to a product unrelated to iPhones (a medicinal cream) that showed up on page 5 of the results; I don't know how the script picked up that link. To fix it, either wrap the body of the second for loop in a try/except like the following, or put the last line of the first for loop under an `if 'iphone' in link['href']:` check.

for link in productlinks:
    try:
        # body of for loop goes here
    except AttributeError:
        continue

With these changes, the script would look like this:

import requests
from bs4 import BeautifulSoup

baseurl = 'https://www.jumia.com.ng'

headers = {
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

productlinks = []

for x in range(1,51):
    r = requests.get(f'https://www.jumia.com.ng/ios-phones/?page={x}#catalog-listing/')
    soup = BeautifulSoup(r.content, 'lxml')

    productlist = soup.find_all('article', class_='prd _fb col c-prd')

    for product in productlist:
        for link in product.find_all('a', href=True):
            if 'iphone' in link['href']:
                productlinks.append(baseurl + link['href'])
           
for link in productlinks:
    r = requests.get(link, headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')

    try:
        name = soup.find('h1', class_='-fs20 -pts -pbxs').get_text(strip=True)
        amount = soup.find('span', class_='-b -ltr -tal -fs24').get_text(strip=True)
        review = soup.find('div', class_='stars _s _al').get_text(strip=True)
        rating = soup.find('a', class_='-plxs _more').get_text(strip=True)
        features = soup.find('article', class_='col8 -pvs').find_all('li')
        
        print(f"Name: {name}")
        print(f"Amount: {amount}")
        print(f"Review: {review}")
        print(f"Rating: {rating}")

        print('Key Features')
        for i, feature in enumerate(features):
            if i > 25: # we ran out of letters
                break
            print(chr(ord("a")+i) + ":", feature.get_text(strip=True))
                   
        print('')
    except AttributeError:
        continue
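If you'd rather not wrap the whole loop body in a broad `except AttributeError`, a None-tolerant lookup helper is another option. The `text_or` function below is hypothetical (not part of the original answer) and is shown against a stand-in HTML fragment:

```python
from bs4 import BeautifulSoup

def text_or(soup, tag, class_, default='N/A'):
    """Return stripped text of the first match, or a default when find() misses."""
    el = soup.find(tag, class_=class_)
    return el.get_text(strip=True) if el is not None else default

html = '<h1 class="-fs20 -pts -pbxs">IPhone X 3GB RAM+64GB(Renewed) -Black</h1>'
soup = BeautifulSoup(html, 'html.parser')
print(text_or(soup, 'h1', '-fs20 -pts -pbxs'))      # the product name
print(text_or(soup, 'span', '-b -ltr -tal -fs24'))  # no such span: 'N/A'
```

With this approach, a missing element produces a placeholder value in the output instead of silently skipping the whole product.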
Nathan Mills