
I'm trying to scrape some data from a website, but in the final result some of the values are wrapped in list brackets. How can I extract the data without those brackets?

The original code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

user_input = 'ios-phones'  # input('Please Enter Your Favorite Item:- ')
try:
    data_list = []
    for i in range(1,30):

        url = f'https://www.jumia.com.eg/{user_input}/?page={i}#catalog-listing'
        page = requests.get(url).content
        soup = BeautifulSoup(page,'lxml')
        phones = soup.find('div',class_='-paxs row _no-g _4cl-3cm-shs')
        phones_info = phones.find_all('article',class_=True)

        for i in phones_info:
            try:
                title = i.select('.name')[0].text.strip()
                current_price = i.select('.prc')[0].text
                old_price = i.find('div',class_='old')
                rating = i.find('div',class_='stars')
            except:
                pass

            row = {'Phone Title':title,'Current Price':current_price,'Old Price':old_price,'Rating':rating}
            data_list.append(row)
except:
    pass
df = pd.DataFrame(data_list)
df
Mahmoud Badr

2 Answers


The main issue seems to be that you append the bs4 objects for old_price and rating rather than their text, as you already do for title and current_price. So change the loop to:

for i in phones_info:
    title = i.select_one('.name').get_text(strip=True)
    current_price = i.select_one('.prc').get_text(strip=True)
    old_price = i.select_one('.old').get_text(strip=True) if i.select_one('.old') else None
    rating = i.select_one('.stars').get_text(strip=True) if i.select_one('.stars') else None

    row = {'Phone Title':title,'Current Price':current_price,'Old Price':old_price,'Rating':rating}
    data_list.append(row)

Output

   | Phone Title | Current Price | Old Price | Rating
0  | Apple Iphone 14 Pro Max – 5G Single SIM – 256/6GB RAM – Deep Purple | EGP 54,499.00 | EGP 70,000.00 | None
1  | Apple IPhone 13 Single SIM With FaceTime - 128GB - Pink | EGP 29,999.00 | EGP 50,000.00 | 5 out of 5
2  | Apple IPhone 13 Pro Max Single SIM With FaceTime - 512GB - AlpineGreen | EGP 55,499.00 | EGP 65,000.00 | None
3  | Apple IPhone 12 Mini With FaceTime - 128GB - Blue | EGP 25,900.00 | None | 4.7 out of 5
4  | Apple IPhone 12 With FaceTime - 128GB - Purple | EGP 27,900.00 | None | 4.2 out of 5
...
35 | Apple Iphone 13 128G Green | EGP 31,900.00 | None | None
36 | Apple IPhone 13 / 512GB / Pink | EGP 40,900.00 | None | None
37 | Apple IPhone 13 (128GB) Red | EGP 31,900.00 | None | None
38 | Apple IPhone 13 Pro Single SIM With FaceTime - 128GB - Sierra Blue | EGP 41,900.00 | None | None
39 | Apple IPhone 13 128GB Starlight | EGP 31,900.00 | None | None
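
To try the text-extraction approach without hitting the live site, here is a minimal, self-contained sketch. The HTML snippet is made up and only mimics the class names used above (.name, .prc, .old, .stars), and the helper name text_or_none is hypothetical. It uses the built-in html.parser backend so lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup that only mimics the class names from the question;
# the real page structure may differ.
HTML = """
<article><h3 class="name">Phone A</h3><div class="prc">EGP 100.00</div>
<div class="old">EGP 150.00</div><div class="stars">5 out of 5</div></article>
<article><h3 class="name">Phone B</h3><div class="prc">EGP 80.00</div></article>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


soup = BeautifulSoup(HTML, "html.parser")

# Each row is built only from its own <article>, so a missing element yields None.
rows = [
    {
        "Phone Title": text_or_none(article, ".name"),
        "Current Price": text_or_none(article, ".prc"),
        "Old Price": text_or_none(article, ".old"),
        "Rating": text_or_none(article, ".stars"),
    }
    for article in soup.select("article")
]
```

Looking each selector up once through a helper avoids calling select_one twice per field, and rows can be passed straight to pd.DataFrame(rows).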
HedgeHog
  • I tried that, but the output is not accurate: the None rows take the same value as the previous row. Please check the old_price and rating output again. – Mahmoud Badr Feb 21 '23 at 13:19
  • The focus of the OP is *how can i extract the data without those list brackets*, and that is the solution for it. If you check the source of the page, you will see that you get what you are searching for, even if it is not visibly rendered on the screen. – HedgeHog Feb 21 '23 at 14:15
  • The issue in the above table is, for example, at index 3 and 4: the old price for them should be None, which you can verify on the original page, but the solution you provided takes the value of index 2 and repeats it at index 3, and the same thing happens at index 4. – Mahmoud Badr Feb 21 '23 at 14:29
  • It depends on how it is accessed; you get different information depending on the point. Maybe next time you could provide some example structure. Anyway, the revision should correspond to both the OP and the annotation. – HedgeHog Feb 21 '23 at 18:18
  • Many thanks for your support, your last solution works for me. – Mahmoud Badr Feb 21 '23 at 21:53
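
One plausible way to get the repeated values described in the comments above is a variable that keeps its value from the previous loop iteration, for example when an exception is swallowed partway through a block of assignments. The following is a minimal sketch in plain Python. The records and function names are made up for illustration and stand in for the scraped articles.

```python
# Made-up records standing in for scraped articles; the second has no old price.
records = [
    {"title": "Phone A", "old": "EGP 150.00"},
    {"title": "Phone B"},
]


def collect_with_stale_values(items):
    """Swallowing the error leaves old_price set to the previous item's value."""
    rows = []
    for item in items:
        try:
            title = item["title"]
            old_price = item["old"]  # raises KeyError for Phone B
        except KeyError:
            pass  # old_price silently keeps the value from the previous item
        rows.append({"title": title, "old": old_price})
    return rows


def collect_with_defaults(items):
    """dict.get returns None for a missing key, so each row uses only its own item."""
    return [{"title": item["title"], "old": item.get("old")} for item in items]


stale = collect_with_stale_values(records)
clean = collect_with_defaults(records)

print(stale[1])  # {'title': 'Phone B', 'old': 'EGP 150.00'}
print(clean[1])  # {'title': 'Phone B', 'old': None}
```

Computing every field from the current item only, with an explicit None fallback, removes the dependency on whatever the previous iteration left behind.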

You can use re.sub to replace the brackets in the string with empty strings, like this:

import re

# Raw strings avoid invalid-escape warnings for "\[" and "\]".
old_price = re.sub(r"\[|\]", "", old_price)
rating = re.sub(r"\[|\]", "", rating)

Sample input:

old_price = "[EGP 50,000.00]"
rating = "[5 out of 5, []]"

Output:

EGP 50,000.00
5 out of 5, 
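
The same substitution can be wrapped in a small helper. This is only a sketch: the name strip_brackets is hypothetical, and it assumes the values are already plain strings (or None), not bs4 objects. It uses a character class instead of the alternation and also trims the leftover whitespace.

```python
import re


def strip_brackets(value):
    """Remove every '[' and ']' from a string and trim surrounding whitespace."""
    if value is None:
        return None  # keep missing values as None
    return re.sub(r"[\[\]]", "", value).strip()


print(strip_brackets("[EGP 50,000.00]"))   # EGP 50,000.00
print(strip_brackets("[5 out of 5, []]"))  # 5 out of 5,
```

Note that this only cleans up the string after the fact. The trailing comma in the second result shows that leftover separators would still need separate handling.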
B Remmelzwaal