How to remove list brackets from web-scraped data

Question

I'm trying to scrape some data from a website, but the final result has the output data wrapped in lists. How can I extract the data without those list brackets?

The original code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

user_input = 'ios-phones'  # input('Please Enter Your Favorite Item:- ')
try:
    data_list = []
    for i in range(1,30):

        url = f'https://www.jumia.com.eg/{user_input}/?page={i}#catalog-listing'
        page = requests.get(url).content
        soup = BeautifulSoup(page,'lxml')
        phones = soup.find('div',class_='-paxs row _no-g _4cl-3cm-shs')
        phones_info = phones.find_all('article',class_=True)

        for i in phones_info:
            try:
                title = i.select('.name')[0].text.strip()
                current_price = i.select('.prc')[0].text
                old_price = i.find('div',class_='old')
                rating = i.find('div',class_='stars')
            except:
                pass

            row = {'Phone Title':title,'Current Price':current_price,'Old Price':old_price,'Rating':rating}
            data_list.append(row)
except:
    pass
df = pd.DataFrame(data_list)
df

HedgeHog · Accepted Answer · 2023-02-21T15:15:56.733

1

The main issue seems to be that you append the bs4 objects for old_price and rating rather than their text, as you already do for title and current_price. Change the loop to:

for i in phones_info:
    title = i.select_one('.name').get_text(strip=True)
    current_price = i.select_one('.prc').get_text(strip=True)
    old_price = i.select_one('.old').get_text(strip=True) if i.select_one('.old') else None
    rating = i.select_one('.stars').get_text(strip=True) if i.select_one('.stars') else None

    row = {'Phone Title':title,'Current Price':current_price,'Old Price':old_price,'Rating':rating}
    data_list.append(row)

Output

	Phone Title	Current Price	Old Price	Rating
0	Apple Iphone 14 Pro Max – 5G Single SIM – 256/6GB RAM – Deep Purple	EGP 54,499.00	EGP 70,000.00	None
1	Apple IPhone 13 Single SIM With FaceTime - 128GB - Pink	EGP 29,999.00	EGP 50,000.00	5 out of 5
2	Apple IPhone 13 Pro Max Single SIM With FaceTime - 512GB - AlpineGreen	EGP 55,499.00	EGP 65,000.00	None
3	Apple IPhone 12 Mini With FaceTime - 128GB - Blue	EGP 25,900.00	None	4.7 out of 5
4	Apple IPhone 12 With FaceTime - 128GB - Purple	EGP 27,900.00	None	4.2 out of 5
...
35	Apple Iphone 13 128G Green	EGP 31,900.00	None	None
36	Apple IPhone 13 / 512GB / Pink	EGP 40,900.00	None	None
37	Apple IPhone 13 (128GB) Red	EGP 31,900.00	None	None
38	Apple IPhone 13 Pro Single SIM With FaceTime - 128GB - Sierra Blue	EGP 41,900.00	None	None
39	Apple IPhone 13 128GB Starlight	EGP 31,900.00	None	None
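
To try the same extraction pattern without hitting the live site, here is a minimal sketch that runs against inline markup. The HTML snippet, the `text_or_none` helper, and the use of the built-in `html.parser` are assumptions made for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that loosely mimics two product cards;
# the second card has no old price and no rating.
SAMPLE_HTML = """
<article class="prd">
  <h3 class="name"> Phone A </h3>
  <div class="prc">EGP 100.00</div>
  <div class="old">EGP 150.00</div>
  <div class="stars">5 out of 5</div>
</article>
<article class="prd">
  <h3 class="name">Phone B</h3>
  <div class="prc">EGP 200.00</div>
</article>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    {
        "Phone Title": text_or_none(card, ".name"),
        "Current Price": text_or_none(card, ".prc"),
        "Old Price": text_or_none(card, ".old"),
        "Rating": text_or_none(card, ".stars"),
    }
    for card in soup.select("article")
]

for row in rows:
    print(row)
```

Every value in `rows` is either a plain string or None, so a DataFrame built from it has no Tag objects to render with brackets.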

edited Feb 21 '23 at 15:15

answered Feb 21 '23 at 13:05


I tried that, but the output is not accurate: the None rows take the same value as the previous row. Please check the old_price and rating output again. – Mahmoud Badr Feb 21 '23 at 13:19
The focus of the OP is *how can i extract the data without those list brackets*, and that is the solution for it. If you check the source of the page, you will see that you get what you are searching for, even if it is not visibly rendered on the screen. – HedgeHog Feb 21 '23 at 14:15
The issue in the above table is, for example, in index 3 and 4: the old price for them should be None, which you can verify on the original page, but the solution you provided takes the value of index 2 and repeats it in index 3, and then the same thing happens in index 4. – Mahmoud Badr Feb 21 '23 at 14:29
It depends on how the element is accessed; you get different information depending on the starting point. Maybe next time you could provide some example structure. Anyway, the revision should address both the OP and the comment. – HedgeHog Feb 21 '23 at 18:18
Many thanks for your support, your last solution works for me. – Mahmoud Badr Feb 21 '23 at 21:53
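
The thread does not establish what caused the repeated values. One general way a row can inherit the previous row's value is the shape used in the question's code: assignments inside a `try` with a swallowed exception, followed by use of those variables. A minimal sketch with hypothetical records (not the scraped data):

```python
# Hypothetical records: the second one has no "old" value.
records = [
    {"name": "A", "old": "150"},
    {"name": "B"},
]

# Assignments inside try, exception swallowed, variables used afterwards.
leaky = []
for rec in records:
    try:
        name = rec["name"]
        old = rec["old"]  # raises KeyError for record B
    except Exception:
        pass  # `old` still holds the value from record A
    leaky.append((name, old))

# Computing each value fresh per iteration (here via dict.get) avoids the carry-over.
clean = [(rec["name"], rec.get("old")) for rec in records]

print(leaky)  # [('A', '150'), ('B', '150')]
print(clean)  # [('A', '150'), ('B', None)]
```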

score 0 · Answer 2 · answered Feb 21 '23 at 11:19

You can use re.sub to replace the brackets in the string with empty strings, like this:

import re

# Raw strings keep the backslashes from being read as string escapes.
old_price = re.sub(r"\[|\]", "", old_price)
rating = re.sub(r"\[|\]", "", rating)

Sample input:

old_price = "[EGP 50,000.00]"
rating = "[5 out of 5, []]"

Output:

EGP 50,000.00
5 out of 5,
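
A self-contained version of the same idea, as a sketch: the `strip_brackets` helper and the character-class pattern are assumptions for illustration. Note that `re.sub` only accepts strings (or bytes), so a bs4 Tag has to be converted to text first; passing the Tag itself raises a TypeError.

```python
import re

# Raw string; the character class matches a literal "[" or "]".
BRACKETS = re.compile(r"[\[\]]")


def strip_brackets(value):
    """Remove square brackets from a string; pass None through unchanged."""
    if value is None:
        return None
    return BRACKETS.sub("", value).strip()


print(strip_brackets("[EGP 50,000.00]"))   # EGP 50,000.00
print(strip_brackets("[5 out of 5, []]"))  # 5 out of 5,
print(strip_brackets(None))                # None
```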

answered Feb 21 '23 at 11:19

B Remmelzwaal


Could you show your answer applied to my full code? I tried your solution but it didn't work for me. – Mahmoud Badr Feb 21 '23 at 11:25
Where in your code did you implement it? – B Remmelzwaal Feb 21 '23 at 11:28
My original code is in the question above. – Mahmoud Badr Feb 21 '23 at 11:28
I mean where did you implement the _solution_. Is it after first getting the values, e.g. right above the `except` clause? – B Remmelzwaal Feb 21 '23 at 11:32
`title = i.select('.name')[0].text.strip() current_price = i.select('.prc')[0].text old_price = i.find('div',class_='old') old_price = re.sub("\[|\]", "", old_price) rating = i.find('div',class_='stars') rating = re.sub("\[|\]", "", rating)` – Mahmoud Badr Feb 21 '23 at 11:34
And you also imported `re`? – B Remmelzwaal Feb 21 '23 at 11:36
Yes, of course, I imported it along with my other libraries. – Mahmoud Badr Feb 21 '23 at 11:37
Could you check what the types of `old_price` and `rating` are? – B Remmelzwaal Feb 21 '23 at 11:38
RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Phone Title    46 non-null     object
 1   Current Price  46 non-null     object
 2   Old Price      23 non-null     object
 3   Rating         0 non-null      object
dtypes: object(4)
– Mahmoud Badr Feb 21 '23 at 11:39
I think you could try using `old_price.string` and `rating.string` in the `re.sub` expression to get the text. – B Remmelzwaal Feb 21 '23 at 11:48
That solved it for "old_price", but "rating" is not solved yet. – Mahmoud Badr Feb 21 '23 at 11:50
It would be extremely helpful to get a snippet of what the actual page data looks like. – B Remmelzwaal Feb 21 '23 at 11:52
You can open the page yourself (https://www.jumia.com.eg/ios-phones/). – Mahmoud Badr Feb 21 '23 at 11:56
I am not able to reproduce your code using the part you've provided. Anyway, maybe `.text` would work, since the element contains `::before`. From [here](https://stackoverflow.com/q/52936783/17200348). – B Remmelzwaal Feb 21 '23 at 12:24
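
One possible reason `.string` worked for one field but not the other: `.string` returns None when a tag has more than one child, whereas `.text` / `get_text()` joins all nested text. The sketch below assumes, hypothetically, that the rating element holds its text plus a nested element; the actual page markup is not shown in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "stars" div has two children (text plus a nested div).
html = (
    '<div class="old">EGP 50,000.00</div>'
    '<div class="stars">5 out of 5<div class="in"></div></div>'
)
soup = BeautifulSoup(html, "html.parser")
old = soup.select_one(".old")
stars = soup.select_one(".stars")

print(old.string)                  # EGP 50,000.00
print(stars.string)                # None, because the tag has two children
print(stars.get_text(strip=True))  # 5 out of 5
```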
