How to scrape multiple websites with different data in urls

Question

I'm scraping data from a set of web pages where the product ID is the last part of the URL. The output repeats the same data on every row, as if the data for the next product is never appended. I don't know exactly what's going on: whether my first `for` loop is wrong, or the indentation. I tried earlier without the dictionary; that appended the data, but all on one line, and transposing it didn't give me what I wanted. So I rewrote it this way, and now it doesn't append the next rows. Help, please.

data_cols = []
cols = {'pro_header': [],
        'pro_id': [],
        .
        .
        .
        'pro_uns5': []
        }
#the id for each product
fileID = open('idProductsList.txt', 'r')
proIDS = fileID.read().split()
for proID in proIDS:
    url = 'https://website.com/mall/es/mx/Catalog/Product/' + proID
    html = urllib2.urlopen(url).read()
    soup = bs.BeautifulSoup(html , 'lxml')
    table = soup.find("table",{"class": "ProductDetailsTable"})
    rows = table.find_all('tr')
    for row in rows:
        labels.append(str(row.find_all('td')[0].text))
        try:
            data.append(str(row.find_all('td')[1].text))
        except IndexError:
            data.append('')

    cols['pro_header'].append(data[0])
    cols['pro_id'].append(data[1])
    .
    .
    .
    cols['pro_uns5'].append(data[43])
    df = pd.DataFrame(cols)
    df.set_index
    #df.reindex()
    df.to_csv('sample1.csv')

The actual output is:

pro_id  pro_priceCostumer   pro_priceData
1FK7011-5AK24-1AA3  " Mostrar precios
"   PM300:Producto activo
1FK7011-5AK24-1AA3  " Mostrar precios
"   PM300:Producto activo
1FK7011-5AK24-1AA3  " Mostrar precios
"   PM300:Producto activo

Should be something like this (This is just a small representation of the data):

pro_id  pro_priceCostumer   pro_priceData
1FK7011-5AK24-1AA3  " Mostrar precios
"   PM300:Producto activo
1FK7011-5AK24-1JA3  " Mostrar precios
"   PM300:Producto activo
1FK7022-5AK21-1UA0  " Mostrar precios
"   PM300:Producto activo

Can you share the URL? I'm thinking it might be quicker/easier to access the data API... or, since you are grabbing `<table>` tags, just go with pandas to pull it. — chitown88, Nov 06 '19 at 16:53
I'm not sure what the difference is here between your actual output and the desired output. They look the same — chitown88, Nov 06 '19 at 16:53

score 0 · Answer 1 · edited Nov 06 '19 at 09:53

I guess `labels` is never defined as a list in your code. To append to it, it has to be a list: add `labels = list()` at the top of your code as a global variable. The same should be done for `data`.
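One caveat, offered as a possible reading of the symptom rather than a confirmed diagnosis: if `labels` and `data` are created only once, outside the `for proID` loop, they keep growing, so `data[0]`, `data[1]`, ... always point at the first product's cells and every row repeats. A minimal sketch of resetting them per product follows. `FAKE_PAGES` and `collect_columns` are hypothetical names, and the cell values are invented stand-ins for what `row.find_all('td')` would return.

```python
# Hypothetical stand-in for the scraped pages: product ID -> table rows,
# each row being the list of <td> texts. The cell values are invented.
FAKE_PAGES = {
    "1FK7011-5AK24-1AA3": [["Header", "header A"], ["ID", "1FK7011-5AK24-1AA3"], ["Spacer"]],
    "1FK7011-5AK24-1JA3": [["Header", "header B"], ["ID", "1FK7011-5AK24-1JA3"], ["Spacer"]],
}


def collect_columns(pages):
    """Build one list per column, with one entry per product."""
    cols = {"pro_header": [], "pro_id": []}
    for pro_id, rows in pages.items():
        # In the real script, the page for pro_id would be fetched and parsed here.
        # Reset the per-product lists so data[0] refers to THIS product.
        labels = []
        data = []
        for cells in rows:
            labels.append(cells[0])
            # Rows with a single cell get an empty value, like the IndexError branch.
            data.append(cells[1] if len(cells) > 1 else "")
        cols["pro_header"].append(data[0])
        cols["pro_id"].append(data[1])
    return cols


columns = collect_columns(FAKE_PAGES)
print(columns["pro_id"])  # ['1FK7011-5AK24-1AA3', '1FK7011-5AK24-1JA3']
```

With the lists reset on each iteration, the `DataFrame` can also be built and written with `to_csv` once, after the loop has finished, instead of on every pass.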

answered Nov 06 '19 at 07:25 by Araf
edited Nov 06 '19 at 09:53 by Nazim Kerimbekov
