
I was trying to make a list of the top 1000 Instagrammers' accounts from this website: 'https://hypeauditor.com/top-instagram/'. The list that lxml returns is empty for both lxml.html and lxml.etree.

I tried deleting tbody, deleting text(), and using a higher-level XPath, but they all failed. What's worth noticing is that with the higher-level XPath it did return something, but it was nothing but '\n'.

I first tried lxml.etree:

import requests
from lxml import etree

market_url = 'https://hypeauditor.com/top-instagram/'
r_market = requests.get(market_url)
s_market = etree.HTML(r_market.text)
file_market = s_market.xpath('//*[@id="bloggers-top-table"]/tr[1]/td[3]/a/text()')

Then I also tried lxml.html:

from lxml import html

tree = html.fromstring(r_market.content)
result = tree.xpath('//*[@id="bloggers-top-table"]/tr/td/h4/text()')

Furthermore, I tried this XPath:

s_market.xpath('//*[@id="bloggers-top-table"]/tbody/text()')

None of these gave me an error, but after all the attempts I still get either an empty list or a list full of '\n'.

I am not really experienced in web scraping, so it is possible that I have made a silly error somewhere, but without this data I cannot start my machine learning model, so I would really appreciate some help.

Onlyfood

3 Answers


You will definitely want to get acquainted with the package BeautifulSoup, which allows you to navigate a web page's content in Python.

Using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://hypeauditor.com/top-instagram/'
r = requests.get(url)
html = r.text

soup = BeautifulSoup(html, 'html.parser')

top_bloggers = soup.find('table', id="bloggers-top-table")
table_body = top_bloggers.find('tbody')
rows = table_body.find_all('tr')

# For all data:
# Will retrieve a list of lists, good for inputting to pandas

data=[]

for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele]) # Get rid of empty values


# For just handles:
# Will retrieve a list of handles, only

handles=[]

for row in rows:
    cols = row.find_all('td')
    values = cols[3].text.strip().split('\n')
    handles.append(values[-1])

The for loop I use for rows is sourced from this answer
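If you do want to feed the full rows into pandas, here is a minimal sketch building on the data list above (I leave the columns as default integer labels, since the exact cell layout of the table isn't pinned down here):

import pandas as pd

# Build a DataFrame from the list of lists collected above.
# Column names are left as defaults because the number of non-empty
# cells per row depends on the page layout.
df = pd.DataFrame(data)
print(df.head())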

Yaakov Bressler
  • Thank you for your detailed and passionate answer, my question is solved. I will look into Beautiful Soup for sure. – Onlyfood May 30 '19 at 17:21
  • Just one more question, I hope this is not too much to ask: how do I scrape all pages of the table instead of just the first one? – Onlyfood May 30 '19 at 17:24
  • Are you asking about additional tables on this specific web page? Or additional web pages? – Yaakov Bressler May 31 '19 at 02:47
  • like, getting tables on 'https://hypeauditor.com/top-instagram/p=2' and all the way to 20. – Onlyfood May 31 '19 at 15:23
  • You can build a for loop or while loop to cycle through each of the page numbers in the URL, for example: `urls = [f'https://hypeauditor.com/top-instagram/p={i}' for i in range(1, 100)]` (see the sketch below). – Yaakov Bressler Jun 02 '19 at 02:49
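Following up on that comment, here is a minimal sketch of looping over the paginated listing with the same BeautifulSoup parsing as above. The URL pattern and the page count of 20 are assumptions taken from the comment thread, not verified against the site:

import requests
from bs4 import BeautifulSoup

all_handles = []

# Assumed URL pattern and page count, based on the comments above.
for page in range(1, 21):
    r = requests.get(f'https://hypeauditor.com/top-instagram/p={page}')
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find('table', id='bloggers-top-table')
    if table is None:
        continue  # skip pages where the table is missing
    body = table.find('tbody') or table
    for row in body.find_all('tr'):
        cols = row.find_all('td')
        if len(cols) > 3:
            # Same slicing as in the handles loop above.
            all_handles.append(cols[3].text.strip().split('\n')[-1])

print(len(all_handles))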

An easier way to do this would be to use pandas. It can read simple HTML tables like this with no problem. Try the following code to scrape the whole table.

import pandas as pd

df = pd.read_html('https://hypeauditor.com/top-instagram/')
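Note that pd.read_html returns a list of DataFrames, one per table it finds on the page, so you would typically take the first element:

tables = pd.read_html('https://hypeauditor.com/top-instagram/')
df = tables[0]  # first table on the page
print(df.head())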
Thomas Hayes

Here is a more lightweight way of getting just that column using nth-of-type. You should find this faster.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://hypeauditor.com/top-instagram/')
soup = bs(r.content, 'lxml')
accounts = [item.text.strip().split('\n') for item in soup.select('#bloggers-top-table td:nth-of-type(4)')][1:]
print(accounts)
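If you only need the handle itself, the username should be the last piece of each split cell (the same assumption about the cell layout as in the first answer):

# Keep only the last element of each split cell, i.e. the handle.
handles = [item[-1] for item in accounts if item]
print(handles[:10])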
QHarr