I am new to web scraping. I have been given a task to extract data from the Kaggle Hacker News dataset page (the URL is in the code below).

I have chosen the "comments" dataset. Below is my scraping code.

import requests
from bs4 import BeautifulSoup

url = 'https://www.kaggle.com/hacker-news/hacker-news'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
response.status_code  # HTTP status of the page request
response.content      # raw HTML returned by the server
soup = BeautifulSoup(response.content, 'html.parser')
# Look for the table body seen in the browser's element inspector
soup.find_all('tbody', class_='TableBody-kSbjpE jGqIxa')

When I execute the last line, it returns [].

So I am stuck here. I know we can get the data from a kernel, but for practice purposes, where am I going wrong? Am I choosing the wrong class? I want to scrape the data and save it to a CSV file or to a NoSQL database, preferably Cassandra.

Stimks
  • Did you check the content? Does it contain the elements you are trying to find? – baklarz2048 Aug 14 '18 at 11:15
  • @baklarz2048 Well yeah, I have inspected it, and it contains all the rows and columns that I want to extract. It's strange that it is returning an empty list. – Stimks Aug 14 '18 at 11:17

2 Answers

You are getting [] because the data you want to scrape comes from an API that is called after the web page loads, so the page you are requesting does not contain that class.

You can open your browser's developer tools and check the Network tab, as shown in the screenshot. There you will find the request that returns the data you want, so you have to make a request to that URL to get the data.

[screenshot of the browser's Network tab]

You can retrieve the data from this URL; in the Preview tab you can see all of the data.
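Once you have that response, turning it into a CSV file takes only the standard library. The following is a minimal sketch, not the actual Kaggle response format: it assumes the endpoint returns a JSON array of flat objects, and the field names in sample_payload are made up for illustration. In real use the text would come from something like requests.get(...).text against the URL you found in the Network tab.

```python
import csv
import io
import json


def json_rows_to_csv(payload_text, out_file):
    """Write a JSON array of flat objects to out_file as CSV; return the row count."""
    rows = json.loads(payload_text)
    if not rows:
        return 0
    # Use the keys of the first row as the header (assumes every row has the same keys).
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# Hypothetical payload standing in for the body of the API response.
sample_payload = (
    '[{"id": 1, "by": "alice", "text": "first"},'
    ' {"id": 2, "by": "bob", "text": "second"}]'
)

buffer = io.StringIO()  # swap for open("comments.csv", "w", newline="") to write a file
count = json_rows_to_csv(sample_payload, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

If the real payload nests the rows under another key, pull that list out before passing it to the writer.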

Also, if you have a good knowledge of Python, you can use Scrapy to scrape the data:

https://doc.scrapy.org/en/latest/intro/overview.html

Milan Hirpara
  • Hi, thanks, that cleared up most of my understanding. Also, if I now want to scrape this using Scrapy, what response should I pass? For example: table = response.xpath('//*[@class= "GetDataView"]'). I am not able to find a reference to it to scrape its data. – Stimks Aug 14 '18 at 12:16
  • Hello @Stimks, I don't think you have to scrape anything. If you are able to get a response from the API shown in the screenshot, there is no need to scrape, because you will get the data in JSON format; just use the response from that API. – Milan Hirpara Aug 14 '18 at 12:30
  • Yeah, when I tried it, it gave me a 404 and later something with GET. Thank you. – Stimks Aug 15 '18 at 03:27
Even though you were able to see the tbody with class 'TableBody-kSbjpE jGqIxa' in the element inspector, the response to your request does not contain this class. See for yourself with print(soup.prettify()). This is most likely because you're not requesting the correct URL.
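The difference between what the inspector shows and what the server sends can be illustrated offline. This is only a sketch: both HTML strings are hypothetical stand-ins (not Kaggle's real markup), and it uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs. The same check with BeautifulSoup would be the find_all call from the question.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.matches.append(tag)


def find_tags_with_class(html, class_name):
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.matches


# Hypothetical HTML as the server might send it, before any JavaScript runs.
static_html = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)
# Hypothetical DOM as the inspector shows it, after scripts have built the table.
rendered_html = (
    '<html><body><div id="root"><table>'
    '<tbody class="TableBody-kSbjpE jGqIxa"></tbody>'
    '</table></div></body></html>'
)

print(find_tags_with_class(static_html, "TableBody-kSbjpE"))    # []
print(find_tags_with_class(rendered_html, "TableBody-kSbjpE"))  # ['tbody']
```

requests only ever receives the first kind of document, which is why the search comes back empty.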

You may not be aware of this, but as an FYI: you don't actually need to scrape using BeautifulSoup; you can get a list of all the available datasets from the Kaggle API. Once you have it installed and configured, you can download the dataset: kaggle datasets download -d . Here's more info if you wish to proceed with the API instead: https://github.com/Kaggle/kaggle-api

usr
  • Thank you, I will try this using the Kaggle API. Is it possible to connect that stream of data directly to a NoSQL database, like a pipeline? – Stimks Aug 14 '18 at 12:19
  • That would depend on your environment and setup, but I can't see why not. Though why do you want a NoSQL database? You're extracting tabular data, so why use something that's designed for unstructured data? Wouldn't a simple SQL database do the job? – usr Aug 14 '18 at 12:33
  • I was trying it for practice and learning purposes. Hmm, so I guess it's convenient to connect it directly to a database for storage instead of downloading the whole data. Should there be batching? I will give it a go with SQL first. – Stimks Aug 15 '18 at 03:25
  • This is a BigQuery dataset that OP is attempting to scrape, so there are no files available for download: https://www.kaggle.com/hacker-news/hacker-news. It's not possible to download BigQuery datasets via the API or web UI on Kaggle. – Meg Risdal Aug 19 '18 at 00:55
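
The batching question raised in the comments above can be sketched with SQLite from the standard library. This is only an illustration under assumptions: SQLite stands in for "a simple SQL database", the comments table and its columns are hypothetical, and the rows are generated rather than fetched from Kaggle.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=500):
    """Insert (id, author, text) tuples in batches; return the number of batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS comments "
        "(id INTEGER PRIMARY KEY, author TEXT, text TEXT)"
    )
    batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits this batch, or rolls it back if an insert fails
            conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", batch)
        batches += 1
    return batches


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
rows = [(i, f"user{i}", f"comment {i}") for i in range(1, 1201)]  # made-up rows
batches = insert_in_batches(conn, rows, batch_size=500)
total = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(batches, total)
```

Committing per batch keeps memory use bounded and means a failure loses at most one batch rather than the whole load.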