3

I was trying to scrape data from a website for my project, but the tags I see in the browser's developer tools do not appear in my output. The following is a snapshot of the DOM from which I want to scrape the data:

<div class="bigContainer">
      <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
        <div class="fl">
          <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
          <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
          <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
              <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
           </grid-item>   

I am able to get the div tag with class "bigContainer", but I am not able to scrape the tags inside it. For example, when I try to get the grid-item tag, I get an empty list, which suggests that there is no such tag. Why is this happening? Please help!

enterML
  • Please share the code you've written so far. – JRodDynamite Dec 31 '15 at 13:00
  • r= requests.get(url) soup = BeautifulSoup(r.content,"html.parser") plink = soup.find_all("div",{"class":"f1"})[0].find_all("grid-item")[0] – enterML Dec 31 '15 at 13:17
  • 2
    Check the HTML being passed to `BeautifulSoup` (i.e. `r.content`). It can differ from the HTML shown by your developer toolbar. If it lacks the `<grid-item>` tag, JavaScript is probably being used to insert content into the web page. If that is the case you need [a JavaScript-enabled browser such as Selenium](http://stackoverflow.com/q/17436014/190597) to obtain the content. – unutbu Dec 31 '15 at 13:17
  • When I tried to print the div tag with bigContainer class, no was displayed by soup. I am still wondering how to scrape that data then – enterML Dec 31 '15 at 13:19
  • Check whether the page sends any web API requests; I think that is the case here, since the tags suggest that the site uses AngularJS. If so, we can use that to scrape the data. – Alan Francis Dec 31 '15 at 14:05
  • Check whether the site's web API has the data you want. I did a Google search and found the site you were referring to: https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2 – Alan Francis Dec 31 '15 at 14:06
  • Yes, this is the site. I am not aware of how to scrape data when Angular is used. Can you please tell me how to do that? – enterML Dec 31 '15 at 14:20
  • @Nain updated my answer below with how to get the web-api url.. – Alan Francis Dec 31 '15 at 14:32
  • Thanks !!@ Alan Francis – enterML Dec 31 '15 at 14:38

2 Answers

4

You can use the underlying web API to extract the grid-item details. The grid items are rendered in the browser by the AngularJS JavaScript framework, so they are not part of the static HTML that requests receives.
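A quick way to confirm this is to count how often the tag appears in the raw HTML and compare that with what the developer tools show. The following is a minimal sketch using only the standard library; `TagCounter`, `count_tag` and the two HTML strings are hypothetical stand-ins, not the site's actual markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often a given tag name opens in an HTML document."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag_name:
            self.count += 1


def count_tag(html, tag_name):
    """Return the number of opening tags named tag_name in html."""
    parser = TagCounter(tag_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins: what the server sends versus what the browser
# shows after AngularJS has filled in the template.
raw_html = '<div class="bigContainer"><!-- ngIf: products.grid_layout.length > 0 --></div>'
rendered_html = '<div class="bigContainer"><grid-item product="product"></grid-item></div>'

print(count_tag(raw_html, "grid-item"))       # 0
print(count_tag(rendered_html, "grid-item"))  # 1
```

In practice you would pass `r.text` from your own request as `html`. A count of zero there, while the tag is visible in the developer tools, points to content inserted by JavaScript.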

One option would be to use Selenium to get the rendered page, but identifying the web API is pretty simple with the browser's developer tools.

EDIT: I use the Firebug add-on with Firefox to see the GET requests the page makes, in the "Net" tab.


The GET request for the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2
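To see what this URL is made of, it can be split into its query parameters with the standard library. This is only a sketch: `with_params` is a hypothetical helper, and the idea that `page_count` selects the result page is an assumption that the API itself would have to confirm.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

API_URL = (
    "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones"
    "?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1"
    "&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
)

# Break the query string into a plain dict (each parameter occurs once here).
parts = urlsplit(API_URL)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print(params["callback"])        # name of the JSONP callback wrapper
print(params["items_per_page"])  # number of items requested per page


def with_params(url, **overrides):
    """Return url with the given query parameters replaced (hypothetical helper)."""
    split = urlsplit(url)
    query = {key: values[0] for key, values in parse_qs(split.query).items()}
    query.update({key: str(value) for key, value in overrides.items()})
    return urlunsplit(split._replace(query=urlencode(query)))


# Assumption: page_count is the page index. The source does not confirm this.
page_2_url = with_params(API_URL, page_count=2)
print(page_2_url)
```

The `callback` parameter is worth noting, because its value is the name of the wrapper function that surrounds the JSON in the response.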

It returned a JavaScript callback script (a JSONP response), which was almost entirely JSON data.
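As a rough illustration of what "almost entirely JSON" means, the callback wrapper can be stripped with a small helper. This is a sketch under assumptions: `parse_jsonp` is a hypothetical name, and the sample body is a shortened, made-up response in the same shape as the real one.

```python
import json
import re

# Matches  callbackName( ... )  with an optional trailing semicolon.
JSONP_PATTERN = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Return the decoded JSON payload wrapped in a JSONP callback call."""
    match = JSONP_PATTERN.match(text)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


# Hypothetical, shortened response body. The ");" inside the product name
# shows why cutting at the first ");" would truncate the JSON.
sample = 'angular.callbacks._3({"grid_layout": [{"name": "Phone (Black);", "offer_price": 5090}]});'
data = parse_jsonp(sample)
print(data["grid_layout"][0]["name"])
```

Because the pattern is greedy, it captures everything up to the last closing parenthesis, so parentheses inside string values do not cut the payload short.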

The JSON contained the details for the grid items.

Each grid item was described by a JSON object like the one below:

{
        "product_id": 23491960,
        "complex_product_id": 7287171,
        "name": "Samsung Galaxy Z1 (Black)",
        "short_desc": "",
        "bullet_points": {
            "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
        },
        "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "url_type": "product",
        "promo_text": null,
        "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
        "vertical_id": 18,
        "vertical_label": "Mobile",
        "offer_price": 5090,
        "actual_price": 5799,
        "merchant_name": "SMARTBUY",
        "authorised_merchant": false,
        "stock": true,
        "brand": "Samsung",
        "tag": "+5% Cashback",
        "product_tag": "+5% Cashback",
        "shippable": true,
        "created_at": "2015-09-17T08:28:25.000Z",
        "updated_at": "2015-12-29T05:55:29.000Z",
        "img_width": 400,
        "img_height": 400,
        "discount": "12"
    }

So you can get the details without using BeautifulSoup at all, in the following way:

import json

import requests

# Web API URL taken from the browser's network tab.
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error

# The body is JSONP: angular.callbacks._3({...});
# Keep the text between the callback's opening "(" and the last ")",
# so a ");" inside the JSON cannot cut the payload short.
json_response = response.text.split("angular.callbacks._3(", 1)[1].rsplit(")", 1)[0]
data = json.loads(json_response)
grid_data = data["grid_layout"]

for grid_item in grid_data:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")

You would get output like:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================
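Each grid item carries more fields than the three printed above. As a sketch only, the hypothetical helper `summarize` below picks a few of them and computes the saving from `actual_price` and `offer_price`; the `item` dict is a trimmed copy of the sample object shown earlier.

```python
def summarize(grid_item):
    """Pick a few fields from one grid item and compute the saving in rupees."""
    actual = grid_item["actual_price"]
    offer = grid_item["offer_price"]
    return {
        "brand": grid_item["brand"],
        "name": grid_item["name"],
        "offer_price": offer,
        "saving": actual - offer,
        "in_stock": grid_item["stock"],
    }


# Trimmed copy of the sample grid item from the JSON shown earlier.
item = {
    "name": "Samsung Galaxy Z1 (Black)",
    "brand": "Samsung",
    "offer_price": 5090,
    "actual_price": 5799,
    "stock": True,
}

summary = summarize(item)
print(summary["name"], "saves Rs", summary["saving"])

# Keep only items that are in stock, cheapest first.
items = [item]
available = sorted(
    (summarize(i) for i in items if i["stock"]),
    key=lambda s: s["offer_price"],
)
```

With the real response you would pass `data["grid_layout"]` instead of the one-element `items` list.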

Hope this helps.

Alan Francis
0

You can set a "user agent" to get the complete data. Try something like this (Java, using jsoup):

Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0").timeout(10*1000).get();

sujit