
I want to scrape all the posts containing some #hashtag from Instagram.

I tried it from: https://www.instagram.com/explore/tags/perfume/?__a=1

But it only gives me some posts, not every post.

Apeal Tiwari
  • I believe they have an API, another way would be scrapy +Splash (with a different starting URL) – wishmaster Feb 08 '20 at 14:34
  • Does this answer your question? [How to get ALL Instagram POSTs by hashtag with the API (not only the posts of my own account)](https://stackoverflow.com/questions/43655098/how-to-get-all-instagram-posts-by-hashtag-with-the-api-not-only-the-posts-of-my). Take a look at this [answer](https://stackoverflow.com/a/48682863/3091398) – CodeIt Feb 08 '20 at 14:36

3 Answers


Look carefully at the JSON you receive.

Navigate to graphql -> hashtag -> edge_hashtag_to_media -> page_info -> end_cursor

That's the identifier you have to use to request the next batch of media, like this:

https://www.instagram.com/explore/tags/perfume/?__a=1&max_id=QVFDNWJDZnpGbElpdEV5Q19aaldYWUsxZnc1YUd0Z21yNUZsOWw4V2NxX05ZWnZjT2pRb3lrY29ocDJnM0VNallUWGZVeDIxVURnUzltdHpBR1A1a0VRNw==

You can iterate this process to get more media for the requested hashtag.

A naive example with requests (Python 3) to extract the first 10 batches:

import requests
from time import sleep

max_id = ''
all_posts = []  # Collected media edges across batches.

base_url = "https://www.instagram.com/explore/tags/perfume/?__a=1"
for _ in range(10):
    sleep(2)  # Be polite.

    # Append the cursor from the previous batch, if we have one.
    if max_id:
        url = base_url + f"&max_id={max_id}"
    else:
        url = base_url

    print(f"Requesting {url}")
    response = requests.get(url)
    data = response.json()
    try:
        media = data['graphql']['hashtag']['edge_hashtag_to_media']
        all_posts.extend(media['edges'])  # The posts returned in this batch.
        max_id = media['page_info']['end_cursor']
        print(f"New cursor is {max_id}")
    except KeyError:
        print("There's no next page!")
        break

As said in the comment, be polite. Instagram will block you if you send too many requests per second.

Manuel Fedele
  • you should pass header to request.get() to avoid getting 429 error. header looks like : headers = { "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57", "cookie":"sessionid=YOUR_SESSION_ID;" } – parvaneh shayegh May 10 '21 at 14:21
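
To illustrate that comment, here is a minimal sketch of passing those headers with requests; the sessionid value is a placeholder you would have to copy from a logged-in browser session:

import requests

# Placeholder values taken from the comment above; sessionid must come from a logged-in session.
headers = {
    "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
    "cookie": "sessionid=YOUR_SESSION_ID;",
}

response = requests.get("https://www.instagram.com/explore/tags/perfume/?__a=1", headers=headers)
print(response.status_code)  # 200 if the request was not rejected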

The ?__a=1 endpoint doesn't work anymore. When it did, it returned around 20 or 40 posts plus the next page cursor, meaning you had to make a sequence of calls until you had all posts or got rate-limited by the website.

Nowadays, there are services where people offer access to a variety of unofficial APIs for various social media: https://rapidapi.com/search/instagram
Some APIs offer a small number of calls for free (say, 50/month), and all have paid plans for larger volumes (say, 20k/day for n dollars).

As with the older methods, each response contains a limited number of posts, so you still have to keep loading the next page.
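
As a rough sketch of that pattern, assuming a hypothetical RapidAPI provider whose endpoint returns a posts list plus a next_cursor field (the real URL, parameter, and field names depend on the provider you pick):

import requests

API_URL = "https://example-provider.p.rapidapi.com/hashtag"  # hypothetical endpoint
HEADERS = {"X-RapidAPI-Key": "YOUR_API_KEY"}  # RapidAPI sends the key in this header

posts = []
cursor = None
while True:
    params = {"tag": "perfume"}
    if cursor:
        params["cursor"] = cursor  # hypothetical pagination parameter
    data = requests.get(API_URL, headers=HEADERS, params=params).json()
    posts.extend(data.get("posts", []))  # hypothetical field holding the posts
    cursor = data.get("next_cursor")     # hypothetical field holding the next page cursor
    if not cursor:
        break

print(f"Fetched {len(posts)} posts")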

brasofilo

You can use this library: https://github.com/postaddictme/instagram-php-scraper/blob/master/examples/getMediasByTag.php

The function requires a number of media as a parameter, so if you want to recover all the media for a hashtag you will have to read the value of graphql -> hashtag -> edge_hashtag_to_media -> count from the JSON feed https://www.instagram.com/explore/tags/perfume/?__a=1
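
If that ?__a=1 feed still responds for you, a quick Python sketch (following the same JSON path) to read that count:

import requests

data = requests.get("https://www.instagram.com/explore/tags/perfume/?__a=1").json()
count = data['graphql']['hashtag']['edge_hashtag_to_media']['count']
print(f"Total media for #perfume: {count}")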

Sapppz4