
I am trying to build a program with a function that takes the URL of a post as input and outputs the links to the images and videos the post contains. It works well for images. However, when it comes to videos, it returns a wrong URL. I have no idea how to handle this.

https://scontent-lax3-2.cdninstagram.com/v/t50.2886-16/86731551_2762014420555254_3542675879337307555_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jYXJvdXNlbF9pdGVtIiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=scontent-lax3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=WDuXskvIuLEAX9rj7MU&vs=17877888256532912_3147883953&_nc_vs=HBksFQAYJEdCOXJLd1gyMVdhWUNkQUpBS090UGo3eEhDb3hia1lMQUFBRhUAAsgBABUAGCRHTFBXTUFVTXNPaG5XcW9EQU5KUEE5bEZVdVZxYmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAABgAFuD4nJGH9sE%2FFQIoAkMzLBdAEszMzMzMzRgSZGFzaF9iYXNlbGluZV8xX3YxEQB17gcA&_nc_rid=97e769e058&oe=5EDF10A5&oh=3713c35f89fca1aa9554a281aa3421ed

https://scontent-gmp1-1.cdninstagram.com/v/t50.2886-16/0_0_0_\x00.mp4?_nc_ht=scontent-gmp1-1.cdninstagram.com&_nc_cat=100&_nc_ohc=Wnu_-GvKHJoAX9F_ui1&oe=5EDE8214&oh=7920ac3339d5bf313e898c3cbec85fa2

Here are two URLs. The first was copied from the source of the web page, while the second was copied from the data scraped with pyquery. They come from the same Instagram post and the same path, but they are completely different. The first one works, but the second one doesn't. How can I solve this? How can I get the correct video URL?

I have searched the web for a long time without success. Please help, or suggest some ideas for how to achieve this.

Here is the code related to the question:

import json

from pyquery import PyQuery as pq


def getUrls(url):
    URL = str(url)
    html = get_html(URL)  # get_html is my own helper (not shown here)
    doc = pq(html)
    urls = []
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            # Drop the leading 'window._sharedData = ' and the trailing ';'
            js_data = json.loads(item.text()[21:-1])
            shortcode_media = js_data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            edges = shortcode_media['edge_sidecar_to_children']['edges']

            for edge in edges:
                is_video = edge['node']['is_video']
                if is_video:
                    video_url = edge['node']['video_url']
                    # str.replace returns a new string, so keep the result
                    video_url = video_url.replace(r'\u0026', "&")
                    urls.append(video_url)
                else:
                    display_url = edge['node']['display_resources'][-1]['src']
                    display_url = display_url.replace(r'\u0026', "&")
                    urls.append(display_url)

    return urls

Thanks in advance.

youngmac
  • Are you doing that as an exercise to learn, or because [instaloader](https://github.com/instaloader/instaloader#readme) does not work for you? – mdaniel Jun 08 '20 at 03:27
  • Just as an exercise. I know instaloader and its powerful functions, but its code is a little hard to read for me, a new learner in scrapy. Thanks for your reply. – youngmac Jun 08 '20 at 09:51
  • You mention Scrapy, but your code does not look like it is using Scrapy at all. – Gallaecio Jun 15 '20 at 09:45

2 Answers

1

I've seen this sometimes when using this Python module instead of HTML-scraping. At least with that module, edge["node"]["videos"]["standard_resolution"]["url"] usually (but not always) gives a non-corrupted value.

tonycpsu
  • I tried out instaloader's API as suggested by @mdaniel above, and I'm still seeing video URLs with the corrupt `0_0_0_(null)` in them. [Here](https://gist.github.com/tonycpsu/f04f97c755a292d0a8243df1811cba52) is a gist of my test code that demonstrates the problem. If I go to [the post in question](https://www.instagram.com/p/B8JqUiIJHB9/), I can't play the video, and Firefox developer tools shows the broken URL as `https://instagram.fagc3-2.fna.fbcdn.net/v/t50.2886-16/0_0_0_%00.mp4?_[etc]`. Maybe just an IG issue with some posts having broken links to content on the server side? – tonycpsu Jun 09 '20 at 22:00
1

There's nothing wrong with your code. This is a known intermittent issue with Instagram, and other people have encountered it too: https://github.com/arc298/instagram-scraper/issues/545

There doesn't appear to be a known fix yet.
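
Until there is a fix, one option is to at least detect the broken URLs so you can skip or retry them. Below is a minimal sketch; the function name `looks_corrupted` is hypothetical, and the check is only a heuristic based on the broken file names quoted in this thread (`0_0_0_\x00.mp4`, `0_0_0_%00.mp4`), so it may miss other forms of the problem.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def looks_corrupted(video_url: str) -> bool:
    """Heuristic only: True if the file name starts with the '0_0_0_'
    placeholder seen in the broken URLs quoted in this thread."""
    # Take just the path, decode percent-escapes such as %00,
    # then keep the last path component (the file name).
    path = unquote(urlsplit(video_url).path)
    filename = posixpath.basename(path)
    return filename.startswith("0_0_0_")
```

In the question's loop you could call this on `video_url` before appending it, and log or skip the entry when it returns `True`.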


Also, while unrelated to your question, it's worth mentioning that you don't need to inspect the display_resources object to get the URL of the image:

display_url = edge['node']['display_resources'][-1]['src']

There is already a display_url property available (I'm guessing you saw this, based on the variable name?). So you can simply do:

display_url = edge['node']['display_url']
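
If you want to be defensive, here is a small sketch that prefers `display_url` and falls back to the last `display_resources` entry, as in your original code. The helper name `image_url` is hypothetical, and it assumes `node` is the dict found at `edge['node']` with the field names used in the question.

```python
def image_url(node: dict) -> str:
    """Return the image URL for one node (assumed shape: edge['node'])."""
    # Prefer the ready-made display_url property when it is present.
    url = node.get("display_url")
    if url:
        return url
    # Otherwise fall back to the last display_resources entry,
    # which is what the code in the question uses.
    return node["display_resources"][-1]["src"]
```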
Raleigh L.