
I'm still a novice with Python, and using multiprocessing is a big undertaking for me.

So my question is: how do I speed up crawling the comment sections of YouTube videos using the YouTube API together with multiprocessing?

This project involves crawling a few hundred thousand videos for their comments in a limited time. I understand that multiprocessing is used with ordinary scraping methods such as BeautifulSoup/Scrapy, but what about when I use the YouTube API?

If I use the YouTube API (which requires API keys) to crawl the data, will multiprocessing be able to distribute the work across multiple keys, or will it use the same one over and over again for different tasks?

To simplify: is it possible to use multiprocessing with code that relies on API keys, instead of the usual scraping methods that don't require them?
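
Here's a rough sketch of the kind of thing I'm imagining (assuming the `google-api-python-client` package; the keys and video IDs are placeholders, and the round-robin key assignment is just one possible scheme):

```python
from itertools import cycle
from multiprocessing import Pool

import googleapiclient.discovery

API_KEYS = ["KEY_1", "KEY_2", "KEY_3"]  # placeholder keys

def fetch_comments(task):
    """Fetch the first page of top-level comments for one video."""
    api_key, video_id = task
    # Each task builds a client bound to its own key, so worker
    # processes never share a client or a key object.
    youtube = googleapiclient.discovery.build(
        "youtube", "v3", developerKey=api_key
    )
    response = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100
    ).execute()
    return [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]

if __name__ == "__main__":
    video_ids = ["VIDEO_ID_1", "VIDEO_ID_2"]  # placeholder IDs
    # Pair each video with a key, round-robin, and fan out over processes.
    tasks = list(zip(cycle(API_KEYS), video_ids))
    with Pool(processes=len(API_KEYS)) as pool:
        for comments in pool.imap_unordered(fetch_comments, tasks):
            print(len(comments))
```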

Does anyone have any ideas?

nam
    **"This project is to crawl few 100000++ of videos for their comments in a limited time."** - Not possible due to the captchas you'll get after some requests. – Pedro Lobito Feb 13 '20 at 02:12
  • This is far too broad/vague. See: [tour], [ask], [help/on-topic]. – AMC Feb 13 '20 at 02:14
  • @PedroLobito I see. So it hasn't been done yet, and it's probably not even possible, right? – nam Feb 13 '20 at 02:19
  • @AMC How is it vague to you? How do I change it to make you understand better? – nam Feb 13 '20 at 02:20
  • @AMC I think we got off on the wrong foot there, mate. I'm not implying anything, nor am I insulting your intelligence. I apologize if it made you feel that way. Good day to you :) – nam Feb 13 '20 at 02:36
  • @zuli Under what conditions would _How is it vague to you? How do I change it to make you understand better?_ followed by _Sorry if you didn't understand the question. Feel free to edit in a way that will improve your understanding. I think I was clear enough but I might be wrong. To simplify for you_ not be insulting, exactly? I'd like you to keep in mind that you're the one asking others for help here. – AMC Feb 13 '20 at 02:38

1 Answer


This won't directly answer your question, but I suggest having a look at the YouTube API quota:

https://developers.google.com/youtube/v3/getting-started#calculating-quota-usage

By default, your project will have a quota of just 10,000 units per day, and retrieving comments costs between 1 and 5 units per comment (if you also want the data for the videos they're attached to, add another 21 units per video). Realistically, you'll only be able to retrieve about 2,000 comments per day via the API without filing a quota increase request, which can take weeks.
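
As a back-of-the-envelope check using the worst-case figures above (a rough illustration, not an exact quota calculator):

```python
# Back-of-the-envelope quota math using the worst-case figures above.
DAILY_QUOTA = 10_000    # default units per project per day
COST_PER_COMMENT = 5    # worst case of the 1-5 unit range
COST_PER_VIDEO = 21     # extra cost if you also fetch the video's data

# Comments alone: 10,000 / 5 = 2,000 comments per day.
print(DAILY_QUOTA // COST_PER_COMMENT)

# With video data, at (say) 100 comments per video:
cost_per_video_crawl = 100 * COST_PER_COMMENT + COST_PER_VIDEO  # 521 units
print(DAILY_QUOTA // cost_per_video_crawl)  # about 19 videos per day
```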

Edit: Google will generate code for you in the language of your choice for a given request. I'd recommend filling out the form here with your request and using the result as a starting point: https://developers.google.com/youtube/v3/docs/comments/list (click "Populate APIs Explorer" -> "See Code Samples" -> enter more info on the left).
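
For reference, the generated samples for this kind of request typically boil down to something like the sketch below. One assumption to flag: the page linked above documents `comments.list`, but for listing a video's top-level comments the related `commentThreads.list` endpoint is the usual starting point, so that's what the sketch uses; `API_KEY` and `VIDEO_ID` are placeholders.

```python
import googleapiclient.discovery

# Build a client with a placeholder key.
youtube = googleapiclient.discovery.build(
    "youtube", "v3", developerKey="API_KEY"
)

comments = []
request = youtube.commentThreads().list(
    part="snippet",
    videoId="VIDEO_ID",  # placeholder video ID
    maxResults=100,      # maximum page size for this endpoint
)
while request is not None:
    response = request.execute()
    comments.extend(
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    )
    # list_next returns a request for the next page, or None when done.
    request = youtube.commentThreads().list_next(request, response)

print(f"fetched {len(comments)} comments")
```

Every page request consumes quota, so a loop like this burns through the default daily budget very quickly on popular videos.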

Luke West
  • Yeah, I have looked at the documentation and I'm aware of the quota limitations. Apparently I was given the challenge of crawling thousands of videos for comments using the YouTube API within a limited time, with multiprocessing on top of that, so I just wasn't sure whether this was doable or an impossible feat. Thanks for the input anyway. If you find out anything more, do respond here. Thanks – nam Feb 13 '20 at 02:24
  • You can do it. I do it all the time. It's called "distributed crawling". However, you can't go through their API. The quota alone will choke you after video 4. You have to scrape it. If you try to subvert the quota using multiple projects with different API keys, your account will be suspended. – M4cJunk13 Feb 17 '20 at 19:53