
I'm working on a web-crawling project that needs to collect user comments from each of 200,000 authors' videos on a video website. Recently, this website updated its URLs, adding a new parameter (_signature) to its API URL. Are there any suggestions for fetching this new parameter?

Example API URLs and the pages they refer to: https://www.ixigua.com/api/comment_module/video_comment?_signature=vhm.AAgEAy3U9zpQ3OMV74fpuAAOL1&item_id=6698972531753222663&group_id=6698972531753222663&offset=10 refers to: https://www.ixigua.com/i6698972531753222663/

https://www.ixigua.com/api/comment_module/video_comment?_signature=Xs6IHAAgEABXgvIJnklsal7OiAAAAI3&item_id=6699046583612211720&group_id=6699046583612211720&offset=10 refers to: https://www.ixigua.com/i6699046583612211720/
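To show the structure concretely, the signed API URL above decomposes into ordinary query parameters; a quick sketch using only the standard library (with the first example URL) pulls them apart:

```python
from urllib.parse import urlparse, parse_qs

# First example API URL from above
url = ("https://www.ixigua.com/api/comment_module/video_comment"
       "?_signature=vhm.AAgEAy3U9zpQ3OMV74fpuAAOL1"
       "&item_id=6698972531753222663"
       "&group_id=6698972531753222663&offset=10")

# parse_qs returns lists of values; flatten to single strings
params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
print(params["_signature"])                      # vhm.AAgEAy3U9zpQ3OMV74fpuAAOL1
print(params["item_id"] == params["group_id"])   # True
```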

What I have for reaching the original 200,000 authors is a list of their item_id/group_id values (I stored them in Amazon S3; you can find it in my code below). Also, item_id is the same as group_id. So to move on, all I need is the _signature.

For each author, the website assigns a unique _signature. In the example API URLs, the first author's _signature is vhm.AAgEAy3U9zpQ3OMV74fpuAAOL1 and the second's is Xs6IHAAgEABXgvIJnklsal7OiAAAAI3.
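To make the target concrete: once a _signature value is known (however it is eventually obtained), the new-style comment URL can be assembled from the item_id. The build_comment_url helper below is hypothetical, just to illustrate the URL shape; it does not solve the signature-fetching problem itself:

```python
from urllib.parse import urlencode

def build_comment_url(item_id, signature, offset=0):
    """Assemble the new-style comment API URL (hypothetical helper;
    the _signature still has to be captured separately)."""
    base = "https://www.ixigua.com/api/comment_module/video_comment"
    query = urlencode({
        "_signature": signature,
        "item_id": item_id,
        "group_id": item_id,   # item_id and group_id are identical
        "offset": offset,
    })
    return base + "?" + query
```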

This is where I'm running into trouble. I went through the website and found _signature under XHR as part of the query-string parameters, but I have no idea how to fetch it. Before this update, the API URL didn't include _signature, and my original code ran smoothly:

import json

import pandas as pd
import scrapy


class Id1Spider(scrapy.Spider):
    name = 'id1'
    allowed_domains = ['www.ixigua.com']

    # Load the item_id/group_id list from S3 (item_id == group_id)
    df = pd.read_csv('https://s3.amazonaws.com/xiguaid/group_id.csv')
    df = df.iloc[32640:189680, 1]
    list_id = df.unique().tolist()

    i = 0
    offset = 0
    count = 20
    user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/70.0.3538.102 Safari/537.36')
    start_urls = ["https://www.ixigua.com/api/comment/list/?group_id="
                  + str(list_id[i]) + "&item_id=" + str(list_id[i])
                  + "&offset=0&count=20"]

    def comment_url(self, group_id):
        # Old-style comment API URL (pre-_signature)
        return ("https://www.ixigua.com/api/comment/list/?group_id="
                + str(group_id) + "&item_id=" + str(group_id)
                + "&offset=" + str(self.offset) + "&count=" + str(self.count))

    def parse(self, response):
        data = json.loads(response.body)
        comments = data['data']['comments']
        total = data['data']['total']

        for ele in comments:
            try:
                item = {
                    'comments_id': ele['user']['user_id'],
                    'comments_text': ele['text'],
                    'reply_count': ele['reply_count'],
                    'digg_count': ele['digg_count'],
                    'create_time': ele['create_time'],
                    'item_id': self.list_id[self.i],
                }
            except KeyError:
                continue  # skip malformed comments instead of yielding stale data
            yield item

        if data['data']['has_more']:
            # Next page for the same author, capped at the total comment count
            self.offset = min(self.offset + self.count, total)
            yield scrapy.Request(url=self.comment_url(self.list_id[self.i]),
                                 callback=self.parse)
        else:
            # Move on to the next author
            self.offset = 0
            self.i += 1
            if self.i < len(self.list_id):
                yield scrapy.Request(url=self.comment_url(self.list_id[self.i]),
                                     callback=self.parse)
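The offset handling in the code above reduces to a simple rule: keep adding count while more pages remain, capped at total. A standalone sketch of that pagination logic:

```python
def next_offset(offset, count, total, has_more):
    """Return the next page offset, or None when pagination is done
    (mirrors the has_more/total handling in the spider above)."""
    if not has_more:
        return None
    return min(offset + count, total)
```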

Thanks in advance for any suggestions! Please let me know if I've missed any information needed to solve this problem, since this is my first post.

Sophia Z
    I think you would need to include more information such as how you got the initial list of 200,000 authors previously. You could also include multiple examples (like 2-3) so we can know if the `item_id` and `group_id` are expected to be the same for all requests, or should be dynamically determined. Would it be possible to include your original code that was working previously? – Reedinationer Jun 05 '19 at 20:13
  • Thank you @Reedinationer, I have already updated. – Sophia Z Jun 06 '19 at 00:09
  • Yeah that's more likely to get people to help answer it. I don't really have any experience with XHR inside browsers, but perhaps [this post](https://stackoverflow.com/a/26947063/10924296) would be useful! – Reedinationer Jun 06 '19 at 22:32

0 Answers