1

I need to get a filtered sample of the Twitter stream.

I'm using tweepy. I checked the methods of the Stream class for getting a sample stream and for filtering,

but I didn't catch how I should set up the class.

Should it be

stream.filter(track=['']).sample()
stream.sample().filter(track=[''])

or each one on its own line, or something else?

And if you have another idea for how to get a sample stream based on keyword filters, please help.

Thanks in advance

Trenton McKinney
mohsen.21

3 Answers

2

The Twitter v2 API includes one endpoint for random sampling and another for filtered tweets.

The latter lets you specify a random sample percentage in the query (for example, sample:10 returns a random 10% sample).
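As a rough illustration of where the sample operator would go, here is a minimal sketch that builds a rule payload in the same shape as the rule dictionaries used by the script in this answer. The helper name `with_sample` is hypothetical, and the 1 to 100 range check is an assumption; check Twitter's operator documentation for the exact limits.

```python
def with_sample(rule_value, percent):
    """Append a sample:<percent> operator to a filtered-stream rule value.

    Hypothetical helper; the 1..100 bound is an assumption, not taken
    from the Twitter documentation.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    return f"{rule_value} sample:{percent}"


# Same structure as the "add" payload posted to the stream rules endpoint.
sample_rules = [
    {"value": with_sample("dog has:images", 10), "tag": "dog pictures, 10% sample"},
]
payload = {"add": sample_rules}
```

The resulting `payload` could be posted to the rules endpoint in place of the unsampled rules.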

Note that the v2 API is still new and currently has a cap of 500k tweets per month.

As an example of the latter, the following code (a modified version of this; see this doc) collects streaming data matching the cat or dog rules and writes it to a JSON file every 100 tweets. (Note: this does not include the random sampling query.)

import requests
import os
import json

import pandas as pd
# To set your environment variables in your terminal run the following line:
# export 'BEARER_TOKEN'='<your_bearer_token>'


data = []
counter = 0

def create_headers(bearer_token):
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    return headers


def get_rules(headers, bearer_token):
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream/rules", headers=headers
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot get rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))
    return response.json()


def delete_all_rules(headers, bearer_token, rules):
    if rules is None or "data" not in rules:
        return None

    ids = list(map(lambda rule: rule["id"], rules["data"]))
    payload = {"delete": {"ids": ids}}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload
    )
    if response.status_code != 200:
        raise Exception(
            "Cannot delete rules (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    print(json.dumps(response.json()))


def set_rules(headers, delete, bearer_token):
    # You can adjust the rules if needed
    sample_rules = [
        {"value": "dog has:images", "tag": "dog pictures"},
        {"value": "cat has:images -grumpy", "tag": "cat pictures"},
    ]
    payload = {"add": sample_rules}
    response = requests.post(
        "https://api.twitter.com/2/tweets/search/stream/rules",
        headers=headers,
        json=payload,
    )
    if response.status_code != 201:
        raise Exception(
            "Cannot add rules (HTTP {}): {}".format(response.status_code, response.text)
        )
    print(json.dumps(response.json()))


def get_stream(headers, set, bearer_token):
    global data, counter
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/stream", headers=headers, stream=True,
    )
    print(response.status_code)
    if response.status_code != 200:
        raise Exception(
            "Cannot get stream (HTTP {}): {}".format(
                response.status_code, response.text
            )
        )
    for response_line in response.iter_lines():
        if response_line:
            json_response = json.loads(response_line)
            print(json.dumps(json_response, indent=4, sort_keys=True))
            data.append(json_response['data'])
            if len(data) % 100 == 0:
                print('storing data')
                # Build the frame directly from the list of dicts; passing a raw
                # JSON string to pd.read_json is deprecated in recent pandas.
                pd.DataFrame(data).to_json(f'tw_example_{counter}.json', orient='records')
                data = []
                counter += 1



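def main():
    bearer_token = os.environ.get("BEARER_TOKEN")
    if not bearer_token:
        raise SystemExit("Set the BEARER_TOKEN environment variable first")
    headers = create_headers(bearer_token)
    rules = get_rules(headers, bearer_token)
    # Clear rules left over from earlier runs, then install ours.
    # Names avoid shadowing the built-in `set`.
    deleted = delete_all_rules(headers, bearer_token, rules)
    added = set_rules(headers, deleted, bearer_token)
    get_stream(headers, added, bearer_token)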



if __name__ == "__main__":
    main()


Then load one of the stored files into a pandas DataFrame, for example df = pd.read_json('tw_example_0.json', orient='records').
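Because the script writes one file per 100 tweets (tw_example_0.json, tw_example_1.json, and so on), it may be convenient to read all of them at once. The following is a minimal sketch using only the standard library; `load_chunks` is a hypothetical helper name, and it assumes each file holds a JSON array of tweet objects, as `to_json(..., orient='records')` produces.

```python
import glob
import json


def load_chunks(pattern="tw_example_*.json"):
    """Read every chunk file matching `pattern` into one list of tweet dicts."""
    records = []
    # sorted() gives lexicographic order, so tw_example_10 comes before tw_example_2.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records
```

The returned list can be passed to `pd.DataFrame(...)` if a DataFrame is preferred.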

KM_83
  • Thanks, but this still filters the complete main stream API. I want to get a sample stream through the API for it, with that stream filtered on keywords. I see that here the number of tweets is determined by the counter, so I would be writing my own sampling code, whereas the sample endpoint gives 1% of the stream and claims to have wide variation – mohsen.21 Oct 02 '20 at 16:31
  • @mohsen.21 you are right. The v2 API again has separate endpoints for sampled streaming and filtered streaming, not both functions together (and I edited the above answer). If you are using the filtering API, you can write code to sample about 1-2% of the time (say, a randomly selected 1-minute interval every hour). If you are using the sampling API, you can write code to filter the tweets and store the relevant data. – KM_83 Oct 02 '20 at 16:49
  • How can we get a stream of tweets only from posters with 1-10K followers? – mylord Mar 21 '23 at 11:08
0

I'd suggest reading the API documentation for tweepy. Here you can see how to filter the stream the way you want.

From reading other code snippets, I believe it should be done like this:

stream.filter(track=['Keyword'])
print(stream.sample())
Goldwave
0

As I understand it, tweepy uses the Twitter v1.1 API, which has separate endpoints for sampling and filtering tweets in real time.

Twitter API references: v1 sample-realtime, v1 filter-realtime.

Approach 1: one can get filtered stream data using stream.filter(track=['Keyword1', 'keyword2']) etc. and then sample records from the collected data.

import tweepy


class StreamListener(tweepy.StreamListener):  # listener class from tweepy 3.x
    def on_status(self, status):
        # do data processing and storing here
        pass

See examples such as https://www.storybench.org/how-to-collect-tweets-from-the-twitter-streaming-api-using-python/ and "Ignoring Retweets When Streaming Twitter Tweets".
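The sampling step of Approach 1 could look something like the sketch below, which draws a uniform random subset from records that have already been collected. `sample_records` and its parameters are hypothetical names chosen for illustration; the records can be any list, for example the statuses stored by the listener.

```python
import random


def sample_records(records, fraction, seed=None):
    """Return a uniform random subset holding about `fraction` of `records`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a seed makes the draw repeatable
    k = round(len(records) * fraction)
    return rng.sample(records, k)
```

For instance, `sample_records(collected, 0.1)` would keep roughly 10% of a list named `collected`.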

Approach 2: one can write a program that starts and stops streaming at random time intervals (for example, a randomly placed 3-minute window in every 15 minutes).
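One way to plan the windows for Approach 2 is sketched below: it picks one randomly placed window inside each period and returns the start and end offsets in minutes. The function name and parameters are hypothetical, and actually connecting and disconnecting the stream at those times is left to the caller.

```python
import random


def random_windows(total_minutes, period=15, window=3, seed=None):
    """Return one random (start, end) window, in minutes, for each full period."""
    if window > period:
        raise ValueError("window must not be longer than period")
    rng = random.Random(seed)
    windows = []
    for period_start in range(0, total_minutes - period + 1, period):
        # Place the window anywhere that keeps it fully inside its period.
        offset = rng.uniform(0, period - window)
        start = period_start + offset
        windows.append((start, start + window))
    return windows
```

With the defaults, one hour yields four windows, so the stream would be open for about 20% of the time.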

Approach 3: one can instead use the sampling API to collect data and then filter by keyword to store only the relevant data.
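The client-side filter in Approach 3 could be as simple as the following sketch, which would be called on each tweet's text before storing it. `matches_keywords` is a hypothetical helper, and plain case-insensitive substring matching is an assumption made for brevity; it is not the same as the matching rules Twitter applies to track keywords.

```python
def matches_keywords(text, keywords):
    """Return True if any keyword occurs in `text`, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)
```

Inside a listener, a tweet would then be stored only when `matches_keywords(status.text, ["cat", "dog"])` is true.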

KM_83
  • This seems like the logical approach, since Twitter itself, if it did provide the feature, would go through the same steps you mentioned. I'm still choosing which of the three approaches to take. The code is very helpful, by the way, thanks – mohsen.21 Oct 02 '20 at 19:49
  • @mohsen.21 so, later I did find a way to both filter and sample the stream via the Twitter v2 API. In filtered streaming, one can also add "sample:10" to get a 10% sample of the filtered stream. – KM_83 Oct 02 '20 at 20:08
  • Really? That's great. Where should it be added: when setting up the streamer class, or after launching the filtered stream? Also, does the v2 API work with tweepy? – mohsen.21 Oct 03 '20 at 12:58
  • @mohsen.21 it's the Twitter v2 API, which is still in early access and not used by Tweepy. An example of how to use the Twitter v2 API is provided in the script I posted in my other answer. If you want to sample, you can add a query condition like "sample:10" etc. Because v2 API filtered streaming has a 500k tweet cap per month (while v1.1 API filtered streaming does not seem to have such a cap), Twitter's documentation actually recommends adding a sampling condition like "sample:10". – KM_83 Oct 03 '20 at 15:56