
I have a script that stores incoming tweets for a phrase (e.g. "python") in database table "A" using the Twitter Streaming API. Later, another script searches for the same phrase using the Twitter Search API and stores the results in table "B". My question is why some tweets in "A" are not in "B", and vice versa.

I can think of one reason to have tweets in "B" and not in "A":

"A" only contains tweets posted after the Streaming API script started, while the Search API returns results from the last week. If the Streaming API script has been running for more than a week, there should not be any tweet in "B" that is not in "A".

I know two reasons to have some tweets in "A" and not in "B":

  1. The Search API only returns results from the last week, while the Streaming API returns everything.
  2. The Search API returns only a portion of the results, not all of them, as its focus is not on completeness.

I'd like to confirm whether I have this right.
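To make the comparison concrete, here is a minimal sketch of how the two tables could be diffed and each missing tweet bucketed by the reasons listed above. The schema (`tweet_id`, `created_at` as ISO text), the 7-day window, and the names `compare_tables`, `search_ran_at` and `stream_started` are assumptions made for illustration, not the actual scripts.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumption: the Search API covers roughly the last 7 days.
SEARCH_WINDOW = timedelta(days=7)


def compare_tables(conn, search_ran_at, stream_started):
    """Group tweet ids that appear in only one table by the likely cause."""
    a = dict(conn.execute("SELECT tweet_id, created_at FROM A"))
    b = dict(conn.execute("SELECT tweet_id, created_at FROM B"))
    parse = datetime.fromisoformat
    only_a = {i: parse(t) for i, t in a.items() if i not in b}
    only_b = {i: parse(t) for i, t in b.items() if i not in a}
    cutoff = search_ran_at - SEARCH_WINDOW
    return {
        # In A only, and too old for the search window.
        "a_older_than_search_window": sorted(i for i, t in only_a.items() if t < cutoff),
        # In A only, but recent: search did not return it.
        "a_missed_by_search": sorted(i for i, t in only_a.items() if t >= cutoff),
        # In B only, posted before the stream was listening.
        "b_before_stream_start": sorted(i for i, t in only_b.items() if t < stream_started),
        # In B only, posted while the stream was running.
        "b_missed_by_stream": sorted(i for i, t in only_b.items() if t >= stream_started),
    }


# Hypothetical sample data: the stream has been running for more than a week.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (tweet_id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE B (tweet_id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?)", [
    (1, "2015-09-12T00:00:00"),
    (2, "2015-09-20T00:00:00"),
    (3, "2015-09-21T00:00:00"),
])
conn.executemany("INSERT INTO B VALUES (?, ?)", [
    (3, "2015-09-21T00:00:00"),
    (5, "2015-09-21T12:00:00"),
])
result = compare_tables(
    conn,
    search_ran_at=datetime(2015, 9, 22),
    stream_started=datetime(2015, 9, 10),
)
print(result)
```

Any id that lands in `b_missed_by_stream` is a case my reasoning above does not explain.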

PHA

1 Answer


For "B" not in "A" you are correct. A strong indication of that comes from the Search API link you included:

It allows queries against the indices of recent or popular Tweets...

For "A" not in "B" you're correct as well, but with minor mistakes.

  1. The Streaming API will not return everything; it will only return 1% of the total tweets. The 1% filtering is done internally at Twitter, and there has been no indication of how it's done. There was an announcement not long ago about adjusting the sample to make it a true 1%, but I can't find the link where I read it.
  2. With the Streaming API you're also commonly affected by:
    • Public stream limit (reaching 1%)
    • Stall warnings (warning)

There are a few others, depending on your use: https://dev.twitter.com/streaming/overview/messages-types
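As a rough illustration of watching for those messages, here is a minimal sketch that classifies raw lines from a stream and tallies them. It assumes, based on my reading of the message types page, that limit notices look like `{"limit": {"track": N}}` with `N` being the running count of undelivered tweets since the connection opened, and that stall warnings look like `{"warning": {"code": "FALLING_BEHIND", ...}}`. The function names and the sample lines are made up for the example.

```python
import json


def classify_message(raw_line):
    """Return (kind, value) for one raw line of a streaming response."""
    line = raw_line.strip()
    if not line:
        # Blank lines are sent to keep the connection alive.
        return ("keep_alive", None)
    msg = json.loads(line)
    if "limit" in msg:
        return ("limit", msg["limit"].get("track"))
    if "warning" in msg:
        return ("warning", msg["warning"].get("code"))
    if "id_str" in msg and "text" in msg:
        return ("tweet", msg["id_str"])
    return ("other", None)


def summarise(lines):
    """Count tweets and track the highest undelivered count reported."""
    tweets, undelivered, warnings = 0, 0, []
    for raw in lines:
        kind, value = classify_message(raw)
        if kind == "tweet":
            tweets += 1
        elif kind == "limit":
            # Assumption: the count is cumulative, so keep the maximum.
            undelivered = max(undelivered, value or 0)
        elif kind == "warning":
            warnings.append(value)
    return {"tweets": tweets, "undelivered": undelivered, "warnings": warnings}


# Hypothetical sample of raw stream lines.
sample = [
    '{"id_str": "101", "text": "learning python"}',
    "",
    '{"limit": {"track": 12}}',
    '{"id_str": "102", "text": "python tips"}',
    '{"warning": {"code": "FALLING_BEHIND", "message": "queue filling", "percent_full": 60}}',
    '{"limit": {"track": 30}}',
]
summary = summarise(sample)
print(summary)
```

If `undelivered` stays at 0 for the life of the connection, that suggests (but does not prove) that nothing matching the filter was dropped by the rate limit.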

Leb
  • Thanks @Leb for the reply. I think I found the announcement about [adjustments to sample volumes](https://twittercommunity.com/t/potential-adjustments-to-streaming-api-sample-volumes/31628). However, it seems that unless you get "rate limit" messages in the stream, you're getting 100% of the tweets matching the criteria. If this is true, I will miss almost no tweets, as I track non-general phrases and 1% is still a large amount. – PHA Sep 22 '15 at 08:13
  • @PHA, I believe it's possible to get 100% of the 1%, but only if you have no filter at all. Since Twitter gets about [500M/day](http://www.internetlivestats.com/twitter-statistics/) as of 2013, 5M/day is pretty good. From my experience, there's a pool of data filtered by Twitter that is fed into the Streaming API we access. That's not confirmed either; it's just a theory based on my tests. – Leb Sep 22 '15 at 11:19
  • So what you're saying is: if I have a stream with 10 non-popular phrases and I get one tweet per minute on this stream, I might still miss some of the tweets for those 10 phrases? – PHA Sep 22 '15 at 11:34
  • I will be testing that scenario soon. My previous attempts showed that with 4 connections (the most that Twitter allowed me), all of them returned the same data. I wanted to test whether having multiple connections would increase the amount of data obtained, and that definitely wasn't the case. I'll keep you updated if you're interested. – Leb Sep 22 '15 at 13:14
  • It's good that your test was successful, so at least it is consistent, but it would be even more interesting to know whether it is actually complete for phrases with fewer tweets. I would really appreciate it if you could keep me updated. – PHA Sep 22 '15 at 14:13
  • I'll keep you updated as I do more research. To truly test completeness, it must be done against the firehose. Only then can it be confirmed. – Leb Sep 22 '15 at 14:59