I am new to Flume. I have used Flume to stream data from Twitter using the Search API, but the "geo" key in the Twitter JSON is set to null. Is there a way to get Twitter data using the Streaming API in Flume?

Lalit Kumar B
Hussain Shaik

1 Answer

Please refer to this link. It helped me a lot when I tried to do the same some time ago. Basically, you have to do the following:

  • Create an application at https://dev.twitter.com/apps/ in order to generate the OAuth keys. This step is probably already done, since you say you have already queried Twitter in the past.
  • Download the Cloudera Flume source written specifically for Twitter from here and add that jar to the Flume classpath by editing conf/flume-env.sh and adding this line:

    FLUME_CLASSPATH="/home/training/Installations/apache-flume-1.3.1-bin/flume-sources-1.0-SNAPSHOT.jar"
    
  • Edit a Flume configuration file for a new Twitter agent called "TwitterAgent", something like:

    TwitterAgent.sources = Twitter
    TwitterAgent.channels = MemChannel
    TwitterAgent.sinks = HDFS
    
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels = MemChannel
    TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
    TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
    TwitterAgent.sources.Twitter.accessToken = <accessToken>
    TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
    
    TwitterAgent.sources.Twitter.keywords = <comma-separated list of keywords you are interested in>
    
    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
    
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100
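
A quick sanity check of the file before starting the agent can catch two easy mistakes in a configuration like the one above: a source or sink wired to a channel that was never declared, and a `<...>` placeholder left unfilled. The following is only a minimal sketch in Python, not part of Flume; the function names `parse_properties` and `check_agent` are hypothetical, and the parser assumes plain `key = value` lines with `#` comments.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return a list of human-readable problems found for one agent."""
    problems = []
    # Channels declared for the agent, e.g. "TwitterAgent.channels = MemChannel".
    channels = set(props.get(f"{agent}.channels", "").split())

    # A source lists its channels under "<agent>.sources.<name>.channels".
    for source in props.get(f"{agent}.sources", "").split():
        used = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not used:
            problems.append(f"source {source} has no channels")
        for channel in used:
            if channel not in channels:
                problems.append(
                    f"source {source} uses undeclared channel {channel}")

    # A sink reads from exactly one channel: "<agent>.sinks.<name>.channel".
    for sink in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{sink}.channel")
        if channel is None:
            problems.append(f"sink {sink} has no channel")
        elif channel not in channels:
            problems.append(f"sink {sink} uses undeclared channel {channel}")

    # Values such as "<consumerKey>" are placeholders that were never replaced.
    for key, value in props.items():
        if (key.startswith(agent + ".")
                and value.startswith("<") and value.endswith(">")):
            problems.append(f"{key} still holds a placeholder")
    return problems
```

For example, `check_agent(parse_properties(open("conf/flume.conf").read()), "TwitterAgent")` returns an empty list when the wiring is consistent and all keys have been filled in.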
    

Then, you are ready to start the Twitter Flume agent by issuing this command:

    $ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
nhahtdh
frb
  • Thanks frb, I have followed the same steps, but the Twitter JSON doesn't contain any value in the coordinates field and the geo value is set to null. – Hussain Shaik Apr 14 '15 at 11:07
  • Geolocation is very sensitive data. If you take a look at https://dev.twitter.com/overview/terms/geo-developer-guidelines, you'll see that geolocation must be enabled by the users, among other restrictions. Are you sure the tweets you are receiving must contain that kind of information? I mean, maybe you are crawling for general tweets, and it is highly probable that the users who wrote them did not enable the geolocation feature. But if you are analyzing very specific tweets that you know for certain must contain that info... then it seems to be a problem with Twitter. – frb Apr 14 '15 at 11:24
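
One way to tell which of these cases applies is to measure how many of the collected tweets carry any location at all. The sketch below is an illustration only: it assumes the files written by the HDFS sink have been copied locally (for instance with `hdfs dfs -get`) and that each line holds one tweet as a JSON object, which may not match every setup. The helper name `count_geotagged` and the file name in the usage note are hypothetical.

```python
import json


def count_geotagged(lines):
    """Count tweets overall and tweets with a non-null geo or coordinates."""
    total = 0
    tagged = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Skip partial or non-JSON lines rather than failing the whole run.
            continue
        if not isinstance(tweet, dict):
            continue
        total += 1
        # Either field being non-null means the user shared a location.
        if tweet.get("geo") is not None or tweet.get("coordinates") is not None:
            tagged += 1
    return total, tagged
```

Calling `count_geotagged(open("tweets.json", encoding="utf-8"))` returns a `(total, tagged)` pair. A small but non-zero `tagged` count would suggest the agent is working and that most users simply have geolocation disabled.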