I am new to Flume. I have used Flume to stream data from Twitter using the Search API, but the "geo" key in the Twitter JSON is set to null. Is there a way to get Twitter data using the Streaming API in Flume?
1 Answer
Please refer to this link. It helped me a lot when I tried to do the same some time ago. Basically, you have to do the following:
- Create an application in https://dev.twitter.com/apps/ in order to generate the OAuth keys. This step is probably already done since you say you have already queried Twitter in the past.
- Download the Cloudera sources specifically designed for Twitter from here and add the resulting jar to the Flume classpath by editing conf/flume-env.sh and adding this line:
FLUME_CLASSPATH="/home/training/Installations/apache-flume-1.3.1-bin/flume-sources-1.0-SNAPSHOT.jar"
- Edit a Flume configuration file for a new Twitter agent called "TwitterAgent", something like:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = <comma-separated list of keywords you are interested in>
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
Then, you are ready to start the Twitter Flume agent by issuing this command:
$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
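Before starting the agent, it can help to sanity-check that every source and sink in the configuration is bound to a declared channel, since a mistyped channel name is an easy mistake to make. The following is only a minimal sketch in Python, not part of Flume: the helper names `parse_properties` and `check_wiring` are hypothetical, the parser handles only plain `key = value` lines, and the sample text is a trimmed copy of the configuration above. It relies on the fact that Flume sources use the plural `channels` key while sinks use the singular `channel` key.

```python
def parse_properties(text):
    """Parse plain 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_wiring(props, agent):
    """Return a list of channel-binding problems for the given agent name."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems = []
    for source in props.get(f"{agent}.sources", "").split():
        # Sources use the plural key "channels" (one source may feed several).
        bound = props.get(f"{agent}.sources.{source}.channels", "").split()
        if not bound:
            problems.append(f"source {source}: no channels configured")
        for channel in bound:
            if channel not in declared:
                problems.append(f"source {source}: unknown channel {channel}")
    for sink in props.get(f"{agent}.sinks", "").split():
        # Sinks use the singular key "channel" (a sink drains exactly one).
        channel = props.get(f"{agent}.sinks.{sink}.channel")
        if channel is None:
            problems.append(f"sink {sink}: no channel configured")
        elif channel not in declared:
            problems.append(f"sink {sink}: unknown channel {channel}")
    return problems


# Trimmed copy of the configuration from the answer, used as sample input.
SAMPLE = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.channels.MemChannel.type = memory
"""

if __name__ == "__main__":
    for problem in check_wiring(parse_properties(SAMPLE), "TwitterAgent"):
        print(problem)
```

To check a real file, read its contents with `open("conf/flume.conf").read()` and pass the result to `parse_properties`; an empty list from `check_wiring` means the bindings are consistent.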
- Thanks, frb. I followed the same steps, but the Twitter JSON doesn't contain any value in the coordinates field, and the geo value is set to null. – Hussain Shaik Apr 14 '15 at 11:07
- Geolocation is very sensitive data. If you take a look at https://dev.twitter.com/overview/terms/geo-developer-guidelines, you'll see that geolocation must be enabled by the users, among other restrictions. Are you sure the tweets you are receiving should contain this kind of information? I mean, maybe you are crawling general tweets, and it is highly probable that the users who posted them did not enable the geolocation feature. But if you are analyzing very specific tweets that you know for certain must contain that info, then it seems to be a problem with Twitter. – frb Apr 14 '15 at 11:24
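One way to tell whether the null values reflect ordinary user settings is to measure how many of the collected tweets carry any location at all. The sketch below is a rough illustration, assuming the HDFS sink wrote one tweet JSON object per line and that the file has been copied locally; the function name `geo_summary` and the sample tweets are made up for the example. It uses the `coordinates`, `geo`, and `place` fields of the tweet JSON.

```python
import json


def geo_summary(lines):
    """Count tweets with an exact point, with only a place, and in total."""
    total = exact_point = place_only = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue
        total += 1
        # "coordinates" (GeoJSON point) or the older "geo" field: exact location.
        if tweet.get("coordinates") or tweet.get("geo"):
            exact_point += 1
        # "place" alone means the user tagged a named place, not a precise point.
        elif tweet.get("place"):
            place_only += 1
    return {"total": total, "exact_point": exact_point, "place_only": place_only}


# Made-up sample records, for illustration only.
SAMPLE_LINES = [
    '{"id": 1, "text": "no location", "geo": null, "coordinates": null, "place": null}',
    '{"id": 2, "text": "exact point", '
    '"geo": {"type": "Point", "coordinates": [40.4, -3.7]}, '
    '"coordinates": {"type": "Point", "coordinates": [-3.7, 40.4]}, "place": null}',
    '{"id": 3, "text": "place only", "geo": null, "coordinates": null, '
    '"place": {"full_name": "Madrid, Spain"}}',
    "not json",
]

if __name__ == "__main__":
    print(geo_summary(SAMPLE_LINES))
```

If the share of tweets with an exact point is small but not zero, the pipeline is passing the fields through and the nulls simply reflect users who have not enabled geolocation.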