1

We are saving tweets in a directory structure like /user/flume/2016/06/28/13/FlumeData..., but more than 100 FlumeData files are created each hour. I changed TwitterAgent.sinks.HDFS.hdfs.rollSize to 52428800 (50 MB), but the same thing happened. After that I tried changing the rollCount parameter too, but that didn't work either. How can I set the parameters to get one FlumeData file per hour?

mgurcan

3 Answers

0

What about rollInterval? Did you set it to zero? If you did, the issue might be something else. If rollInterval is set to a nonzero value, it can effectively override rollSize and rollCount: the file is rolled when the interval expires, even if it has not yet reached the rollSize value. Also check the HDFS block size you set. If it is too small, that might also cause file rolling.

Try this:

    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hpc01:8020/user/flume/tweets/%Y/%m/%d/%H
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
    TwitterAgent.sinks.HDFS.hdfs.rollInterval = 3600
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 1000
    TwitterAgent.channels.MemChannel.transactionCapacity = 100
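
As a rough way to see which roll triggers a configuration leaves active, here is a minimal Python sketch. It assumes the HDFS sink defaults listed in the Flume User Guide (rollInterval 30 seconds, rollSize 1024 bytes, rollCount 10 events) and treats 0 as "disabled". The helper names `parse_properties` and `active_roll_triggers` are made up for this illustration and are not part of Flume.

```python
# Defaults as listed in the Flume User Guide for the HDFS sink (assumed here).
HDFS_ROLL_DEFAULTS = {"rollInterval": 30, "rollSize": 1024, "rollCount": 10}


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def active_roll_triggers(props, agent, sink):
    """Return {setting: value} for every roll trigger that is not disabled (0)."""
    prefix = f"{agent}.sinks.{sink}.hdfs."
    active = {}
    for name, default in HDFS_ROLL_DEFAULTS.items():
        # A key that is missing (or misspelled) falls back to the default.
        value = int(props.get(prefix + name, default))
        if value != 0:
            active[name] = value
    return active


conf = """
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 3600
"""
print(active_roll_triggers(parse_properties(conf), "TwitterAgent", "HDFS"))
# prints {'rollInterval': 3600}
```

With the settings above, only the one-hour interval remains active. Under the same assumptions, a key that is left out or misspelled silently falls back to its default, so the corresponding trigger stays active.
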
ViKiG
  • rollInterval was not set, but I think the default value is already zero, and our HDFS block size is 128 MB. – mgurcan Jul 11 '16 at 07:07
  • Can you post the Flume configuration file here? Initially I had the same problem. I could not get down to 1 file per hour because of memory errors (the channel capacities were not working properly), but I brought it down to 4 to 5 files per hour, each up to 8 MB. I set the channel capacity to 1000 and the transaction capacity to 100, then set the file size to 8000000 and the other roll parameters to zero. Importantly, raise the batch size (I set it to 100) in line with the channel capacity. Try that and let me know your results. – ViKiG Jul 11 '16 at 07:26
  • You have set the roll count to 10; change it to zero. The roll count is causing those hundreds of files. Increase the batch size to 100 or more. Set the roll interval to 1 hour (3600 seconds) and see what happens. – ViKiG Jul 11 '16 at 09:13
  • Try the configuration I added in my answer. – ViKiG Jul 11 '16 at 10:24
  • You mean it did not work, and it is still creating a large number of small files? – ViKiG Jul 11 '16 at 10:39
  • Hey, `rollInterval` is misspelled in your configuration file. – ViKiG Jul 11 '16 at 10:57
  • I think it is correct. I copied and pasted it from here: [https://flume.apache.org/FlumeUserGuide.html]. – mgurcan Jul 11 '16 at 11:21
  • But I am trying again from the beginning. – mgurcan Jul 11 '16 at 11:22
  • I tried a roll interval of 10 (just as a test) and I think it is working now. I will wait 1 hour. Thanks for the quick reply. – mgurcan Jul 11 '16 at 11:53
  • If `rollInterval=10`, it will create a new file every 10 seconds. I believe that is not what we want. – ViKiG Jul 11 '16 at 12:05
  • Sure, I set that value only to check that it works, then set it back to 3600. It is working now: one file per hour. – mgurcan Jul 11 '16 at 13:10
  • Good. Is it the same configuration as the one in my answer (I updated it)? If yes, accept my answer. – ViKiG Jul 11 '16 at 14:14
  • Before you accept my solution as correct, try running the configuration for at least 5 hours (without any errors). If it throws any errors, post them back here. – ViKiG Jul 11 '16 at 14:18
0

    TwitterAgent.sinks.HDFS.channel = MemChannel
    TwitterAgent.sinks.HDFS.type = hdfs
    TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hpc01:8020/user/flume/tweets/%Y/%m/%d/%H
    TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize = 1
    TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
    TwitterAgent.sinks.HDFS.hdfs.rollCount = 10
    TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
    TwitterAgent.channels.MemChannel.type = memory
    TwitterAgent.channels.MemChannel.capacity = 10000
    TwitterAgent.channels.MemChannel.transactionCapacity = 1000
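
To illustrate why rollCount = 10 on its own can produce hundreds of files, here is a back-of-the-envelope sketch in Python. It assumes rollCount is the only active trigger (rollSize and rollInterval are 0, as in this configuration) and that a file is rolled after exactly that many events. The event rate of 1500 per hour is a made-up figure, not a measurement from this setup, and `files_per_hour` is a hypothetical helper.

```python
import math


def files_per_hour(events_per_hour, roll_count):
    """Estimate files written per hour when only rollCount triggers a roll."""
    if roll_count <= 0:
        raise ValueError("roll_count must be positive; 0 disables this trigger")
    # Each file holds roll_count events; a partial last file still counts.
    return math.ceil(events_per_hour / roll_count)


# Hypothetical rate: 1500 tweets per hour with rollCount = 10.
print(files_per_hour(1500, 10))  # prints 150
```

Even a modest stream crosses 100 files per hour at that setting, which matches the symptom in the question.
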
mgurcan
  • Hey, check your answer here: `rollInterval` is written as `rollIntInterval`. That's incorrect. – ViKiG Jul 11 '16 at 11:28
0

I resolved this problem by setting rollInterval=3600, rollCount=0, and batchSize=100 in flume.conf, as @vkgade suggested.

mgurcan