I have Server X, which has Hadoop and Flume installed, and Server Y, which has neither but is on the same network.
Server Y currently stores data in a log file that is written to continuously until a date stamp is appended at the end of the day and a new log file is started.
The goal is to live stream the logs from Server Y into Server X using Flume, process the data, and place it into HDFS.
I see three options:

1. Have the syslog daemon on Server Y forward its events to Server X over TCP, where a Flume agent would receive them with a syslog source (see the sketch below). I believe this is the best approach, but there are a lot of hoops to jump through within the organization just to find out whether it is even possible.
2. Somehow read the log file directly from its directory on Server Y. The problem here is that Server Y does not have Flume installed, and installing it there is out of the question.
3. Mount the directory from Server Y onto Server X and treat it as a spooling directory.

The problem with options 2 and 3 is that the incoming data may not be live, and data may be lost during the transition at the end of each day. There is also an authentication issue: accessing Server Y requires logging in with a separate username and password, and we obviously can't hardcode those credentials into the connection configuration.
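For concreteness, this is roughly the configuration I have in mind for option 1: a single agent running only on Server X, with a syslog TCP source receiving the forwarded events and an HDFS sink writing them out. The agent name, port, and HDFS path below are placeholders, not values from my environment, and I haven't tested any of this:

```
# Flume agent on Server X only -- nothing runs on Server Y.
# "agent1", the port, and the HDFS path are placeholders.
agent1.sources  = syslog-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Syslog TCP source: listens for events forwarded by Server Y's syslog daemon.
agent1.sources.syslog-src.type     = syslogtcp
agent1.sources.syslog-src.host     = 0.0.0.0
agent1.sources.syslog-src.port     = 5140
agent1.sources.syslog-src.channels = mem-ch

# Simple in-memory channel between source and sink.
agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink: writes the events into date-bucketed directories in HDFS.
agent1.sinks.hdfs-sink.type                   = hdfs
agent1.sinks.hdfs-sink.hdfs.path              = hdfs:///flume/serverY/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType          = DataStream
agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfs-sink.channel                = mem-ch
```

My understanding is that this would be started with something like `flume-ng agent -n agent1 -f serverx-agent.conf -c conf`, and that for option 3 the source section would instead be a spooldir source pointed at the mounted directory, though the rotation and credential issues above would still apply.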
My main question is: does Flume need to be installed on the source server for this setup to work? Can the Flume agent run exclusively on Server X? Which option would be ideal?