
I wrote this small program to read the files in a folder into a data stream:

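import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class TextFromDirStream {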

//
//  Program
//

public static void main(String[] args) throws Exception {

    // set up the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment
            .getExecutionEnvironment();

    // monitor directory, checking for new files
    // every 100 milliseconds
    TextInputFormat format = new TextInputFormat(
            new org.apache.flink.core.fs.Path("file:///tmp/dir/"));

    DataStream<String> inputStream = env.readFile(
            format,
            "file:///tmp/dir/",
            FileProcessingMode.PROCESS_CONTINUOUSLY,
            100,
            FilePathFilter.createDefaultFilter());

    inputStream.print();

    // execute program
    env.execute("Java read file from folder Example");
}

}

My next step is to process the file content (a CSV). What is the most effective way to do this? Should I change my code to parse the text in inputStream and transform each line into a Tuple, or read the file as CSV from the beginning? I ask because I have had difficulty finding examples or documentation on how to split text into a tuple.

Thank you in advance

1 Answer


Starting from your code, each event in your stream (inputStream) is one line of a file, as a String. You can simply map each line to a TupleX:

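DataStream<Tuple2<Long, String>> parsedStream = inputStream
   .map((line) -> {
     String[] cells = line.split(",");
     // Keep only the third cell (parsed as a Long) and the first cell
     return new Tuple2<>(Long.parseLong(cells[2]), cells[0]);
   })
   // A lambda loses its generic return type to erasure, so declare it explicitly.
   // Types is org.apache.flink.api.common.typeinfo.Types (available in recent Flink versions).
   .returns(Types.TUPLE(Types.LONG, Types.STRING));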
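
A real CSV file often contains a header row or malformed lines, and the map above would throw on those. As a rough sketch, the per-line parsing could be pulled out into a plain helper that returns an empty result instead of failing. The class name CsvLineParser and the method parse are hypothetical (they are not part of Flink), and the sketch assumes a simple comma-separated format with no quoted or escaped commas:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Flink: turns one CSV line into (third cell as Long, first cell).
final class CsvLineParser {

    private CsvLineParser() {}

    // Returns Optional.empty() for lines that are null, too short, or whose third cell is not a number.
    // Assumption: plain comma-separated cells, no quoting or escaping.
    static Optional<Map.Entry<Long, String>> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // A limit of -1 keeps trailing empty cells, so "a,b," still yields three cells.
        String[] cells = line.split(",", -1);
        if (cells.length < 3) {
            return Optional.empty();
        }
        try {
            long id = Long.parseLong(cells[2].trim());
            Map.Entry<Long, String> entry = new SimpleImmutableEntry<>(id, cells[0].trim());
            return Optional.of(entry);
        } catch (NumberFormatException e) {
            // Header rows and corrupt values end up here.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("alice,ignored,42"));      // prints Optional[42=alice]
        System.out.println(parse("name,other,notANumber")); // prints Optional.empty
    }
}
```

Inside the Flink job, a flatMap that emits a Tuple2 only when the result is present would be one way to skip bad lines rather than fail the whole job.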

You can also use readCsvFile, which supports field selection and can produce TupleX or POJO types directly (but there is no PROCESS_CONTINUOUSLY mode with readCsvFile). Also note that with PROCESS_CONTINUOUSLY, any file that is modified is reprocessed in its entirety, which is not compatible with exactly-once semantics!

Eric Taix
Hi Eric, thank you very much for your answer. I suppose a small correction is required, no? => ... return new Tuple2(Long.parseLong(cells[2]), cells[0]); }); – Ignatius J. Reilly Apr 03 '17 at 17:18