After reading several Apache Flink documentation pages (official documentation, dataartisans) and the examples in the official repository, I keep seeing examples that use an already-downloaded file as the streaming data source and always connect to localhost.

I am trying to use Apache Flink to download JSON files that contain dynamic data. I would like to set the URL where the JSON file is available as the input source for Apache Flink, instead of downloading the file with another system and then processing the downloaded file with Apache Flink.

Is it possible to establish this network connection with Apache Flink?

Alvaro Gomez

1 Answer

You can define the URLs you want to download as your input DataStream and then download the documents from within a MapFunction. The following code demonstrates this:

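```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// The class name is only a wrapper so that the snippet compiles on its own.
public class UrlDownloadJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A finite stream holding the URLs to download.
        DataStream<String> inputURLs = env.fromElements("http://www.json.org/index.html");

        inputURLs.map(new MapFunction<String, String>() {
            @Override
            public String map(String s) throws Exception {
                StringBuilder builder = new StringBuilder();

                // try-with-resources closes the reader (and the underlying stream)
                // even if reading fails; I/O errors propagate to Flink instead of
                // being printed and ignored.
                try (BufferedReader bufferedReader =
                         new BufferedReader(new InputStreamReader(new URL(s).openStream()))) {
                    String line;
                    while ((line = bufferedReader.readLine()) != null) {
                        builder.append(line).append('\n');
                    }
                }

                return builder.toString();
            }
        }).print();

        env.execute("URL download job");
    }
}
```
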
Till Rohrmann
  • I ran the example code, but it runs only once and reads the whole file. It is not streaming; I thought it would continue reading when the JSON file grows. – zt1983811 Feb 22 '17 at 14:49
  • For that you would have to use the `ContinuousFileMonitoringFunction`. Streaming per se does not mean that the job will run infinitely long. This only happens if you have a non-finite source. But in this case the `env.fromElements` function produces a finite streaming source. Once this source reaches its end, the program terminates. – Till Rohrmann Feb 27 '17 at 12:51
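
To illustrate the difference between a finite and a non-finite source mentioned in the comment above, here is a minimal sketch of a polling loop that re-downloads a URL until it is cancelled. It uses only the Java standard library; `UrlPoller`, `fetcher`, and `intervalMillis` are hypothetical names, not Flink API. The `run`/`cancel` pair deliberately mirrors the shape of Flink's `SourceFunction`: inside a real source, the same loop would presumably live in `run(SourceContext<String> ctx)` and call `ctx.collect(...)` instead of `out.accept(...)`. Note that this sketch re-emits the whole document on every poll rather than only the newly added part.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Flink): re-fetches a document until cancelled. */
class UrlPoller {
    private final Callable<String> fetcher;   // how to obtain one snapshot of the document
    private final long intervalMillis;        // pause between two fetches
    private volatile boolean running = true;  // flipped by cancel(), possibly from another thread

    UrlPoller(Callable<String> fetcher, long intervalMillis) {
        this.fetcher = fetcher;
        this.intervalMillis = intervalMillis;
    }

    /** Convenience factory: poll the given URL with a plain download. */
    static UrlPoller forUrl(String url, long intervalMillis) {
        return new UrlPoller(() -> download(url), intervalMillis);
    }

    /** Reads the whole document behind the URL into a string. */
    static String download(String url) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }

    /** Emits one snapshot per interval; never ends on its own, only via cancel(). */
    void run(Consumer<String> out) throws Exception {
        while (running) {
            out.accept(fetcher.call());
            if (running) {
                Thread.sleep(intervalMillis);
            }
        }
    }

    /** Asks the loop in run() to stop after the current iteration. */
    void cancel() {
        running = false;
    }
}
```

For example, `UrlPoller.forUrl("http://www.json.org/index.html", 5000).run(System.out::println)` would print the page every five seconds until `cancel()` is called from another thread. Because the loop has no natural end, a source built this way keeps the job running, unlike `env.fromElements`.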