Questions tagged [apache-flink]

Apache Flink is an open source platform for scalable batch and stream data processing. Flink supports batch and streaming analytics in one system. Analytical programs can be written in concise and elegant APIs in Java and Scala.

// Scala DataSet (batch) API WordCount; path and outputPath are placeholders.
import org.apache.flink.api.scala._

case class WordWithCount(word: String, count: Int)

val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile(path)

val counts = text.flatMap { _.split("\\W+") }
  .filter { _.nonEmpty }              // drop empty tokens produced by split
  .map { WordWithCount(_, 1) }
  .groupBy("word")
  .sum("count")

counts.writeAsCsv(outputPath)
env.execute("WordCount")

These are some of the unique features of Flink:

  • Hybrid batch/streaming runtime that supports batch processing and data streaming programs.
  • Custom memory management to guarantee efficient, adaptive, and highly robust switching between in-memory and out-of-core data processing algorithms.
  • Flexible and expressive windowing semantics for data stream programs.
  • Built-in program optimizer that chooses the proper runtime operations for each program.
  • Custom type analysis and serialization stack for high performance.

Learn more about Flink at https://flink.apache.org/.

Building Apache Flink from Source

Prerequisites for building Flink:

  • Unix-like environment (We use Linux, Mac OS X, Cygwin)
  • git
  • Maven (at least version 3.0.4)
  • Java 6, 7 or 8 (Note that Oracle's JDK 6 library will fail to build Flink, but is able to run a pre-compiled package without problem)

Commands:

git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests

Flink is now installed in the build-target directory.

Developing Flink

The Flink committers use IntelliJ IDEA and Eclipse IDE to develop the Flink codebase.

Minimal requirements for an IDE are:

  • Support for Java and Scala (also mixed projects)
  • Support for Maven with Java and Scala

IntelliJ IDEA

The IntelliJ IDE supports Maven out of the box and offers a plugin for Scala development.

Check out our Setting up IntelliJ guide for details.

Eclipse Scala IDE

For Eclipse users, we recommend using Scala IDE 3.0.3, based on Eclipse Kepler. While this is a slightly older version, we found it to be the version that works most robustly for a complex project like Flink.

Further details, and a guide to newer Scala IDE versions can be found in the How to setup Eclipse docs.

Note: Before following this setup, make sure to run the build from the command line once (mvn clean install -DskipTests, see above)

  1. Download the Scala IDE (preferred) or install the plugin to Eclipse Kepler. See How to setup Eclipse for download links and instructions.
  2. Add the "macroparadise" compiler plugin to the Scala compiler. Open "Window" -> "Preferences" -> "Scala" -> "Compiler" -> "Advanced" and put into the "Xplugin" field the path to the macroparadise jar file (typically "/home/-your-user-/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar"). Note: If you do not have the jar file, you probably did not run the command line build.
  3. Import the Flink Maven projects ("File" -> "Import" -> "Maven" -> "Existing Maven Projects")
  4. During the import, Eclipse will ask to automatically install additional Maven build helper plugins.
  5. Close the "flink-java8" project. Since Eclipse Kepler does not support Java 8, you cannot develop this project.

Support

Don’t hesitate to ask!

Contact the developers and community on the mailing lists if you need any help.

Open an issue if you find a bug in Flink.

Documentation

The documentation of Apache Flink is located on the website: http://flink.apache.org or in the docs/ directory of the source code.

Fork and Contribute

This is an active open-source project. We are always open to people who want to use the system or contribute to it. Contact us if you are looking for implementation tasks that fit your skills. The How to Contribute guide on the Flink website describes how to contribute to Apache Flink.

About

Apache Flink is an open source project of The Apache Software Foundation (ASF). The Apache Flink project originated from the Stratosphere research project.

7452 questions
2 votes, 2 answers

Flink: DataSet.count() is a bottleneck - How to count in parallel?

I am learning Map-Reduce using Flink and have a question about how to efficiently count elements in a DataSet. What I have so far is this: DataSet ds = ...; long num = ds.count(); When executing this, in my flink log it says 12/03/2016…
user7246017
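
If DataSet.count() shows up as a bottleneck, a common workaround is to count per partition and sum the partial counts, so no single task has to see every element. A minimal sketch, assuming the Scala DataSet API; the data source is a stand-in:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val ds: DataSet[String] = env.fromElements("a", "b", "c") // stand-in data

// Each parallel task emits the size of its own partition; the partial
// counts are then reduced to a single sum.
val count: DataSet[Long] = ds
  .mapPartition { elements => Iterator(elements.size.toLong) }
  .reduce(_ + _)

count.print()
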
2 votes, 1 answer

Use C/C++ in Apache-Flink

My team and I are developing an application that makes use of Flink. The data will be processed using a computationally-heavy numerical algorithm. In order to optimize it as much as possible, I would like to write this algorithm in C/C++ rather than…
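
A common route is JNI: keep the numeric kernel in a native library and call it from a rich function, loading the library once per task. A minimal sketch; the library name, object and method are all hypothetical, and the C/C++ side must export the matching JNI symbol:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

// Hypothetical JNI binding to the native algorithm.
object NativeAlgo {
  @native def compute(value: Double): Double
}

class NativeMap extends RichMapFunction[Double, Double] {
  // Load the shared library once per task instance, not per record.
  override def open(parameters: Configuration): Unit =
    System.loadLibrary("nativealgo") // hypothetical library name

  override def map(value: Double): Double = NativeAlgo.compute(value)
}

// usage: someDataSet.map(new NativeMap)
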
2 votes, 1 answer

Where to find documentation about DAG optimizations?

Is there any documentation about which optimizations are performed (and how) when transforming the application DAG into the physical DAG?
Luis Alves
2 votes, 1 answer

Difference between DSMS, Storm and Flink

DSMS stands for Data Stream Management System. These systems allow users to submit queries that are continuously executed until removed by the user. Can systems such as Storm and Flink be seen as DSMSs, or are they something more…
2 votes, 1 answer

How to do a keyBy on the first tuple field of a batch DataSet

I am trying to convert my application from Flink stream processing to Flink batch processing. For the Flink data stream, I read strings from a pre-defined file with multiple JSON objects and do a flatMap from JSON objects to a Tuple3 collector (first…
flinkexplorer
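
In the DataSet (batch) API, the counterpart of keyBy is groupBy, which accepts tuple field positions. A minimal sketch with stand-in Tuple3 data:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val tuples: DataSet[(String, Int, String)] =
  env.fromElements(("a", 1, "x"), ("a", 2, "y"), ("b", 3, "z")) // stand-in data

// groupBy(0) groups on the first tuple field, like keyBy(0) on a DataStream.
val summed = tuples.groupBy(0).sum(1)
summed.print()
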
2 votes, 2 answers

Apache Flink: Window Functions and the beginning of time

In a WindowAssigner, an element gets assigned to one or more TimeWindow instances. In case of a sliding event time window, this happens in SlidingEventTimeWindows#assignWindows. In case of a window with size=5 and slide=1, an element with…
Jonas Gröger
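
Sliding event-time windows are aligned to the epoch (plus an optional offset), not to the first element, so with size=5 and slide=1 an element at timestamp 0 belongs to the five windows [-4,1), [-3,2), [-2,3), [-1,4) and [0,5). A minimal Scala sketch that mirrors this assignment logic (offset 0):

// Compute the [start, end) windows a timestamp falls into, mirroring
// SlidingEventTimeWindows' epoch-aligned assignment with offset 0.
def assignWindows(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] = {
  val lastStart = ts - (((ts % slide) + slide) % slide) // last start <= ts
  (lastStart to (ts - size + 1) by -slide).map(start => (start, start + size))
}

assignWindows(0L, size = 5L, slide = 1L)
// -> Vector((0,5), (-1,4), (-2,3), (-3,2), (-4,1))
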
2 votes, 1 answer

Apache Flink: Scope of ValueState in ConnectedStreams

I have a custom RichCoFlatMapFunction that uses a ValueState member. The docs say that the key/value interface is scoped to the key of the current input element. See…
Jonas Gröger
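
The state is scoped to the key of the element currently being processed, so both inputs must be keyed by the same key for flatMap1 and flatMap2 to see the same ValueState. A minimal sketch, assuming tuple-typed inputs keyed on their first field; all names are illustrative:

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Joins (id, value) elements with the latest (id, name) seen for that id.
class JoinLatest extends RichCoFlatMapFunction[(Int, Int), (Int, String), (Int, Int, String)] {
  private var name: ValueState[String] = _

  override def open(parameters: Configuration): Unit =
    name = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("latest-name", classOf[String]))

  // Called for elements of the first input; reads the keyed state.
  override def flatMap1(in: (Int, Int), out: Collector[(Int, Int, String)]): Unit = {
    val n = name.value()
    if (n != null) out.collect((in._1, in._2, n))
  }

  // Called for elements of the second input; updates the keyed state.
  override def flatMap2(in: (Int, String), out: Collector[(Int, Int, String)]): Unit =
    name.update(in._2)
}
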
2 votes, 1 answer

Using grok in flink streaming

The Flink pipeline is as follows: read messages (strings) from a Kafka topic, pattern-match via grok, convert to JSON format, then aggregate over a time window on a field extracted from the JSON. Below is the code for pattern matching using grok. …
user3351750
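
A Grok matcher is not serializable, so the usual pattern is to compile it in open() of a rich function on the task, rather than capturing it in the closure on the client. A minimal sketch, assuming the java-grok library (io.krakens.grok.api); the wrapper class is illustrative and API details may differ by version:

import java.util.{Map => JMap}

import io.krakens.grok.api.{Grok, GrokCompiler}
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration

class GrokParser(pattern: String) extends RichMapFunction[String, JMap[String, AnyRef]] {
  @transient private var grok: Grok = _

  // Build the non-serializable Grok instance on the task, not on the client.
  override def open(parameters: Configuration): Unit = {
    val compiler = GrokCompiler.newInstance()
    compiler.registerDefaultPatterns()
    grok = compiler.compile(pattern)
  }

  override def map(line: String): JMap[String, AnyRef] =
    grok.`match`(line).capture() // `match` needs backticks in Scala
}

// usage: kafkaStream.map(new GrokParser("%{COMBINEDAPACHELOG}"))
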
2 votes, 1 answer

Flink Streaming: From one window, lookup state in another window

I have two streams: Measurements WhoMeasured (metadata about who took the measurement) These are the case classes for them: case class Measurement(var value: Int, var who_measured_id: Int) case class WhoMeasured(var who_measured_id: Int, var name:…
Jonas Gröger
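
A minimal sketch of wiring the two streams together, following the same keyed RichCoFlatMapFunction pattern as the ValueState sketch above: key both streams by who_measured_id, connect them, and keep the latest name in keyed state. JoinLatestName is a hypothetical function analogous to JoinLatest above, typed on the two case classes; the sources are stand-ins:

import org.apache.flink.streaming.api.scala._

case class Measurement(var value: Int, var who_measured_id: Int)
case class WhoMeasured(var who_measured_id: Int, var name: String)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val measurements: DataStream[Measurement] = env.fromElements(Measurement(42, 1))
val whoMeasured: DataStream[WhoMeasured] = env.fromElements(WhoMeasured(1, "Alice"))

// Key both sides by the same field so the co-flatmap sees one state per id.
val enriched = measurements
  .keyBy(_.who_measured_id)
  .connect(whoMeasured.keyBy(_.who_measured_id))
  .flatMap(new JoinLatestName) // hypothetical RichCoFlatMapFunction, see above
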
2 votes, 0 answers

NullPointerException in JDBCInputFormat.open when trying to read DataSet from MS SQL

For processing with Apache Flink I am trying to create a DataSet from data given in a Microsoft SQL database. The test_table has two columns, "numbers" and "strings", which contain INTs and VARCHARs respectively. // supply row type…
danny
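
A frequent cause of a NullPointerException in JDBCInputFormat.open is a missing RowTypeInfo describing the selected columns. A minimal sketch, assuming the legacy flink-jdbc module; the driver, URL and credentials are placeholders, and class locations have shifted between Flink versions:

import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.api.scala._
import org.apache.flink.types.Row

// One entry per selected column, in query order.
val rowTypeInfo = new RowTypeInfo(
  BasicTypeInfo.INT_TYPE_INFO,    // "numbers" (INT)
  BasicTypeInfo.STRING_TYPE_INFO) // "strings" (VARCHAR)

val inputFormat = JDBCInputFormat.buildJDBCInputFormat()
  .setDrivername("com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .setDBUrl("jdbc:sqlserver://host:1433;databaseName=mydb") // placeholder
  .setUsername("user")                                      // placeholder
  .setPassword("secret")                                    // placeholder
  .setQuery("SELECT numbers, strings FROM test_table")
  .setRowTypeInfo(rowTypeInfo)
  .finish()

implicit val rowTypeInformation: TypeInformation[Row] = rowTypeInfo

val env = ExecutionEnvironment.getExecutionEnvironment
val rows: DataSet[Row] = env.createInput(inputFormat)
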
2 votes, 0 answers

Event sourcing in Flink

I have a Flink application that was implemented following the event-sourcing paradigm. Both events and commands are stored in several Kafka topics. The application has two startup modes: recovery and production. First, the recovery mode is used to…
user2108278
2 votes, 1 answer

How do I iterate over each message in a Flink DataStream?

I have a message stream from Kafka like the following DataStream messageStream = env.addSource(new FlinkKafkaConsumer09<>(topic, new MsgPackDeserializer(), props)); How can I iterate over each message in the stream and do something with…
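
There is no client-side loop over a DataStream; instead, a transformation such as map or flatMap is invoked once per message. A minimal sketch with a stand-in source in place of the Kafka consumer:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Stand-in for the Kafka source from the question.
val messageStream: DataStream[String] = env.fromElements("msg-1", "msg-2")

messageStream
  .map { msg => s"processed: $msg" } // called once per message
  .print()

env.execute("per-message-processing")
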
2 votes, 0 answers

Flink Timestamp monotony violated

I'm consuming data using a Kafka consumer 0.9 and, after some initial filtering for valid events and mapping to POJOs, I set an AscendingTimestampExtractor. Using the Joda Time library, I return the milliseconds since epoch as follows: return…
AtharvaI
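
AscendingTimestampExtractor warns about violated timestamp monotony whenever a timestamp is smaller than its predecessor within a parallel instance. If the data is only approximately ordered, a bounded out-of-orderness extractor is the usual alternative. A minimal sketch; the event type and field are hypothetical:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

case class MyEvent(timestampMillis: Long) // hypothetical event type

// Tolerates up to 10 seconds of out-of-order timestamps instead of
// warning on every violation of strict monotony.
class EventTimeExtractor
    extends BoundedOutOfOrdernessTimestampExtractor[MyEvent](Time.seconds(10)) {
  override def extractTimestamp(e: MyEvent): Long = e.timestampMillis // epoch millis
}

// usage: stream.assignTimestampsAndWatermarks(new EventTimeExtractor)
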
2 votes, 1 answer

Flink: cannot cancel a running job (streaming)

I want to run a streaming job. When I try to run it locally using start-cluster.sh and the Flink Web Interface, I have no problem. However, I am currently trying to run my job using Flink on YARN (deployed on Google Dataproc) and when I try to…
Mel-BR
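
When the job runs inside a YARN session, the CLI has to be pointed at that session. A sketch, assuming the flink CLI's -yid (YARN application id) option; the ids are placeholders:

# list running jobs inside the YARN session to find the job id
./bin/flink list -yid <yarnApplicationId>

# cancel the job through the same session
./bin/flink cancel -yid <yarnApplicationId> <jobId>
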
2 votes, 1 answer

Obtain KeyedStream from custom partitioning in Flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream you get a DataStream back and not a KeyedStream. On the other hand, you cannot override the partitioning strategy for…
affo
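
partitionCustom controls only where elements are shipped; it returns a plain DataStream because operations on a KeyedStream assume Flink's own hash partitioning of the key. A minimal sketch of the API in question; a subsequent keyBy is still needed for keyed state or windows:

import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[(Int, String)] = env.fromElements((1, "a"), (2, "b"))

// Route elements by the first tuple field.
val byFirstField = new Partitioner[Int] {
  override def partition(key: Int, numPartitions: Int): Int =
    math.abs(key.hashCode) % numPartitions
}

val partitioned: DataStream[(Int, String)] =
  events.partitionCustom(byFirstField, _._1) // a DataStream, not a KeyedStream
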