Questions tagged [pyflink]

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. PyFlink makes it available to Python.

PyFlink makes all of Apache Flink available to Python and, at the same time, lets Flink benefit from Python's rich ecosystem of scientific computing libraries.

What is PyFlink on apache.org

258 questions
1
vote
0 answers

How to specify CREATE TABLE in Flink SQL when receiving data stream of non-primitive types (using PyFlink)?

A Flink SQL application receives data from an AWS Kinesis Data Stream. The received messages are in JSON, the schema is expressed in JSON Schema, and it contains a property that is not a primitive object, for example: { "$id":…
John
  • 10,837
  • 17
  • 78
  • 141
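Non-primitive JSON Schema properties typically map to Flink SQL's structured types in the DDL. A minimal sketch, assuming the Kinesis SQL connector is on the classpath; the table, stream, and field names (payload, tags) are made up for illustration:

```sql
CREATE TABLE events (
  id STRING,
  -- a nested JSON object maps to a ROW type in Flink SQL
  payload ROW<a STRING, b INT>,
  -- a free-form object of string values can instead be declared as a MAP
  tags MAP<STRING, STRING>
) WITH (
  'connector' = 'kinesis',
  'stream' = 'my-stream',        -- hypothetical stream name
  'aws.region' = 'us-east-1',
  'format' = 'json'
);
```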
1
vote
1 answer

Flink SQL behavior

I want to execute Flink SQL on batch data (CSVs in S3). However, I explicitly want Flink to execute my query in a streaming fashion, because I think it will be faster than batch mode. For example, my query consists of filtering on two tables and…
bumpbump
  • 542
  • 4
  • 17
1
vote
1 answer

python-archives Not A Directory Exception While Running Flink Job - PyFlink

I'm getting the following exception when running a PyFlink application. I'm using start-cluster.sh to start the Flink cluster and a Python virtual environment (/root/Python3.6/venv.zip) to run the Flink job. I've set the archive path in the…
Denorm
  • 466
  • 4
  • 13
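One common cause of this exception is how the archive is passed to Flink: it must be a zip file referenced via the CLI's --pyArchives / --pyExecutable options. A hedged command-line sketch, reusing the path from the question; the internal venv/bin/python3 layout is an assumption about how the zip was built:

```
flink run \
  -py my_job.py \
  --pyArchives /root/Python3.6/venv.zip \
  --pyExecutable venv.zip/venv/bin/python3
```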
1
vote
0 answers

Looking for an example of Pyflink with Kinesis

I have a Kinesis stream to which I want to listen using PyFlink. I have installed the apache-flink 1.12.2 package for Python 3. I saw a few examples of using Kinesis in Python (such as this one: https://stackoverflow.com/a/22403036) and…
user1322801
  • 839
  • 1
  • 12
  • 27
1
vote
1 answer

PyFlink 14.2 - Table API DDL - Semantic Exactly Once

I have a scenario where I define a Kafka source, a UDF | UDTF for processing, and a Kafka sink. No matter what I do, when I run the job the output is flooded with the processed output of a single input record. For illustrative purposes,…
Paul
  • 756
  • 1
  • 8
  • 22
1
vote
0 answers

flinksql read custom format data with json

I am trying to read streaming data from Kafka, where the log lines have the format below: 03-06-2022 02:56:130 INFO [...] [...] {"a": "abc", "b": 123} I want to extract the JSON part of the log line, i.e. the {"a": "abc", "b": 123} part, and…
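One way to handle mixed text-plus-JSON lines is to ingest each line as a single string and extract the JSON with SQL functions. A sketch, assuming Flink 1.15+ (for JSON_VALUE) and a hypothetical Kafka topic and broker:

```sql
CREATE TABLE raw_logs (
  line STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'logs',                                  -- hypothetical topic
  'properties.bootstrap.servers' = 'localhost:9092', -- hypothetical broker
  'scan.startup.mode' = 'latest-offset',
  'format' = 'raw'                                   -- one STRING column per record
);

-- pull the {...} part out of each line, then read fields from it
SELECT
  JSON_VALUE(payload, '$.a') AS a,
  CAST(JSON_VALUE(payload, '$.b') AS INT) AS b
FROM (
  SELECT REGEXP_EXTRACT(line, '(\{.*\})', 1) AS payload
  FROM raw_logs
);
```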
1
vote
1 answer

PyFlink "pipeline.classpaths" vs $FLINK_HOME/lib

What is the difference between loading classes passed via PyFlink's pipeline.classpaths config and putting them into the $FLINK_HOME/lib directory? When I want to use flink-sql-connector-kafka-*.jar, it works fine just passing it using…
literg
  • 482
  • 5
  • 13
1
vote
1 answer

The meaning of wait in execute_sql statement

I was wondering what the difference and implications are of executing SQL statements in PyFlink with and without the wait() call: t_env.execute_sql(query) t_env.execute_sql(query).wait() I experimented with both and see no difference in execution.
Dark Templar
  • 1,175
  • 13
  • 27
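For context, execute_sql() on an INSERT statement submits the job asynchronously and returns a TableResult, while wait() blocks the client until the job terminates. A minimal sketch; the table names are hypothetical, and the import is kept inside main() so the snippet's constants load without PyFlink installed:

```python
INSERT_STMT = "INSERT INTO sink_table SELECT * FROM source_table"  # hypothetical tables


def main():
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
    # ... CREATE TABLE statements for source_table / sink_table would go here ...

    # execute_sql() submits the INSERT job and returns a TableResult without
    # waiting for the job to finish
    result = t_env.execute_sql(INSERT_STMT)

    # wait() blocks until the job reaches a terminal state; without it a short
    # script may exit (and, in local mode, tear down the mini-cluster) before
    # the job is done. On an unbounded streaming job against a real cluster the
    # job never terminates, which is why the two variants can look identical.
    result.wait()


if __name__ == "__main__":
    main()
```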
1
vote
1 answer

Flink developer's role from watermarks perspective

I am a newbie to Flink and came across an article that said "A flink developer is responsible for moving event time forward by arranging the watermark in the stream". So I tried to work out the answer for myself. As far as I know, if I…
whatsinthename
  • 1,828
  • 20
  • 59
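In the SQL/Table API, "arranging the watermark" usually means declaring a watermark strategy on the event-time column. A DDL sketch with hypothetical names, using a 5-second bounded-out-of-orderness delay:

```sql
CREATE TABLE clicks (
  user_id STRING,
  ts TIMESTAMP(3),
  -- the developer chooses how event time advances: here watermarks trail the
  -- maximum seen timestamp by 5 seconds, bounding tolerated out-of-orderness
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'
);
```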
1
vote
0 answers

PyFlink reading Parquet files

I'm happily reading text files via env.read_text_file(file_path), but how can I read a Parquet file in PyFlink? I'm aware of https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/dataset/formats/parquet/ for Java/Scala ...but is…
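In the Table API, Parquet files can be declared as a filesystem table, which sidesteps the DataSet-era Java/Scala formats. A sketch, assuming the flink-parquet format dependency is available; the columns and path are made up:

```sql
CREATE TABLE parquet_source (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///path/to/data',   -- a directory or a single file
  'format' = 'parquet'
);
```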
1
vote
1 answer

pyflink tableAPI, multiple sources to single processing table sequence

I'm trying to implement a pyflink job (Via Table API) which does some basic processing from multiple sources, after the data from the sources gets converted into a standard format. I'm able to convert the data from each respective source into the…
Paul
  • 756
  • 1
  • 8
  • 22
1
vote
0 answers

Pyflink windowAll() by event-time to apply a clustering model

I'm a beginner with the PyFlink framework and would like to know if my use case is possible with it... I need to create tumbling windows and apply a Python UDF (a scikit-learn clustering model) on them. The use case is: every 30 seconds I want to apply…
1
vote
2 answers

Invalid SQL identifier - org.apache.flink.sql.parser.impl.ParseException: Encountered "TABLE" at line 2, column 16

I am trying to run a PyFlink job that takes data from a source Kafka topic and sinks it into HDFS. There is a weird SQL-related error that keeps arising. This is from the SQL statement in the Apache Flink (PyFlink) Table API sink: SQL: sql_statement_sink = """ …
3awny
  • 319
  • 1
  • 2
  • 10
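A ParseException of the form Encountered "TABLE" usually means a reserved word is being used as an identifier; in Flink SQL, reserved words must be escaped with backticks. A sketch with hypothetical columns and sink options:

```sql
CREATE TABLE hdfs_sink (
  -- reserved words used as column names must be backtick-quoted
  `table` STRING,
  `timestamp` TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = 'hdfs:///path/to/output',  -- hypothetical path
  'format' = 'json'
);
```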
1
vote
1 answer

PyFlink SQL local test

So I have a simple aggregation job written in the PyFlink SQL API. The job reads data from AWS Kinesis and outputs results to Kinesis. I am curious whether I can unit-test my pipeline with, say, pytest? I am guessing I need to mock the source and sink with…
Alfred
  • 1,709
  • 8
  • 23
  • 38
1
vote
0 answers

Does pyflink have a max function or max_by function?

I can't find the corresponding max or max_by function. This is my message (Java): public class TransformTest2_Rolling { public static void main(String[] args) throws Exception { StreamExecutionEnvironment env =…
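A version-independent way to get Java's maxBy(1) semantics in PyFlink is a keyed reduce that keeps the record with the larger value. A sketch with made-up sensor data; the PyFlink import is kept inside main() so the reduce function loads without PyFlink installed:

```python
def keep_max(a, b):
    """Reduce function that keeps the whole record with the larger reading
    (field 1), emulating Java's maxBy(1) rolling aggregation."""
    return a if a[1] >= b[1] else b


def main():
    from pyflink.common import Types
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    ds = env.from_collection(
        [('sensor_1', 35.8), ('sensor_1', 37.2), ('sensor_2', 15.4)],
        type_info=Types.TUPLE([Types.STRING(), Types.DOUBLE()]))
    # rolling per-key maximum: key by sensor id, keep the record with the
    # highest reading seen so far for that key
    ds.key_by(lambda r: r[0]).reduce(keep_max).print()
    env.execute('rolling_max_by')


if __name__ == "__main__":
    main()
```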