
Is it possible to use a JSON file sink in the Table API and/or DataStream API, the same way as for CSV?

Thanks!

Code

my_sink_ddl = f"""
    create table mySink (
        id STRING,
        dummy_item STRING
    ) with (
        'connector.type' = 'filesystem',
        'format.type' = 'json',
        'connector.path' = 'output.json'
    )
"""

Error

TableException: findAndCreateTableSink failed.
– py-r
1 Answer


Yes. According to the Jira issue FLINK-17286 (Integrate json to file system connector) and the corresponding pull request, [FLINK-17286][connectors / filesystem] Integrate json to file system connector #12010, it is possible starting from Flink 1.11. Prior to Flink 1.11, I believe it was not supported.

You need to use the following config:

... with (
        'connector' = 'filesystem',
        'format' = 'json',
        'path' = 'output_json' -- This must be a directory
    )
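For reference, here is a minimal sketch of the corrected DDL assembled in Python. The table name and columns mirror the question; the pyflink calls are shown only as comments, since they need a running Flink environment:

```python
# Sketch of the corrected sink DDL for Flink 1.11+ (new-style option keys).
# Note: 'path' points to a directory, not a single file.
sink_ddl = """
    CREATE TABLE mySink (
        id STRING,
        dummy_item STRING
    ) WITH (
        'connector' = 'filesystem',
        'format' = 'json',
        'path' = 'output_json'
    )
"""

# With a blink-planner batch environment, you would then register the sink
# and emit rows roughly like this (hypothetical INSERT for illustration):
#   t_env.execute_sql(sink_ddl)
#   t_env.execute_sql("INSERT INTO mySink VALUES ('1', 'foo')")
```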

Plus the following environment definition:

from pyflink.table import BatchTableEnvironment, EnvironmentSettings

t_env = BatchTableEnvironment.create(
    environment_settings=EnvironmentSettings.new_instance()
        .in_batch_mode()
        .use_blink_planner()
        .build()
)
Mikalai Lushchytski
  • Thanks for your reply Mikalai. Any idea how to correctly specify the sink then? – py-r Nov 06 '20 at 10:38
  • Please have a look at this doc for Flink 1.11, it might help: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/filesystem.html From what I know, it should be with ( 'connector' = 'filesystem', 'format' = 'json', 'path' = 'output.json' ) – Mikalai Lushchytski Nov 06 '20 at 11:16
  • Using those attribute names leads now to another error: TableException: BatchTableSink or OutputFormatTableSink required to emit batch Table. – py-r Nov 06 '20 at 12:52
  • Could you please confirm you are using flink 1.11 ? – Mikalai Lushchytski Nov 06 '20 at 12:55
  • Also, according to the documentation, the path parameter is specified for a directory not for a file and you can’t get a human-readable file in the path that you declare. So, `path` should be a directory. – Mikalai Lushchytski Nov 06 '20 at 13:01
  • Version is 1.11.2. I just changed `path` to empty directory using `file:///path/to/whatever` syntax. Same error. – py-r Nov 06 '20 at 13:15
  • What environment do you use, `BatchTableEnvironment`? Could you please provide the environment configuration piece of code? – Mikalai Lushchytski Nov 06 '20 at 13:18
  • `env = ExecutionEnvironment.get_execution_environment()` `t_config = TableConfig()` `t_env = BatchTableEnvironment.create(env, t_config)` – py-r Nov 06 '20 at 13:25
  • Could you please try this one instead? `t_env = BatchTableEnvironment.create( environment_settings=EnvironmentSettings.new_instance().in_batch_mode().use_blink_planner().build())` – Mikalai Lushchytski Nov 06 '20 at 13:56
  • If you have more info, feel free to share. Thanks again Mikalai ! – py-r Nov 06 '20 at 19:14
  • I think the key point is the table planner selected - blink vs flink. My understanding is that the old planner (flink) does not support new converters and new syntax and this causes the error. – Mikalai Lushchytski Nov 07 '20 at 07:00
  • Also, the default execution mode is streaming, so you need to explicitly set it to batch. I'd expect that the default planner is Blink, though, since it is the default in the Table API from Flink 1.10. Like Mikalai said, most new features are not backported to the old planner, but built on the "new" Blink planner instead. Thanks for working through the hiccups! I'll patch up the PyFlink documentation as it's really not clear for batch jobs. – morsapaes Nov 09 '20 at 10:58