2

i'm using Flink SQL and the following scheme shows my source data (belonging to some Twitter data):

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` ROW(
            `text` STRING,
            `indices` INT ARRAY
        ) ARRAY,
        `urls` ROW(
            `indices` INT ARRAY,
            `url` STRING,
            `display_url` STRING,
            `expanded_url` STRING
        ) ARRAY,
        `user_mentions` ROW(
            `screen_name` STRING,
            `name` STRING,
            `id` BIGINT
        ) ARRAY
    )
)
WITH (...);

I want to get only the hashtags in a collection. Therefore i have to map the collection of constructed objects (ROW) to an array of STRING.

Like this scheme:

CREATE TABLE `twitter_raw` (
    `entities` ROW(
        `hashtags` STRING ARRAY,
        `urls` STRING ARRAY,
        `user_mentions` STRING ARRAY
    )
)
WITH (...);

How can i achieve this with Flink-SQL? Maybe built-in functions (JSON-functions?) or own UDF or do i have to write a DataStream Job?

Thanks in advance.

JanOels
  • 111
  • 11

2 Answers2

3

The SQL command UNNEST helps in this case. It is like EXPLODE in Spark.

You can solve it by creating a new row for each hashtag in the hashtags array:

SELECT hashtag, index
FROM twitter_raw 
CROSS JOIN UNNEST(hashtags) AS t (hashtag, index)
Peter Csala
  • 17,736
  • 16
  • 35
  • 75
JanOels
  • 111
  • 11
1

You can either define a computed row, or a VIEW, and then extracting the hashtags field using the dot notation. e.g.:

CREATE VIEW hashtags_raw (hashtags) AS 
  SELECT entities.hashtags AS hashtags FROM twitter_raw