Structured and unstructured data integration with large-scale data processing engines

Question

How do data processing engines like Spark and Apache Flink integrate structured, semi-structured, and unstructured data, and how does this affect computation?

"too broad": There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs. — maasg, Apr 12 '15 at 19:20

score 1 · Accepted Answer · answered Apr 12 '15 at 22:57

General-purpose data processing engines like Flink or Spark let you define your own data types and functions.

If you have unstructured or semi-structured data, your data types can reflect these properties, e.g., by making some information optional or by modeling it with flexible data structures (nested types, lists, maps, etc.). Your user-defined functions should be aware that some information might not always be present and should know how to handle such cases.
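As a rough illustration of that idea, here is a minimal sketch in plain Python. It deliberately does not use the Spark or Flink APIs; the `Event` type, its fields, and the helper functions are all hypothetical names invented for this example. It shows a record type with an optional field and flexible containers, plus a user-defined function that handles the missing case explicitly.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """Hypothetical record type for semi-structured input."""
    user_id: str                                   # always present
    country: Optional[str] = None                  # may be missing
    tags: List[str] = field(default_factory=list)  # variable-length list
    attributes: Dict[str, str] = field(default_factory=dict)  # free-form map


def parse_event(raw: dict) -> Event:
    """Map a loosely structured dict onto the typed record."""
    return Event(
        user_id=str(raw["user_id"]),
        country=raw.get("country"),
        tags=list(raw.get("tags", [])),
        attributes={k: str(v) for k, v in raw.get("attributes", {}).items()},
    )


def country_or_unknown(event: Event) -> str:
    """User-defined function that handles the absent field explicitly."""
    return event.country if event.country is not None else "unknown"


# Two made-up input records; the second one lacks "country" and "tags".
raw_records = [
    {"user_id": 1, "country": "DE", "tags": ["a", "b"]},
    {"user_id": 2, "attributes": {"device": "mobile"}},
]

events = [parse_event(r) for r in raw_records]
countries = [country_or_unknown(e) for e in events]
```

In an actual engine, functions like `parse_event` and `country_or_unknown` would be the kind of logic you pass to a map-style operator; the point here is only that the handling of missing information lives in your own types and functions.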

Handling semi-structured or unstructured data therefore does not come for free: it must be specified explicitly. In fact, both systems focus on user-defined data types and functions, but have recently added APIs to ease the processing of structured data (Flink: Table API, Spark: DataFrames).

Can I process the structured and unstructured data separately and then join the outputs at the end? — , Apr 13 '15 at 09:19
