How do data processing engines like Spark and Apache Flink integrate structured, semi-structured, and unstructured data, and how does this affect computation?

  • "too broad": There are either too many possible answers, or good answers would be too long for this format. Please add details to narrow the answer set or to isolate an issue that can be answered in a few paragraphs. – maasg Apr 12 '15 at 19:20

1 Answer

General-purpose data processing engines like Flink or Spark let you define your own data types and functions.

If you have unstructured or semi-structured data, your data types can reflect these properties, e.g., by making some information optional or by modeling it with flexible data structures (nested types, lists, maps, etc.). Your user-defined functions should be aware that some information might not always be present and know how to handle such cases.
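As a rough illustration of that idea, here is a minimal sketch in plain Python (it does not use the Spark or Flink APIs). The `Event` type, its fields, and the helper functions `parse_event` and `country_or_unknown` are hypothetical names invented for this example. In an actual job, functions like these would be handed to the engine's map-style operators rather than applied in a list comprehension.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """Hypothetical record type for semi-structured input."""
    user_id: str                        # assumed to always be present
    country: Optional[str] = None       # optional: may be missing in the input
    tags: List[str] = field(default_factory=list)             # flexible: list
    attributes: Dict[str, str] = field(default_factory=dict)  # flexible: map


def parse_event(raw: dict) -> Event:
    """Build an Event from a raw record, tolerating absent keys."""
    return Event(
        user_id=raw["user_id"],
        country=raw.get("country"),
        tags=list(raw.get("tags", [])),
        attributes={k: str(v) for k, v in raw.get("attributes", {}).items()},
    )


def country_or_unknown(event: Event) -> str:
    """User-defined function that handles the missing-value case explicitly."""
    return event.country if event.country is not None else "unknown"


# Two made-up records with different shapes.
raw_records = [
    {"user_id": "u1", "country": "DE", "tags": ["new"]},
    {"user_id": "u2", "attributes": {"device": "mobile"}},
]

events = [parse_event(r) for r in raw_records]
countries = [country_or_unknown(e) for e in events]
print(countries)
```

The point of the sketch is only that the type declares which parts may be absent, and the function decides what to do when they are. The engine itself does not make that decision for you.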

So handling semi-structured or unstructured data does not come for free; it must be specified explicitly. In fact, both systems focus on user-defined data types and functions, but have recently added APIs to ease the processing of structured data (Flink: Table API; Spark: DataFrames).

Fabian Hueske