
I'm implementing a Spark (1.5.2) SQL RelationProvider for a custom data source (properties files).

Can someone please explain how the automatic schema inference algorithm should be implemented?

Alexander.Furer

1 Answer


In general, you need to create a StructType that represents your schema. A StructType contains an Array[StructField], where each element of the array corresponds to a column in your schema. Each StructField can have any supported DataType -- including another StructType for nested schemas.

Creating a schema can be as simple as:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("col1", StringType),  // nullable defaults to true
  StructField("col2", LongType)
))
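
For a nested column, the DataType of a StructField can itself be a StructType. The field names here are purely illustrative:

// Illustrative nested schema: the "address" column is itself a struct
val nestedSchema = StructType(Array(
  StructField("name", StringType),
  StructField("address", StructType(Array(
    StructField("city", StringType),
    StructField("zip", StringType)
  )))
))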

If you want to generate a schema from a complex dataset -- one that includes nested StructTypes -- then you most likely need to create a recursive function. A good example of what such a function looks like can be found in the spark-avro integration library. The function toSqlType takes an Avro schema and converts it into a Spark StructType.
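
As a rough sketch of the idea (this is not the actual spark-avro code), a recursive inference function might walk a sampled record -- modeled here as a Map[String, Any], which is purely an assumption for illustration -- and build a StructType, recursing whenever a value is itself a map:

import org.apache.spark.sql.types._

// Hypothetical sketch: infer a StructType from one sampled record,
// recursing into nested maps to produce nested StructTypes.
def inferSchema(record: Map[String, Any]): StructType =
  StructType(record.toArray.sortBy(_._1).map { case (name, value) =>
    StructField(name, inferType(value))
  })

def inferType(value: Any): DataType = value match {
  case m: Map[_, _]     => inferSchema(m.asInstanceOf[Map[String, Any]])
  case _: Int | _: Long => LongType
  case _: Double        => DoubleType
  case _: Boolean       => BooleanType
  case _                => StringType  // fall back to string
}

To infer over more than one sampled record, you would run this per record and merge the resulting schemas field by field, widening types on conflict (e.g., LongType merged with DoubleType becomes DoubleType).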

David Griffin
• Thanks @david-griffin, but I'm after **automatic schema discovery/inference**. How should I sample the data and merge the schemas? – Alexander.Furer Jun 01 '16 at 05:37