-1

Hi,

I am looking for an example of schema validation for data.

Is it possible to do this using Cascading or Scalding?

For example

Name:String , Age:Int

We say our data should conform to the above schema,

then we validate that each record really is of those types.

Thanks

user2230605

2 Answers

0

In Scala, you can declare your type:

case class Record(name: String, age: Int)

Then, supposing your data is in CSV, use the Scalding type-safe API:

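import com.twitter.scalding._

// BadSchemaException is not part of Scalding; define your own, for example:
class BadSchemaException(line: String) extends RuntimeException(s"Bad record: $line")

val lines: TypedPipe[String] = TypedPipe.from(TextLine("data.csv"))
val records: TypedPipe[Record] = lines.map { line =>
  line.split(",").toList match {
    // age.toInt throws NumberFormatException if age is not an integer
    case List(name, age) => Record(name, age.toInt)
    case _ => throw new BadSchemaException(line)
  }
}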

If you want to trap bad records without failing your job, see the addTrap method in the RichPipe class.
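The pattern match above only checks the number of fields; a non-numeric age fails later, inside toInt. Here is a minimal sketch of a parser that reports both kinds of failure as values instead of exceptions. It is plain Scala with no Scalding dependency, and parseRecord and SchemaCheck are hypothetical names, not library API. The trimming and the non-empty name rule are assumptions about what "valid" means for your data.

```scala
import scala.util.Try

object SchemaCheck {
  case class Record(name: String, age: Int)

  // Hypothetical helper: Right(record) for a line that matches the
  // Name:String, Age:Int schema, Left(reason) otherwise.
  def parseRecord(line: String): Either[String, Record] =
    // limit -1 keeps trailing empty fields, so "alice," still has 2 fields
    line.split(",", -1).map(_.trim).toList match {
      case List(name, age) if name.nonEmpty =>
        Try(age.toInt).toOption
          .map(a => Record(name, a))
          .toRight(s"age is not an Int: '$age'")
      case _ =>
        Left(s"expected 2 fields with a non-empty name: '$line'")
    }

  def main(args: Array[String]): Unit = {
    val input = List("alice,30", "bob,abc", "carol")
    val parsed = input.map(parseRecord)
    // Split the results into valid records and rejection reasons
    val good = parsed.collect { case Right(r) => r }
    val bad = parsed.collect { case Left(reason) => reason }
    println(good)
    println(bad)
  }
}
```

Since parseRecord is an ordinary function, it should be usable inside lines.map in the TypedPipe snippet above, which gives you a pipe of Either values that you can split into good and bad records.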

Gianmario Spacagna
0

Traps are not intended for application flow control. A better approach is to set a boolean field based on your application's validation rules and then filter out the bad records. I recommend reading:

http://docs.cascading.org/cascading/2.6/userguide/html/ch08s03.html
http://docs.cascading.org/cascading/2.6/userguide/html/ch11s09.html#handling-bad-data
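To illustrate the flag-and-filter idea, here is a rough sketch on plain Scala collections rather than a Cascading flow. Checked, isValid and FlagAndFilter are hypothetical names, and the rule shown (two comma-separated fields, a non-empty name, an age that parses as an Int) is only an assumption about the Name:String, Age:Int schema from the question.

```scala
import scala.util.Try

object FlagAndFilter {
  // A raw line together with the boolean flag set by the validation rule
  final case class Checked(line: String, valid: Boolean)

  // Hypothetical validation rule for the Name:String, Age:Int schema
  def isValid(line: String): Boolean = {
    val fields = line.split(",", -1).map(_.trim)
    fields.length == 2 &&
      fields(0).nonEmpty &&
      Try(fields(1).toInt).isSuccess
  }

  def main(args: Array[String]): Unit = {
    val lines = List("alice,30", "bob,abc", "carol")
    // Step 1: set the flag on every record
    val checked = lines.map(l => Checked(l, isValid(l)))
    // Step 2: filter on the flag; bad records are kept for inspection
    val good = checked.filter(_.valid).map(_.line)
    val bad = checked.filterNot(_.valid).map(_.line)
    println(s"good: $good")
    println(s"bad: $bad")
  }
}
```

Scalding's TypedPipe also has map and filter, so the same two steps should carry over to a pipe, with the bad records written to their own sink instead of a trap.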

technotring