
I'm trying to write output from a Scalding flow as JSON and read it in Spark. This works fine, except when the JSON contains strings with newlines. The output is one JSON object per line, and a newline inside a string value fragments that object across two physical lines. As a result, when I read the lines into Spark, I can't deserialize some of them. Is there a standard way to deal with this?
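For context, a strictly compliant JSON serializer escapes an embedded newline as the two-character sequence `\n`, which keeps each object on a single physical line. A minimal sketch of that behaviour, assuming json4s is on the classpath (the question doesn't say which serializer the Scalding flow uses):

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods.{compact, render}

object EscapeDemo extends App {
  // A hypothetical record whose "body" value contains a literal newline.
  val record = ("id" -> 1) ~ ("body" -> "line one\nline two")

  // compact(render(...)) emits the newline as the two-character escape \n,
  // so the whole object stays on one physical line:
  println(compact(render(record)))
  // {"id":1,"body":"line one\nline two"}
}
```

If the writing side produces output like this, plain line-by-line reading in Spark should work; fragmentation suggests the newlines are being written unescaped.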

ashic
  • If the files are small you can use `wholeTextFiles`. Another option, which could be useful if the data has a predictable structure, is a custom record delimiter (see the sketch after these comments). Finally, you can always create a custom input format. – zero323 Jan 04 '16 at 17:07
  • Unfortunately, each file is a few gigs. I'm looking to do something custom, but was wondering if there's some sort of industry-standard format. – ashic Jan 04 '16 at 17:14
  • Nothing that I'm aware of. The problem with JSON is that it's not trivial to determine document boundaries in the general case. – zero323 Jan 04 '16 at 17:45
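A minimal sketch of the custom record delimiter idea from the comments, assuming the Scalding flow can be made to terminate each record with a sentinel such as `"\u0001\n"` (the sentinel choice and the path below are hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CustomDelimiterRead extends App {
  val sc = new SparkContext(new SparkConf().setAppName("custom-delimiter"))

  // Tell the new-API TextInputFormat to split records on the sentinel
  // instead of on bare "\n". The writing side must emit the same
  // sentinel after each JSON document.
  val conf = new Configuration(sc.hadoopConfiguration)
  conf.set("textinputformat.record.delimiter", "\u0001\n")

  val records = sc
    .newAPIHadoopFile(
      "hdfs:///path/to/json-output",           // hypothetical path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      conf
    )
    .map { case (_, text) => text.toString }   // copy out of the reused Text buffer

  // Each element of `records` is now one complete JSON document,
  // embedded newlines and all, ready to deserialize.
  records.take(5).foreach(println)
  sc.stop()
}
```

The `.map` immediately after `newAPIHadoopFile` matters: Hadoop record readers reuse the same `Text` object, so converting to `String` before any caching or shuffling avoids seeing the last record repeated.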

0 Answers