
I am trying to write streaming JSON messages directly to Parquet using Scala (no Spark). I found only a couple of posts online, including this post; however, the ParquetWriter API shown there is deprecated, and the answer doesn't actually provide an example to follow. I read some other posts too but didn't find any descriptive explanation.

I know I have to use the ParquetFileWriter API, but the lack of documentation is making it difficult for me to use. Can someone please provide an example of it, along with all the constructor parameters and how to create those parameters, especially the schema?

stefanobaghino
Explorer

1 Answer


You may want to try using Eel, a toolkit for manipulating data in the Hadoop ecosystem.

I recommend reading the README to gain a better understanding of the library, but to give you a sense of how it works, what you are trying to do would look somewhat like the following:

import java.io.FileInputStream
import org.apache.hadoop.fs.Path
// Eel components (package names may vary between versions)
import io.eels.component.json.JsonSource
import io.eels.component.parquet.ParquetSink

// Read JSON records from a file and stream them into a Parquet sink
val source = JsonSource(() => new FileInputStream("input.json"))
val sink = ParquetSink(new Path("output.parquet"))
source.toDataStream().to(sink)
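If you'd rather not pull in Eel, you can also write Parquet directly with the parquet-avro module, which wraps the low-level ParquetFileWriter machinery you mentioned and takes care of the schema for you. This is only a sketch, assuming `org.apache.parquet:parquet-avro` and a Hadoop client are on the classpath; the schema fields (`id`, `body`) are placeholders for whatever your JSON messages contain:

```scala
// Sketch: writing GenericRecords with parquet-avro's AvroParquetWriter.
// The record name and fields below are illustrative, not from the question.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// The schema is defined as an Avro JSON schema string
val schema: Schema = new Schema.Parser().parse(
  """{"type": "record", "name": "Message", "fields": [
    |  {"name": "id", "type": "long"},
    |  {"name": "body", "type": "string"}
    |]}""".stripMargin)

val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("output.parquet"))
  .withSchema(schema)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

try {
  // Each incoming JSON message would be mapped to a GenericRecord
  val record = new GenericData.Record(schema)
  record.put("id", 1L)
  record.put("body", "hello")
  writer.write(record)
} finally {
  writer.close() // flushes row groups and writes the Parquet footer
}
```

Since the writer buffers rows into row groups, closing it is what finalizes the file, so make sure your streaming pipeline closes the writer on completion.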
stefanobaghino
  • Aah it works on `Streams`, I am using Akka Streams in my application so it will be easy to integrate. I also found `akka-stream-alpakka-avroparquet` from Lightbend. I am trying both right now. Thanks for your suggestion. – Explorer Oct 10 '18 at 19:15