2

Apache Beam's TextIO can be used to read JSON files from some filesystems, but how can I create a PCollection from a large JSON InputStream that comes from an HTTP response, using the Java SDK?

d4nielfr4nco

1 Answer

0

I don't think there's a generic built-in solution in Beam for this at the moment; see the list of supported IOs.

I can think of several approaches. Which one works for you will depend on your requirements:

  • I would probably first try to build another layer (probably outside Beam) that saves the HTTP output to a GCS bucket (possibly splitting it into multiple files along the way) and then use Beam's TextIO to read from that bucket;
  • depending on the properties of the HTTP source, you could consider:
    • writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately. Downstream transforms would then parse the JSON or do whatever other processing is needed;
    • implementing your own source, which is more complicated but will probably work better for very large (unbounded) responses.
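
For the ParDo option, the splitting step could look roughly like the sketch below. It is plain Java so that it runs without Beam on the classpath, and it rests on an assumption that is not in the question: that the response is newline-delimited JSON (one document per line). The class `JsonLinesSplitter` and the method `splitLines` are hypothetical names made up for this illustration, not part of Beam.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not a Beam class.
final class JsonLinesSplitter {

    // Streams the body line by line and hands every non-blank line to the
    // receiver, so the whole response is never held in memory at once.
    // Assumption: the body is newline-delimited JSON, one document per line.
    // Returns the number of elements emitted.
    static long splitLines(InputStream body, Consumer<String> receiver) {
        long count = 0;
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    receiver.accept(trimmed);
                    count++;
                }
            }
        } catch (IOException e) {
            // Rethrow unchecked so the helper can be called from lambdas.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for the HTTP response body; a real caller would pass the
        // InputStream obtained from its HTTP client instead.
        String fakeBody = "{\"id\":1}\n\n{\"id\":2}\n";
        InputStream body =
                new ByteArrayInputStream(fakeBody.getBytes(StandardCharsets.UTF_8));

        List<String> elements = new ArrayList<>();
        long emitted = splitLines(body, elements::add);
        System.out.println(emitted + " elements: " + elements);
    }
}
```

Inside a Beam `DoFn<String, String>` whose input element is the URL, the `@ProcessElement` method would open the HTTP connection and pass `out::output` as the receiver, where `out` is the method's `OutputReceiver<String>` parameter. Note that a single call reads one response sequentially, so this step is not parallelized; a single large JSON array (rather than one document per line) would need a streaming JSON parser instead of `readLine()`.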
Anton