Apache Beam's TextIO can be used to read JSON files in some filesystems, but how can I create a PCollection out of a large JSON (InputStream) resulting from an HTTP response in the Java SDK?
Viewed 1,390 times
1 Answer
0
I don't think there's a generic built-in solution in Beam to do this at the moment; see the list of supported IOs.
I can think of multiple approaches to this; which one works for you will depend on your requirements:
- I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process) and then use Beam's TextIO to read from the GCS bucket;
- depending on the properties of the HTTP source, you can consider:
  - writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately. Further transforms would then parse the JSON or do other processing;
  - implementing your own source, which will be more complicated but will probably work better for very large (unbounded) responses.
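
To make the ParDo option a bit more concrete, here is a minimal sketch of just the splitting step, written against the plain JDK so it runs without Beam on the classpath. It assumes the response body is newline-delimited JSON (one record per line); that is an assumption for illustration only, and a single large JSON array would need a streaming JSON parser instead. The class name ResponseSplitter and the method splitLines are hypothetical, not part of Beam.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Hypothetical helper (not a Beam class): splits an HTTP response body into records. */
final class ResponseSplitter {

  private ResponseSplitter() {}

  /**
   * Reads the body line by line and passes every non-blank line to the emitter.
   * Assumes newline-delimited JSON, so the whole body is never held in memory at once.
   */
  static void splitLines(InputStream body, Consumer<String> emit) {
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (!line.trim().isEmpty()) {
          emit.accept(line); // one output element per JSON record
        }
      }
    } catch (IOException e) {
      // Wrapped so callers do not have to declare a checked exception.
      throw new UncheckedIOException(e);
    }
  }
}
```

Inside a DoFn, the method annotated with @ProcessElement could open the HTTP connection and call something like ResponseSplitter.splitLines(responseStream, out::output), where out is the OutputReceiver<String> parameter; a later transform would then parse each emitted string as JSON. Note that this still reads the whole response on a single worker, which is the trade-off mentioned above.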

Anton
Thanks, I'll definitely take a look at these options – d4nielfr4nco Nov 08 '18 at 14:45