We need to distribute the data as JSON, so we wrote a custom outputter. We also output the same data as CSV for another vendor. On investigation, I found that the JSON outputter uses a single vertex, whereas the CSV outputter uses 5 vertices for the same data, and the JSON output also took much longer. What is the reason for this behavior, and is there a way to change it?
2 Answers
Actually the reason why you only get a single vertex for JSON but 5 vertices for CSV is very simple.
JSON is a hierarchical data format, and thus needs the whole rowset in a single vertex so that the outputter knows what the structure will be. Even if the outputter only emits a JSON array of objects representing the rows, the array's opening and closing brackets are themselves a kind of nesting: the outputter needs to know which row is the first and which is the last.
If you used the sample outputter from the Microsoft U-SQL GitHub page, that outputter was implemented with AtomicFileProcessing turned on for this reason.
CSV is a flat, row-by-row format. Thus you can partition the rowset into subsets and serialize them individually. There is no structure impeding parallelization.
So unless you decide to output one JSON document per row (which makes the combined output, taken as a whole, an invalid JSON document), you cannot parallelize the hierarchical output.
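To make the difference concrete, here is a minimal sketch in plain Python (not U-SQL, and not the actual outputter code). It assumes a made-up rowset split across two hypothetical vertices, serializes each partition independently, and concatenates the results. The helper names are invented for illustration only.

```python
import csv
import io
import json

# Made-up rowset, split as if two vertices each received half of it.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
    {"id": 4, "name": "d"},
]
partitions = [rows[:2], rows[2:]]


def to_csv(part):
    """Serialize one partition as CSV rows (no header)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for r in part:
        writer.writerow([r["id"], r["name"]])
    return buf.getvalue()


def to_json_array(part):
    """Serialize one partition as a complete JSON array."""
    return json.dumps(part)


def to_json_lines(part):
    """Serialize one partition as one JSON document per line."""
    return "".join(json.dumps(r) + "\n" for r in part)


def is_valid_json(text):
    """Return True if the whole text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Concatenate what each "vertex" produced on its own.
csv_out = "".join(to_csv(p) for p in partitions)
array_out = "".join(to_json_array(p) for p in partitions)
lines_out = "".join(to_json_lines(p) for p in partitions)

print(csv_out)                   # still well-formed CSV
print(is_valid_json(array_out))  # two arrays back to back do not parse
print([json.loads(line) for line in lines_out.splitlines()] == rows)
```

The concatenated CSV is still valid, while the two JSON arrays placed back to back are not a single JSON document. The one-document-per-row output is not a single JSON document either, but a consumer can still parse it line by line, which is the trade-off described above.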

I believe you can add a wildcard to the output location, which would turn the final COMBINE node from a single vertex into as many vertices as the stream has at that point.
So, instead of outputting to
filename.json
you would use
filename{*}.json
and the {*} would be replaced by a numeric value representing the vertex.
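If the output is split this way and the consumer still needs one JSON document, the parts have to be merged afterwards. The following is a rough sketch in plain Python, assuming (this is not confirmed here) that each part file contains a complete JSON array of row objects. The file names and the `merge_json_parts` helper are hypothetical.

```python
import glob
import json
import os
import tempfile


def merge_json_parts(pattern):
    """Load every file matching `pattern` and concatenate their arrays.

    Files are read in lexicographic name order, so filename10.json would
    sort before filename2.json; adjust the sort if row order matters.
    """
    merged = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged


# Demo: write two hypothetical part files into a temporary directory.
tmp = tempfile.mkdtemp()
sample_parts = [[{"id": 1}], [{"id": 2}, {"id": 3}]]
for i, part in enumerate(sample_parts):
    with open(os.path.join(tmp, f"filename{i}.json"), "w", encoding="utf-8") as f:
        json.dump(part, f)

merged = merge_json_parts(os.path.join(tmp, "filename*.json"))
print(json.dumps(merged))
```

Note that this merge step is itself single-threaded, so it only helps when writing the parts in parallel saves more time than the merge costs.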