How do I run TPC-DS data generation (dsdgen) and then run queries on that data (dsqgen) in a parallel, distributed mode? I am using Spark on YARN (spark.master yarn) and storing the data on a burst buffer storage system.

vladimir

user9332151
1 Answer
Please check my current exploration at https://github.com/dhiraa/spark-tpcds. That repository contains an application that can be used to generate the data in parallel.
Alternatively, you could check out the project I used as a reference: https://github.com/maropu/spark-tpcds-datagen
In both cases, don't forget to pass the "--partition-tables" option so that the generation actually runs in parallel.
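
To make the last point concrete, here is a rough sketch of how a generation run could be assembled from a small Python driver. Only "--partition-tables" comes from the answer above. The script path ./bin/dsdgen, the "--output-location" and "--scale-factor" flags, and the output path are assumptions based on my reading of the spark-tpcds-datagen README, so check them against the project's own usage text before relying on them. The sketch also assumes spark.master yarn is already set in spark-defaults.conf, as in the question.

```python
import shlex


def build_dsdgen_command(output_location, scale_factor=1, script="./bin/dsdgen"):
    """Assemble an argv list for a TPC-DS data generation run.

    The script path and every flag except --partition-tables are
    assumptions; verify them against the generator's own usage text.
    """
    return [
        script,
        "--output-location", output_location,   # assumed flag: where tables are written
        "--scale-factor", str(scale_factor),    # assumed flag: TPC-DS scale factor
        "--partition-tables",                   # option named in the answer above
    ]


if __name__ == "__main__":
    # Hypothetical output path on the burst buffer mount.
    cmd = build_dsdgen_command("/path/on/burst-buffer/tpcds", scale_factor=100)
    # Print the shell-quoted command so it can be reviewed before running it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch safe to run anywhere. Once the flags are confirmed, the same list can be handed to subprocess.run(cmd, check=True).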

Mageswaran