
How do I run TPC-DS data generation (dsdgen) and then run queries on that data (dsqgen) in a parallel, distributed mode? I am using Spark on YARN (spark.master yarn) and storing the data on a burst buffer storage system.

vladimir
user9332151

1 Answer


Please check my current exploration at https://github.com/dhiraa/spark-tpcds. That repository contains an application that can be used to generate the data in parallel.

Alternatively, you could check out the project I used as a reference: https://github.com/maropu/spark-tpcds-datagen

In both cases, don't forget to pass the "--partition-tables" option to make use of the parallel generation.
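As a rough illustration of what such an invocation might look like, here is a minimal Python sketch that only assembles and prints the command line; it does not run anything. Only "--partition-tables" comes from the advice above. The launcher path `./bin/dsdgen`, the `--output-location` and `--scale-factor` flags, and the output directory are assumptions on my part, so please verify them against the README of whichever repository you use.

```python
import shlex


def build_dsdgen_command(output_dir, scale_factor=1, launcher="./bin/dsdgen"):
    """Assemble a data-generation command line as a list of arguments.

    Assumptions (not confirmed by this answer): the launcher path and the
    --output-location / --scale-factor flag names. Only --partition-tables
    is taken from the answer itself.
    """
    return [
        launcher,
        "--output-location", output_dir,      # assumed flag name
        "--scale-factor", str(scale_factor),  # assumed flag name
        "--partition-tables",                 # option recommended above
    ]


# Hypothetical output directory on the burst buffer mount.
cmd = build_dsdgen_command("/path/to/burst-buffer/tpcds", scale_factor=100)
print(shlex.join(cmd))
```

Printing the command first makes it easy to review the flags before launching a long-running job on the cluster.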

Mageswaran