
What is the option to enable ORC indexing from Spark?

    df
      .write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .format("orc")
      .option("index", "user_id")
      .save(...);

I'm making up .option("index", "user_id"); what would I actually have to put there to index the column "user_id" in ORC?

ForeverConfused

2 Answers


Have you tried .partitionBy("user_id")?

    df
      .write()
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .mode("overwrite")
      .format("orc")
      .partitionBy("user_id")
      .save(...)
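
Worth noting: partitionBy gives directory-level pruning rather than an index inside the ORC files. A minimal sketch of how a later read benefits; the path /tmp/users_orc and the SparkSession setup are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-partition-pruning").getOrCreate()

    // partitionBy("user_id") lays the data out as one directory per value,
    // e.g. /tmp/users_orc/user_id=42/part-*.orc (hypothetical path)
    val users = spark.read.format("orc").load("/tmp/users_orc")

    // Partition pruning: Spark lists and scans only the user_id=42 directory
    users.filter(users("user_id") === 42).show()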
Malik Fassi
  • I think partitionBy will create a new directory of files per user, rather than create an index. But you're the only one who answered, so I'll give you the bounty. – ForeverConfused Nov 09 '17 at 11:26
  • @ForeverConfused I am researching this and will let you know soon. – loneStar Nov 09 '17 at 21:41
  • @Achyuth, have you found any approach to creating an index in an ORC file? I have found nothing so far. It seems to me the only way to leverage ORC's indexes is through Hive. Please correct me if I'm wrong. Thanks! – James Gan Feb 26 '18 at 19:45
  • @JamesGan That's true, although the file is already indexed internally in many ways: column statistics, row-group offsets, etc. The best way to get good performance is to write ORC files that are around 200 to 300 MB each; see the sketch below. – loneStar Feb 26 '18 at 20:08
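
A minimal sketch of that sizing advice, assuming Spark 2.3+ with the native ORC writer; the repartition count, the paths, and the orc.* writer key are assumptions to adapt, not settings from the answers above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-sizing").getOrCreate()

    // Hypothetical input; any DataFrame works here
    val df = spark.read.format("orc").load("/tmp/users_orc_raw")

    // repartition controls how many output files are written; pick the count
    // so each file lands near the 200-300 MB sweet spot (roughly total size / 250 MB)
    df.repartition(8)
      .write
      .format("orc")
      .option("compression", "snappy")
      // assumed ORC writer key, honored by Spark 2.3+'s native ORC data source
      .option("orc.bloom.filter.columns", "user_id")
      .mode("overwrite")
      .save("/tmp/users_orc_sized")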

According to the original blog post on bringing ORC support into Apache Spark, there is a configuration knob to turn on in your Spark context so that queries take advantage of the indexes ORC files already store:

# enable filters in ORC
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

Reference: https://databricks.com/blog/2015/07/16/joint-blog-post-bringing-orc-support-into-apache-spark.html
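
For completeness, a sketch of the same knob via the Spark 2.x SparkSession API, plus a read whose predicate can be pushed into the ORC reader; the path and column name are assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-pushdown").getOrCreate()

    // Same knob as above, set through the SparkSession configuration API
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    // With pushdown enabled, the ORC reader consults the files' built-in
    // min/max statistics to skip whole stripes and row groups
    spark.read.format("orc")
      .load("/tmp/users_orc") // hypothetical path
      .filter("user_id = 42")
      .show()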

louis_guitton