Questions tagged [orc]

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

The Optimized Row Columnar (ORC) file format is based on Hive’s RCFile, which was the standard format for storing tabular data in Hadoop for several years. ORC was introduced in Hive 0.11.

470 questions
95
votes
6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet is better in…
Rahul
  • 2,354
  • 3
  • 21
  • 30
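
A minimal PySpark sketch of this kind of comparison, writing the same DataFrame in all three layouts (the paths and the toy data are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()
    df = spark.range(1_000_000)  # toy data standing in for the real table

    # Parquet with its default codec
    df.write.mode("overwrite").parquet("/tmp/bench/parquet")

    # ORC with its default codec
    df.write.mode("overwrite").orc("/tmp/bench/orc_default")

    # ORC with Snappy requested explicitly
    df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/bench/orc_snappy")

Comparing the resulting directory sizes and query times over these three outputs is the usual way to settle the question for a specific workload.
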
45
votes
5 answers

Aggregating multiple columns with custom function in Spark

I was wondering if there is some way to specify a custom aggregation function for spark dataframes over multiple columns. I have a table like this of the type (name, item, price): john | tomato | 1.99 john | carrot | 0.45 bill | apple | 0.99 john…
anthonybell
  • 5,790
  • 7
  • 42
  • 60
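
One way to do this in Spark 3.x is applyInPandas, which hands every column of each group to an arbitrary Python function; a hedged sketch (requires pandas and pyarrow):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("john", "tomato", 1.99), ("john", "carrot", 0.45), ("bill", "apple", 0.99)],
        ["name", "item", "price"],
    )

    # Custom aggregation over several columns at once: the whole group
    # arrives as a pandas DataFrame and we return one summary row.
    def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame({
            "name": [pdf["name"].iloc[0]],
            "items": [",".join(sorted(pdf["item"]))],
            "total": [pdf["price"].sum()],
        })

    result = df.groupBy("name").applyInPandas(
        summarize, schema="name string, items string, total double"
    )
    result.show()
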
14
votes
2 answers

Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

There is an issue when executing a SHOW CREATE TABLE and then executing the resulting CREATE TABLE statement if the table is ORC. Using SHOW CREATE TABLE, you get this: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT …
Jason
  • 173
  • 1
  • 1
  • 8
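
The short version is that STORED AS ORC is shorthand that sets the SerDe together with the InputFormat/OutputFormat pair, which is why a dumped DDL listing only the latter two can behave differently. A sketch of the two forms executed through Spark's Hive support (table names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Shorthand: sets SerDe, InputFormat and OutputFormat in one clause.
    spark.sql("CREATE TABLE t_short (id INT) STORED AS ORC")

    # Expanded form: only equivalent when the ROW FORMAT SERDE clause
    # accompanies the INPUTFORMAT/OUTPUTFORMAT pair.
    spark.sql("""
        CREATE TABLE t_long (id INT)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    """)
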
13
votes
6 answers

How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to…
Della
  • 1,264
  • 2
  • 15
  • 32
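
Assuming a reasonably recent stack, no Hadoop or Spark is needed: pandas 1.0+ reads ORC directly through pyarrow. A minimal sketch with a placeholder filename:

    import pandas as pd

    # Requires pyarrow (pip install pyarrow); the path is hypothetical.
    df = pd.read_orc("data.orc")
    print(df.head())
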
11
votes
1 answer

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save data in DataFrame in ORC File format? def main(args: Array[String]) { println("Creating Orc File!") val sparkConf = new…
DilTeam
  • 2,551
  • 9
  • 42
  • 69
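
The RDD-era saveAsOrcFile() was superseded by the DataFrameWriter API; a PySpark sketch (the Scala writer API is analogous, and the output path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-writer").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    df.write.mode("overwrite").orc("/tmp/out_orc")
    # equivalent long form:
    # df.write.format("orc").mode("overwrite").save("/tmp/out_orc")
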
9
votes
3 answers

How do I Combine or Merge Small ORC files into Larger ORC file?

Most questions/answers on SO and the web discuss using Hive to combine a bunch of small ORC files into a larger one, however, my ORC files are log files which are separated by day and I need to keep them separate. I only want to "roll-up" the ORC…
Chris C
  • 1,012
  • 2
  • 12
  • 19
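
One hedged way to roll up a single day without Hive is to read that day's directory in Spark and rewrite it with fewer output tasks (the directory layout below is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    day_path = "/logs/2023-01-01"  # one directory of small ORC files per day

    # coalesce(1) funnels the day's rows into a single output file while
    # leaving other days untouched.
    spark.read.orc(day_path).coalesce(1) \
        .write.mode("overwrite").orc(day_path + "_merged")
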
8
votes
3 answers

Convert Pandas dataframe from/to ORC file

Is it possible to convert a Pandas dataframe from/to an ORC file? I can write the df to a Parquet file, but the library doesn't seem to have ORC support. Is there an available solution in Python? If not, what could be the best strategy? One…
alcor
  • 515
  • 1
  • 8
  • 21
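
pyarrow's orc module covers both directions (and pandas 1.5+ adds DataFrame.to_orc); a round-trip sketch with a placeholder filename:

    import pandas as pd
    import pyarrow as pa
    from pyarrow import orc  # needs a pyarrow build with ORC support

    df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

    # pandas -> ORC
    orc.write_table(pa.Table.from_pandas(df), "roundtrip.orc")

    # ORC -> pandas
    df2 = orc.ORCFile("roundtrip.orc").read().to_pandas()
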
7
votes
2 answers

How can I convert local ORC files to CSV?

I have an ORC file on my local machine and I need any reasonable format from it (e.g. CSV, JSON, YAML, ...). How can I convert ORC to CSV?
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
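
Assuming the file fits in memory, pandas can act as the converter (pyarrow must be installed; filenames are placeholders):

    import pandas as pd

    df = pd.read_orc("input.orc")
    df.to_csv("output.csv", index=False)
    # or: df.to_json("output.json", orient="records")
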
7
votes
1 answer

Spark Structured Streaming Writestream to Hive ORC Partitioned External Table

I am trying to use Spark Structured Streaming - writeStream API to write to an External Partitioned Hive table. CREATE EXTERNAL TABLE `XX`( `a` string, `b` string, `b` string, `happened` timestamp, `processed` timestamp, `d` string, `e` string, `f`…
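
A hedged sketch of the file-sink route: writeStream can emit ORC into the external table's partitioned location, after which Hive discovers new partitions via MSCK REPAIR TABLE or ADD PARTITION (the source, path, and partition column below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    stream_df = spark.readStream.format("rate").load()  # stand-in source

    query = (
        stream_df
        .withColumn("part", (stream_df["value"] % 10).cast("string"))
        .writeStream
        .format("orc")
        .partitionBy("part")
        .option("path", "/warehouse/xx")            # external table location
        .option("checkpointLocation", "/tmp/ckpt")  # required for file sinks
        .start()
    )
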
7
votes
0 answers

Disable creation of .orc.crc file

I am working with the Apache orc-core Java API. I have noticed a couple of things and was wondering if there are options to control them. First, it does not overwrite files: the call to OrcFile.createWriter fails if the specified file already exists. Is there…
Sodved
  • 8,428
  • 2
  • 31
  • 43
7
votes
1 answer

Merge delta data into an external table using hive's merge statement

I have an external table mapped in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data around once a week. The merge consists of a conditional upsert statement. The table's location is in s3, and the data is always there (created once,…
Meori Lehr
  • 193
  • 10
7
votes
2 answers

How do I use Spark ORC indexes?

What is the option to enable ORC indexing from Spark? df.write() .option("mode", "DROPMALFORMED") .option("compression", "snappy") .mode("overwrite") .format("orc") …
ForeverConfused
  • 1,607
  • 3
  • 26
  • 41
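
ORC writes min/max row-group indexes by default; what you can opt into from the writer are bloom filters, and on the read side you want predicate pushdown enabled. A sketch (the path and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i, i % 100) for i in range(10_000)], ["id", "bucket"])

    # orc.* options are forwarded to the ORC writer.
    (
        df.write
        .option("compression", "snappy")
        .option("orc.bloom.filter.columns", "id")
        .mode("overwrite")
        .orc("/tmp/indexed_orc")
    )

    # Let the reader use the indexes to skip row groups.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    spark.read.orc("/tmp/indexed_orc").where("id = 42").show()
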
6
votes
3 answers

How to keep partition columns when reading in ORC files in Spark

When reading in an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the dataset. For example, if we have val dfWithColumn = spark.read.orc("/some/path") val dfWithoutColumn =…
alexgbelov
  • 3,032
  • 4
  • 28
  • 42
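
The usual fix is the basePath option, which tells Spark where partition discovery starts so the directory-encoded column survives as a real column (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reading one partition directly drops the 'date' column...
    df_without = spark.read.orc("/some/path/date=2023-01-01")

    # ...while anchoring discovery at the table root keeps it.
    df_with = (
        spark.read
        .option("basePath", "/some/path")
        .orc("/some/path/date=2023-01-01")
    )
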
6
votes
4 answers

How to create a Schema file in Spark

I am trying to read a Schema file (which is a text file) and apply it to my CSV file without a header. Since I already have a schema file I don't want to use InferSchema option which is an overhead. My input schema file looks like below, "num…
Gladiator
  • 354
  • 3
  • 19
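
A hedged sketch, assuming the schema file holds one "name type" pair per line (the exact file format in the question is truncated, so the parsing below is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StringType, IntegerType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Map the text labels in the schema file to Spark types.
    type_map = {"string": StringType(), "int": IntegerType()}

    with open("schema.txt") as f:  # hypothetical schema file
        fields = [
            StructField(name, type_map[dtype], True)
            for name, dtype in (line.split() for line in f if line.strip())
        ]

    df = spark.read.schema(StructType(fields)).csv("data.csv", header=False)

spark.read.schema also accepts a DDL string (e.g. "num INT, name STRING"), which avoids building the StructType by hand.
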
6
votes
1 answer

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

I have searched through all the documentation and still haven't found why there is a prefix and what c000 is in the file naming convention below: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
Prabhakar Reddy
  • 4,628
  • 18
  • 36