Questions tagged [orc]

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

The Optimized Row Columnar (ORC) file format is based on Hive’s RCFile, which was the standard format for storing tabular data in Hadoop for several years. ORC was introduced in Hive 0.11.

470 questions
95
votes
6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet is better in…
Rahul
  • 2,354
  • 3
  • 21
  • 30
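
A minimal PySpark sketch of this kind of comparison, writing the same DataFrame in all three layouts (the paths and the toy data are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-comparison").getOrCreate()
    df = spark.range(1_000_000)  # toy data standing in for the real table

    # Parquet with its default codec
    df.write.mode("overwrite").parquet("/tmp/bench/parquet")

    # ORC with its default codec
    df.write.mode("overwrite").orc("/tmp/bench/orc_default")

    # ORC with Snappy requested explicitly
    df.write.mode("overwrite").option("compression", "snappy").orc("/tmp/bench/orc_snappy")

Comparing the resulting directory sizes and query times over these three outputs is the usual way to settle the question for a specific workload.
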
45
votes
5 answers

Aggregating multiple columns with custom function in Spark

I was wondering if there is some way to specify a custom aggregation function for spark dataframes over multiple columns. I have a table like this of the type (name, item, price): john | tomato | 1.99 john | carrot | 0.45 bill | apple | 0.99 john…
anthonybell
  • 5,790
  • 7
  • 42
  • 60
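
One way to do this in Spark 3.x is applyInPandas, which hands every column of each group to an arbitrary Python function; a hedged sketch (requires pandas and pyarrow):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("john", "tomato", 1.99), ("john", "carrot", 0.45), ("bill", "apple", 0.99)],
        ["name", "item", "price"],
    )

    # Custom aggregation over several columns at once: the whole group
    # arrives as a pandas DataFrame and we return one summary row.
    def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame({
            "name": [pdf["name"].iloc[0]],
            "items": [",".join(sorted(pdf["item"]))],
            "total": [pdf["price"].sum()],
        })

    result = df.groupBy("name").applyInPandas(
        summarize, schema="name string, items string, total double"
    )
    result.show()
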
14
votes
2 answers

Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

There is an issue when executing a SHOW CREATE TABLE and then executing the resulting CREATE TABLE statement if the table is ORC. Using SHOW CREATE TABLE, you get this: STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT …
Jason
  • 173
  • 1
  • 1
  • 8
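
The short version is that STORED AS ORC is shorthand that sets the SerDe together with the InputFormat/OutputFormat pair, which is why a dumped DDL listing only the latter two can behave differently. A sketch of the two forms executed through Spark's Hive support (table names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Shorthand: sets SerDe, InputFormat and OutputFormat in one clause.
    spark.sql("CREATE TABLE t_short (id INT) STORED AS ORC")

    # Expanded form: only equivalent when the ROW FORMAT SERDE clause
    # accompanies the INPUTFORMAT/OUTPUTFORMAT pair.
    spark.sql("""
        CREATE TABLE t_long (id INT)
        ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
    """)
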
13
votes
6 answers

How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to…
Della
  • 1,264
  • 2
  • 15
  • 32
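
Assuming a reasonably recent stack, no Hadoop or Spark is needed: pandas 1.0+ reads ORC directly through pyarrow. A minimal sketch with a placeholder filename:

    import pandas as pd

    # Requires pyarrow (pip install pyarrow); the path is hypothetical.
    df = pd.read_orc("data.orc")
    print(df.head())
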
11
votes
1 answer

Spark: Save Dataframe in ORC format

In the previous version, we used to have a 'saveAsOrcFile()' method on RDD. This is now gone! How do I save data in DataFrame in ORC File format? def main(args: Array[String]) { println("Creating Orc File!") val sparkConf = new…
DilTeam
  • 2,551
  • 9
  • 42
  • 69
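
The RDD-era saveAsOrcFile() was superseded by the DataFrameWriter API; a PySpark sketch (the Scala writer API is analogous, and the output path is a placeholder):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-writer").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    df.write.mode("overwrite").orc("/tmp/out_orc")
    # equivalent long form:
    # df.write.format("orc").mode("overwrite").save("/tmp/out_orc")
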
9
votes
3 answers

How do I Combine or Merge Small ORC files into Larger ORC file?

Most questions/answers on SO and the web discuss using Hive to combine a bunch of small ORC files into a larger one, however, my ORC files are log files which are separated by day and I need to keep them separate. I only want to "roll-up" the ORC…
Chris C
  • 1,012
  • 2
  • 12
  • 19
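
One hedged way to roll up a single day without Hive is to read that day's directory in Spark and rewrite it with fewer output tasks (the directory layout below is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    day_path = "/logs/2023-01-01"  # one directory of small ORC files per day

    # coalesce(1) funnels the day's rows into a single output file while
    # leaving other days untouched.
    spark.read.orc(day_path).coalesce(1) \
        .write.mode("overwrite").orc(day_path + "_merged")
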
8
votes
3 answers

Convert Pandas dataframe from/to ORC file

Is it possible to convert a Pandas dataframe from/to an ORC file? I can write the df to a Parquet file, but the library doesn't seem to have ORC support. Is there an available solution in Python? If not, what could be the best strategy? One…
alcor
  • 515
  • 1
  • 8
  • 21
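
pyarrow's orc module covers both directions (and pandas 1.5+ adds DataFrame.to_orc); a round-trip sketch with a placeholder filename:

    import pandas as pd
    import pyarrow as pa
    from pyarrow import orc  # needs a pyarrow build with ORC support

    df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})

    # pandas -> ORC
    orc.write_table(pa.Table.from_pandas(df), "roundtrip.orc")

    # ORC -> pandas
    df2 = orc.ORCFile("roundtrip.orc").read().to_pandas()
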
7
votes
2 answers

How can I convert local ORC files to CSV?

I have an ORC file on my local machine and I need any reasonable format from it (e.g. CSV, JSON, YAML, ...). How can I convert ORC to CSV?
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
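
Assuming the file fits in memory, pandas can act as the converter (pyarrow must be installed; filenames are placeholders):

    import pandas as pd

    df = pd.read_orc("input.orc")
    df.to_csv("output.csv", index=False)
    # or: df.to_json("output.json", orient="records")
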
7
votes
1 answer

Spark Structured Streaming Writestream to Hive ORC Partitioned External Table

I am trying to use Spark Structured Streaming - writeStream API to write to an External Partitioned Hive table. CREATE EXTERNAL TABLE `XX`( `a` string, `b` string, `b` string, `happened` timestamp, `processed` timestamp, `d` string, `e` string, `f`…
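
A hedged sketch of the file-sink route: writeStream can emit ORC into the external table's partitioned location, after which Hive discovers new partitions via MSCK REPAIR TABLE or ADD PARTITION (the source, path, and partition column below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    stream_df = spark.readStream.format("rate").load()  # stand-in source

    query = (
        stream_df
        .withColumn("part", (stream_df["value"] % 10).cast("string"))
        .writeStream
        .format("orc")
        .partitionBy("part")
        .option("path", "/warehouse/xx")            # external table location
        .option("checkpointLocation", "/tmp/ckpt")  # required for file sinks
        .start()
    )
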
7
votes
0 answers

Disable creation of .orc.crc file

I am working with the Apache orc-core Java API. I have noticed a couple of things and was wondering if there are options to control them. First, it does not overwrite files: the call to OrcFile.createWriter fails if the specified file already exists. Is there…
Sodved
  • 8,428
  • 2
  • 31
  • 43
7
votes
1 answer

Merge delta data into an external table using hive's merge statement

I have an external table mapped in Hive (v2.3.2 on EMR-5.11.0) that I need to update with new data around once a week. The merge consists of a conditional upsert statement. The table's location is in s3, and the data is always there (created once,…
Meori Lehr
  • 193
  • 10
7
votes
2 answers

How do I use Spark ORC indexes?

What is the option to enable ORC indexing from Spark? df.write() .option("mode", "DROPMALFORMED") .option("compression", "snappy") .mode("overwrite") .format("orc") …
ForeverConfused
  • 1,607
  • 3
  • 26
  • 41
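
ORC writes min/max row-group indexes by default; what you can opt into from the writer are bloom filters, and on the read side you want predicate pushdown enabled. A sketch (the path and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(i, i % 100) for i in range(10_000)], ["id", "bucket"])

    # orc.* options are forwarded to the ORC writer.
    (
        df.write
        .option("compression", "snappy")
        .option("orc.bloom.filter.columns", "id")
        .mode("overwrite")
        .orc("/tmp/indexed_orc")
    )

    # Let the reader use the indexes to skip row groups.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    spark.read.orc("/tmp/indexed_orc").where("id = 42").show()
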
6
votes
3 answers

How to keep partition columns when reading in ORC files in Spark

When reading in an ORC file in Spark, if you specify the partition column in the path, that column will not be included in the dataset. For example, if we have val dfWithColumn = spark.read.orc("/some/path") val dfWithoutColumn =…
alexgbelov
  • 3,032
  • 4
  • 28
  • 42
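
The usual fix is the basePath option, which tells Spark where partition discovery starts so the directory-encoded column survives as a real column (paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Reading one partition directly drops the 'date' column...
    df_without = spark.read.orc("/some/path/date=2023-01-01")

    # ...while anchoring discovery at the table root keeps it.
    df_with = (
        spark.read
        .option("basePath", "/some/path")
        .orc("/some/path/date=2023-01-01")
    )
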
6
votes
4 answers

How to create a Schema file in Spark

I am trying to read a Schema file (which is a text file) and apply it to my CSV file without a header. Since I already have a schema file I don't want to use InferSchema option which is an overhead. My input schema file looks like below, "num…
Gladiator
  • 354
  • 3
  • 19
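
A hedged sketch, assuming the schema file holds one "name type" pair per line (the exact file format in the question is truncated, so the parsing below is illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StringType, IntegerType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Map the text labels in the schema file to Spark types.
    type_map = {"string": StringType(), "int": IntegerType()}

    with open("schema.txt") as f:  # hypothetical schema file
        fields = [
            StructField(name, type_map[dtype], True)
            for name, dtype in (line.split() for line in f if line.strip())
        ]

    df = spark.read.schema(StructType(fields)).csv("data.csv", header=False)

spark.read.schema also accepts a DDL string (e.g. "num INT, name STRING"), which avoids building the StructType by hand.
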
6
votes
1 answer

Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

I have searched through all the documentation and still haven't found why there is a prefix and what c000 is in the file naming convention below: file:/Users/stephen/p/spark/f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
Prabhakar Reddy
  • 4,628
  • 18
  • 36