4

Let's imagine I store one file per day in the following format:

/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv

How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?

Dmitry Petrov

3 Answers

2

Assuming every file follows the same schema, I would suggest that you store the files with the following naming convention:

/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv

You could then create an external table partitioned by dt and pointing to the location /path/to/files/

CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'

If you have several partitions and don't want to write an alter table yourtable add partition ... query for each one, you can simply use the repair command, which will automatically add the partitions.

msck repair table yourtable
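
For reference, adding a single partition by hand would look roughly like this (a minimal sketch using the first date from the example layout); msck repair essentially does the same thing for every dt=... directory it finds under the table location:

ALTER TABLE yourtable ADD PARTITION (dt='2016-07-31')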

You can then simply select data within a date range by specifying the partition range

SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
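
If you want to verify which partitions were registered before running the range query, standard HiveQL lets you list them (a quick check against the table above):

SHOW PARTITIONS yourtable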
cheseaux
  • Thank you, interesting solution... Let's wait for a while; I hope to see a solution with no file name modifications. Btw, could you please clarify: are you suggesting reading all the files and optimizing the subsequent SELECT via the partitioning? – Dmitry Petrov Aug 04 '16 at 09:32
  • An external table will not read all the files; it reads only when you query the table, and it will look only at the partitions (files) you specify in your query. – cheseaux Aug 04 '16 at 09:43
  • I've already asked @waltersu - what is the overhead of reading, let's imagine, 2000 years of logs partitioned by date? – Dmitry Petrov Aug 04 '16 at 09:49
  • What do you mean by "reading"? As I've said, the external table doesn't read anything until you issue a query, and then it reads only the data it needs (the partitions you specified). – cheseaux Aug 04 '16 at 09:51
  • Just to confirm... Does it mean that only 61 files will be opened for reading out of ~2000*12 files? – Dmitry Petrov Aug 04 '16 at 09:54
  • @DmitryPetrov I've edited my answer; there was a small mistake in the file naming convention. The filenames don't matter, it's the parent folder `dt=...` that indicates the partition name. – cheseaux Aug 04 '16 at 10:39
0

Without moving your file:

  1. Design your table schema. In hive shell, create the table (partitioned by date)
  2. Load the files into the table (a minimal sketch follows this list)
  3. Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03')
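
A minimal sketch of steps 1-3, using a placeholder table mylogs with columns (id int, value int) that are not from the question; note that LOAD DATA ... INPATH lets Hive move each file into the right partition directory for you:

CREATE TABLE mylogs (id INT, value INT)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hive moves the file into the table's dt=2016-07-31 directory
LOAD DATA INPATH '/path/to/files/2016/07/31.csv' INTO TABLE mylogs PARTITION (dt='2016-07-31');

SELECT * FROM mylogs WHERE dt BETWEEN '2016-06-04' AND '2016-08-03';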

Moving your file:

  1. Design your table schema. In hive shell, create the table (partitioned by date)
  2. Move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, and similarly for the other days; you'll then have:

     /dbname.db/tableName/dt=2016-07-31/file1.csv
     /dbname.db/tableName/dt=2016-08-01/file1.csv
     /dbname.db/tableName/dt=2016-08-02/file1.csv

  3. Load each partition with alter table tableName add partition (dt='2016-07-31') (several partitions can be added in one statement, as sketched below)
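
A minimal sketch of registering all three example dates in a single statement (tableName is the placeholder from step 2):

alter table tableName add
  partition (dt='2016-07-31')
  partition (dt='2016-08-01')
  partition (dt='2016-08-02');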

See the Hive DDL documentation on Add Partitions.

waltersu
  • Is it possible to avoid loading all files? Let's imagine I have logs for 2000 years :) and query just a couple of weeks or months of them. What is the overhead of reading 2000 years of logs partitioned by date? – Dmitry Petrov Aug 04 '16 at 09:44
0

In spark-shell, you can read the data as a Hive table. Suppose the data is laid out as:

/path/to/data/user_info/dt=2016-07-31/0000-0

1. Create the SQL statement

val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"

2. Run it

spark.sql(sql)

3. Load data by adding a partition

val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")

4. Now you can select data from the table

val df = spark.sql("select * from user_info")
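
The question's date range can then be applied as a partition filter in the query string passed to spark.sql; a sketch with the dates taken from the question:

select * from user_info where dt between '2016-06-04' and '2016-08-03'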