0

I am running Hive 0.14 on a HDP2 cluster. My dataset was built using the kite sdk and registered to Hive using external tables.

See my table layout below:

hive> describe hivetweets;
OK
created_at              bigint                  from deserializer
id                      bigint                  from deserializer
in_reply_to_user_id     bigint                  from deserializer
in_reply_to_status_id   bigint                  from deserializer
lang                    string                  from deserializer
text                    string                  from deserializer
retweet_count           int                     from deserializer
year                    int                     Partition column derived from 'created_at' column, generated by Kite.
month                   int                     Partition column derived from 'created_at' column, generated by Kite.
day                     int                     Partition column derived from 'created_at' column, generated by Kite.
hour                    int                     Partition column derived from 'created_at' column, generated by Kite.

# Partition Information
# col_name              data_type               comment

year                    int                     Partition column derived from 'created_at' column, generated by Kite.
month                   int                     Partition column derived from 'created_at' column, generated by Kite.
day                     int                     Partition column derived from 'created_at' column, generated by Kite.
hour                    int                     Partition column derived from 'created_at' column, generated by Kite.
Time taken: 0.15 seconds, Fetched: 19 row(s)

My initial test query to this setup was to get just a single row of the dataset (I removed the actual output in the example):

hive> select * from hivetweets limit 1;
OK
Time taken: 103.726 seconds, Fetched: 1 row(s)

104 seconds to run this query is imho way to long.

This probably didn't run distributed so I tried to test it with more data:

hive> select count(*) from hivetweets limit 100000;
Query ID = root_20150715132222_81e386ef-2990-4251-a61f-82ca8da4c48d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (Executing on YARN cluster with App id application_1436910684121_0006)

--------------------------------------------------------------------------------

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED     19         19        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 567.52 s
--------------------------------------------------------------------------------
OK
197371741

Counting 100k records in 10 minutes is way to long to be reasonable.

I am happy with any recommendation how to debug this.

dominik
  • 613
  • 2
  • 6
  • 10
  • This is not possible to diagnose with the data you posted. At the very least post the Hive query explain and the the job logs. also make sure you format your kite datasets as Parquet, not as CSV or JSON. – Remus Rusanu Jul 15 '15 at 12:40
  • Does it have to be in Parquet? Currently I used avro for the records. I can provide the the query explain a bit later but just noticed one thing. How can this query: `select count(*) from hivetweets limit 100000;` return 197371741 as a result? Seems like its parsing and counting the whole dataset?! – dominik Jul 15 '15 at 14:12
  • That query is absolutely parsing the whole dataset and then returns the count. It *then* limits the resilts to 100000. This is SQL. – Remus Rusanu Jul 15 '15 at 14:36
  • Haven't done sql in ages and feel stupid now -.- Thanks for the hint! – dominik Jul 15 '15 at 14:56
  • 2
    As about Parquet vs. Avro: Parquet is absolutely better for Hive. Is columnar. ORC would be best, but I'm not sure Kite supports it. – Remus Rusanu Jul 15 '15 at 18:38
  • When doing a `select * from ... ` query that does not involve aggregation functions or complex filters, Hive will often execute this directly on the metastore (read: single machine) and not use MR/YARN. This might explain why the first query is slow. Golden rule with external, partitioned tables is: ALWAYS specify partitions in the where clause! – LiMuBei Jul 16 '15 at 12:44

0 Answers0