I am running Hive 0.14 on a HDP2 cluster. My dataset was built using the kite sdk and registered to Hive using external tables.
See my table layout below:
hive> describe hivetweets;
OK
created_at bigint from deserializer
id bigint from deserializer
in_reply_to_user_id bigint from deserializer
in_reply_to_status_id bigint from deserializer
lang string from deserializer
text string from deserializer
retweet_count int from deserializer
year int Partition column derived from 'created_at' column, generated by Kite.
month int Partition column derived from 'created_at' column, generated by Kite.
day int Partition column derived from 'created_at' column, generated by Kite.
hour int Partition column derived from 'created_at' column, generated by Kite.
# Partition Information
# col_name data_type comment
year int Partition column derived from 'created_at' column, generated by Kite.
month int Partition column derived from 'created_at' column, generated by Kite.
day int Partition column derived from 'created_at' column, generated by Kite.
hour int Partition column derived from 'created_at' column, generated by Kite.
Time taken: 0.15 seconds, Fetched: 19 row(s)
My initial test query to this setup was to get just a single row of the dataset (I removed the actual output in the example):
hive> select * from hivetweets limit 1;
OK
Time taken: 103.726 seconds, Fetched: 1 row(s)
104 seconds to run this query is imho way to long.
This probably didn't run distributed so I tried to test it with more data:
hive> select count(*) from hivetweets limit 100000;
Query ID = root_20150715132222_81e386ef-2990-4251-a61f-82ca8da4c48d
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (Executing on YARN cluster with App id application_1436910684121_0006)
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 19 19 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 567.52 s
--------------------------------------------------------------------------------
OK
197371741
Counting 100k records in 10 minutes is way to long to be reasonable.
I am happy with any recommendation how to debug this.