I have a bunch of CSV files stored in blob storage that contain records like this:
2016-04-19 20:26:01.0299,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,178.03499442294,,Good
2016-04-19 20:26:02.0303,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,160.205223861246,,Good
I have created an external Hive table with the following command:
CREATE EXTERNAL TABLE my_history (
    DataTimestamp Timestamp,
    TimezoneOffset String,
    SystemGuid String,
    TagName String,
    NumericValue Double,
    StringValue String
)
PARTITIONED BY (year int, month int, day int, hour int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION 'wasb://mycontainer@mystorage.blob.core.windows.net/';
and have managed to add many partitions like the one below for a month's worth of data:
ALTER TABLE my_history ADD IF NOT EXISTS PARTITION (year=2016, month = 03, day= 16, hour=00) LOCATION "Year=2016/Month=03/Day=16/Hour=00"
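Each hour gets its own partition, so a month is roughly 31 x 24 statements like the one above. A rough sketch of scripting them (it assumes the Year=/Month=/Day=/Hour= folder layout above and a hiveContext like the one I use from Jupyter further down):

from datetime import datetime, timedelta

# Sketch: register one partition per hour for March 2016, mirroring the
# Year=/Month=/Day=/Hour= folder layout relative to the table location.
# Assumes `hiveContext` is already available (see the Jupyter cell below).
current = datetime(2016, 3, 1)
end = datetime(2016, 4, 1)
while current < end:
    hiveContext.sql(
        "ALTER TABLE my_history ADD IF NOT EXISTS "
        "PARTITION (year={0}, month={1:02d}, day={2:02d}, hour={3:02d}) "
        "LOCATION 'Year={0}/Month={1:02d}/Day={2:02d}/Hour={3:02d}'".format(
            current.year, current.month, current.day, current.hour))
    current += timedelta(hours=1)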
There are around 135,733,286 records in the table, at least according to running select count(*) from my_history in Hive.
Now I have the following two issues:
1. Jupyter Hangs
When I execute a query like this from Jupyter
hiveContext.sql("select count(*) from my_history").show()
I get no results, not even an exception, whereas running the same query from Hive returns 135,733,286 after a very long time, 400+ seconds.
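For completeness, the hiveContext in that cell is built like this (a minimal sketch; it assumes the PySpark kernel on the HDInsight cluster, which provides the SparkContext sc automatically):

from pyspark.sql import HiveContext

# Sketch of the full notebook cell: build a HiveContext from the
# kernel-provided SparkContext `sc`, then run the count that hangs.
hiveContext = HiveContext(sc)
hiveContext.sql("select count(*) from my_history").show()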
2. Slow Results
I tried a simple duplicate-detection query in Hive like this:
SELECT
    my_history.DataTimestamp,
    my_history.TagName,
    COUNT(*) AS count,
    MIN(my_history.NumericValue) AS min_value,
    MAX(my_history.NumericValue) AS max_value
FROM
    default.my_history
WHERE
    my_history.TagName = 'test021'
GROUP BY
    my_history.TagName,
    my_history.DataTimestamp
HAVING
    count > 1;
It takes close to 450 seconds to return results. I expected it to come back in a fraction of that time, as I have close to 60 cores on my HDInsight cluster. Running it from Jupyter again didn't yield any results, and running the same query multiple times didn't improve the performance either, even though I have read that Spark caches the RDD for the next query.
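Should I be caching the table explicitly before re-running it instead, something along these lines (a sketch; it assumes the Spark 1.x cacheTable API on the same hiveContext)?

# Sketch: cache the Hive table explicitly, then re-run the duplicate check,
# instead of relying on any implicit caching between queries.
hiveContext.cacheTable("my_history")
hiveContext.sql("""
    SELECT TagName, DataTimestamp, COUNT(*) AS cnt,
           MIN(NumericValue) AS min_value, MAX(NumericValue) AS max_value
    FROM my_history
    WHERE TagName = 'test021'
    GROUP BY TagName, DataTimestamp
    HAVING COUNT(*) > 1
""").show()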
What am I missing here?
Thanks, Kiran