
I have a bunch of CSV files stored in blob storage that contain records like this:

2016-04-19 20:26:01.0299,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,178.03499442294,,Good
2016-04-19 20:26:02.0303,+05:30,ecc84966-9bc0-4bef-9cd2-ad79c25be278,test001,160.205223861246,,Good

I have created an external Hive table with the following command:

CREATE EXTERNAL TABLE my_history  (
DataTimestamp Timestamp, 
TimezoneOffset String, 
SystemGuid String, 
TagName String, 
NumericValue Double, 
StringValue String
)
PARTITIONED BY (year int, month int, day int, hour int) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE LOCATION 'wasb://mycontainer@mystorage.blob.core.windows.net/';

and have managed to add many partitions like the one below, for a month's worth of data:

ALTER TABLE my_history ADD IF NOT EXISTS PARTITION (year=2016, month = 03, day= 16, hour=00)  LOCATION "Year=2016/Month=03/Day=16/Hour=00" 
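
A month's worth of data means hundreds of such statements, so a loop like the sketch below can generate them (the calendar helper and format strings here are illustrative, assuming the hiveContext available in the notebook):

    import calendar

    # Illustrative sketch: register one partition per hour for March 2016,
    # assuming the same Year=/Month=/Day=/Hour= directory naming as above.
    _, days_in_month = calendar.monthrange(2016, 3)
    for day in range(1, days_in_month + 1):
        for hour in range(24):
            hiveContext.sql(
                "ALTER TABLE my_history ADD IF NOT EXISTS "
                "PARTITION (year=2016, month=3, day={d}, hour={h}) "
                "LOCATION 'Year=2016/Month=03/Day={d:02d}/Hour={h:02d}'".format(d=day, h=hour)
            )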

There are around 135,733,286 records in the table; at least, that's what the Hive query select count(*) from my_history returns.

Now I have the following two issues:

1. Jupyter Hangs

When I execute a query like hiveContext.sql("select count(*) from my_history").show(), I get no results and no exception either, whereas running the same query from Hive gives me 135,733,286 as the result after a very long time, 400+ seconds.
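
For reference, this is the exact call, together with a variant restricted to a single partition (added here purely as an illustration) that should scan far less data, since year, month, and day are partition columns:

    # The call that hangs from Jupyter:
    hiveContext.sql("select count(*) from my_history").show()

    # Illustrative variant: restricting the query to one partition lets
    # Hive/Spark prune to a single day's files instead of the whole table.
    hiveContext.sql(
        "select count(*) from my_history "
        "where year = 2016 and month = 3 and day = 16"
    ).show()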

2. Slow Results

I tried a simple duplicate-detection query in Hive, like this:

SELECT
    my_history.DataTimestamp,
    my_history.TagName,
    COUNT(*) AS count,
    MIN(my_history.NumericValue) AS min_value,
    MAX(my_history.NumericValue) AS max_value
FROM
    default.my_history
WHERE
    my_history.TagName = 'test021'
GROUP BY
    my_history.TagName,
    my_history.DataTimestamp
HAVING
    count > 1;

It takes close to 450 seconds to return a result. I expected it to return results in a fraction of that time, as I have close to 60 cores on my HDInsight cluster. Running it from Jupyter again didn't yield any results, nor did running the same query multiple times improve the performance, although I have read that Spark caches the RDD for the next query.
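
On the caching point: Spark does not cache table data automatically between queries; it has to be requested explicitly. A minimal sketch, assuming the Spark 1.x hiveContext the notebook provides:

    # Mark the table for caching, then materialize the cache with one action;
    # subsequent queries against df can reuse the in-memory copy.
    df = hiveContext.table("default.my_history")
    df.cache()    # lazy: nothing is read yet
    df.count()    # the first action pays the full scan
    df.registerTempTable("my_history_cached")  # optional: query the cached copy via SQL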

What am I missing here?

Thanks,
Kiran

  • Did you try storing your data in a smarter format than TextFile with no compression, e.g. ORC or Parquet with GZip or Snappy? You might see a huge reduction in I/O (because of columnar storage plus compression) and probably in CPU (fewer I/O waits, faster deserialization, although decompression has a cost). A sketch of such a conversion follows these comments. – Samson Scharfrichter Apr 21 '16 at 21:50
  • About Jupyter "hanging": did you check the Spark logs in the `jupyter` console? Spark is a very verbose beast. If the driver is waiting for something, it should show there. And if the driver has crashed or gone zombie, that will certainly show. – Samson Scharfrichter Apr 21 '16 at 21:55
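
A sketch of the conversion suggested in the first comment, assuming a HiveContext and keeping the new table unpartitioned for simplicity (the table name and compression codec are illustrative):

    # Hypothetical: copy the text-backed table into a Snappy-compressed ORC table.
    # CTAS cannot create a partitioned table, so this copy is unpartitioned;
    # the partition columns (year, month, day, hour) become regular columns.
    hiveContext.sql(
        "CREATE TABLE my_history_orc STORED AS ORC "
        "TBLPROPERTIES ('orc.compress'='SNAPPY') "
        "AS SELECT * FROM my_history"
    )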

1 Answer

  1. Jupyter may hang if there are no resources in YARN to start a new Spark application for your notebook. In that case Jupyter will wait until resources become available. Resources may be consumed by other Spark applications from other notebooks. Check the YARN UI to see whether other applications are running and whether resources are free. You can kill other applications from that UI, or, in the case of notebooks, shut them down using the Jupyter "Running notebooks" UI.

  2. Slow queries may be caused by many issues. The first thing to check is to make sure your Spark application uses all available cores in YARN. In the preview, notebooks are given only around 25% of the resources by default. You can change that allocation using the %%configure command. To set the number of cores per executor to 4 and the number of executors to 15:

     %%configure -f
     {"name":"remotesparkmagics-sample", "executorMemory": "12G", "executorCores":4, "numExecutors":15}

     This should give all 60 cores to your application.

maxiluk
  • Thanks for the configure tip; now I see YARN memory shoot up to 91%. However, it's still not returning results after 800 seconds, whereas Hive gave me results in 450 seconds. BTW, I don't think there is any issue with the resources; I checked, and there is only one `remotesparkmagics` application there, and I am not running anything else on the cluster. – Kiran Apr 21 '16 at 18:04
  • It's not clear from your response: does your query still hang, or is it just slow? – maxiluk Apr 27 '16 at 04:36
  • Yes, it still hangs. I think the issue is the data being in an external table in a different resource group; if I move the data over to a managed table in the same resource group, things work as expected. – Kiran Apr 28 '16 at 04:09