
There's a strange problem I'm experiencing, and I assure you I've googled a lot.

I'm running a set of AWS Elastic MapReduce clusters, and I have a Hive table with about 16 partitions. They're created by emr-s3distcp (since there are about 216K files in the original S3 bucket), using --groupBy and with the size limit set to 64 MiB (the DFS block size in this case). They're simply text files with one JSON object per line, read through a JSON SerDe.
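For reference, the copy step described above looks roughly like this (a sketch: the jar path, bucket names, and groupBy pattern are placeholders, not the actual values from my cluster):

```shell
# Consolidate ~216K small S3 files into ~64 MiB HDFS files
# so each file matches the DFS block size.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src s3://my-bucket/json-logs/ \
  --dest hdfs:///data/json-logs/ \
  --groupBy '.*/(\w+)/.*\.json' \
  --targetSize 64
```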

When I run my Hive script against this table, it takes ages and then gives up with IPC connection failures.

Originally, the strain from s3distcp onto HDFS was so high that I took some measures: resizing to higher-capacity machines, setting dfs.replication to 3 (it's a small cluster; the EMR default for clusters with fewer than three core nodes is 2), and setting the block size to 64 MiB. That worked, and the number of under-replicated blocks dropped to zero.
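The HDFS settings mentioned above would correspond to something like the following in hdfs-site.xml (a sketch of the assumed configuration, not a dump from my cluster):

```xml
<!-- hdfs-site.xml: 3-fold replication and a 64 MiB block size -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>67108864</value> <!-- 64 * 1024 * 1024 bytes -->
</property>
```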

Looking at /mnt/var/log/apps/hive_081.log yields several lines like this:

2013-05-12 09:56:12,120 DEBUG org.apache.hadoop.ipc.Client (Client.java:<init>(222)) - The ping interval is60000ms.
2013-05-12 09:56:12,120 DEBUG org.apache.hadoop.ipc.Client (Client.java:<init>(265)) - Use SIMPLE authentication for protocol ClientProtocol
2013-05-12 09:56:12,120 DEBUG org.apache.hadoop.ipc.Client (Client.java:setupIOstreams(551)) - Connecting to /10.17.17.243:9000
2013-05-12 09:56:12,121 DEBUG org.apache.hadoop.ipc.Client (Client.java:sendParam(769)) - IPC Client (47) connection to /10.17.17.243:9000 from hadoop sending #14
2013-05-12 09:56:12,121 DEBUG org.apache.hadoop.ipc.Client (Client.java:run(742)) - IPC Client (47) connection to /10.17.17.243:9000 from hadoop: starting, having connections 2
2013-05-12 09:56:12,125 DEBUG org.apache.hadoop.ipc.Client (Client.java:receiveResponse(804)) - IPC Client (47) connection to /10.17.17.243:9000 from hadoop got value #14
2013-05-12 09:56:12,126 DEBUG org.apache.hadoop.ipc.RPC (RPC.java:invoke(228)) - Call: getFileInfo 6
2013-05-12 09:56:21,523 INFO  org.apache.hadoop.ipc.Client (Client.java:handleConnectionFailure(663)) - Retrying connect to server: domU-12-31-39-10-81-2A.compute-1.internal/10.198.130.216:9000. Already tried 6 time(s).
2013-05-12 09:56:22,122 DEBUG org.apache.hadoop.ipc.Client (Client.java:close(876)) - IPC Client (47) connection to /10.17.17.243:9000 from hadoop: closed
2013-05-12 09:56:22,122 DEBUG org.apache.hadoop.ipc.Client (Client.java:run(752)) - IPC Client (47) connection to /10.17.17.243:9000 from hadoop: stopped, remaining connections 1
2013-05-12 09:56:42,544 INFO  org.apache.hadoop.ipc.Client (Client.java:handleConnectionFailure(663)) - Retrying connect to server: domU-12-31-39-10-81-2A.compute-1.internal/10.198.130.216:9000. Already tried 7 time(s).

And so on and on until one of the clients hits a limit.

What does it take to fix this in Hive under Elastic MapReduce?

Thanks

aldrinleal

1 Answer


After a while, I noticed that the offending IP address wasn't even in my cluster: the Hive metastore was holding on to stale HDFS locations from a previous cluster. I fixed that by:

CREATE TABLE whatever_2 LIKE whatever LOCATION <hdfs_location>;

ALTER TABLE whatever_2 RECOVER PARTITIONS;
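One way to confirm this kind of mismatch before recreating the table (a sketch; the config path assumes the standard EMR layout of that era, adjust as needed):

```shell
# Find the NameNode address the running cluster actually uses...
grep -A1 'fs.default.name' /home/hadoop/conf/core-site.xml

# ...and compare it with the address Hive keeps retrying in the log.
# If they differ, the metastore's table/partition locations are stale.
```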

Hope it helps.
