All,

I'm trying to use the packages RJDBC, rJava, and DBI in R to extract data from a large Hive table sitting in a MapR Hive/Hadoop cluster on a remote Linux machine.

I don't have any issues connecting to the Hive cluster. The table (table1) I'm trying to extract data from is about 500M (million) rows x 16 columns.

This is the code:

options(java.parameters = "-Xmx32g")  # give the rJava JVM 32 GB of heap

library(RJDBC)
library(DBI)
library(rJava)

# Initialize the JVM and put the Hadoop core jar plus all Hive/Hadoop/user jars on the classpath
hive_dir <- "/opt/mapr/hive/hive-0.13/lib"
.jinit()
.jaddClassPath(paste(hive_dir, "hadoop-0.20.2-core.jar", sep = "/"))
.jaddClassPath(c(list.files("/opt/mapr/hadoop/hadoop-0.20.2/lib", pattern = "jar$", full.names = TRUE),
                 list.files("/opt/mapr/hive/hive-0.13/lib", pattern = "jar$", full.names = TRUE),
                 list.files("/mapr/hadoop-dir/user/userid/lib", pattern = "jar$", full.names = TRUE)))
drv <- JDBC("org.apache.hive.jdbc.HiveDriver", "hive-jdbc-0.13.0-mapr-1504.jar", identifier.quote = "`")


hive.master <- "xx.xxx.xxx.xxx:10000"
url.dbc <-  paste0("jdbc:hive2://", hive.master)
conn = dbConnect(drv, url.dbc, "user1", "xxxxxxxx")

dbSendUpdate(conn, "set hive.resultset.use.unique.column.names=false")

df <- dbGetQuery(conn, "select *  from dbname.table1 limit 1000000 ") # basically 1 million rows

The above works perfectly, and the df data.frame contains exactly what I want. However, if I remove the LIMIT from the last statement, I get an error:

df <- dbGetQuery(conn, "select *  from dbname.table1 ")

Error:

Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ",  :    Unable to retrieve JDBC result set for select *  from dbname.table1  (Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask)

I googled the last part of the error (Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask) and tried putting these two statements before the dbGetQuery call, but couldn't get rid of the error:

dbSendUpdate(conn, "SET hive.auto.convert.join=false")
dbSendUpdate(conn, "SET hive.auto.convert.join.noconditionaltask=false")

Does anybody have any idea why I'm getting this error when I remove the LIMIT from my SELECT statement? The query even works for 120 million rows with a LIMIT, though it takes a long time. At this point, the time taken is not a concern for me.
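Since time isn't a concern, I'd also be open to pulling the table down in chunks on the client side. Here's an untested sketch of what I have in mind, using RJDBC's dbSendQuery()/fetch() instead of dbGetQuery() (it assumes the same conn as above; I'm not sure it avoids the MapRedTask failure, since that looks like a server-side error):

```r
# Open the result set without materializing all rows at once
res <- dbSendQuery(conn, "select * from dbname.table1")

chunks <- list()
repeat {
  # fetch() pulls up to n rows at a time from the open result set
  chunk <- fetch(res, n = 1000000)
  if (nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
}
dbClearResult(res)

# Combine the chunks into one data.frame
df <- do.call(rbind, chunks)
```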

ML_Passion