How to make hive return rows faster? (say, limit 10)

Question

I noticed that when MR is involved, even for simply queries like:

select * from table where condition_a and condition_b limit 10;

Hive will scan all the data (sure), but it will only return when all the MR jobs finish, even we only need 10 results.

Is there a way to tell hive just return the first 10 rows, then skip the result of the job?

Note, I am taking about when MR job is involved, not for the simple case where no actual MR job is created.

Thanks!

It would run faster if you used Tez or Spark execution engines, by the way — OneCricketeer, Nov 21 '19 at 05:33

score 1 · Answer 1 · answered Nov 21 '19 at 09:30

In my point of view, Hive is not like hbase, if You are looking for a hbase like speed when querying then you are not going to get it in hive.

Hive is meant for batch processing which deals with large amount of data unlike hbase which looks for only one data at a time. When you try to run a hive query it will run the MR job, the MR job will run exactly the same process for both 1 billion records and 10 records.

So it may be faster when you set some environmental variables mentioned in other answers.but you cannot expect the speed of hbase or RDBMS or other databases when dealing with small data

score 1 · Answer 2 · answered Nov 21 '19 at 12:55

You can use hive's limit optimization properties to achieve desired results link

 <property>
    <name>hive.limit.optimize.enable</name>
    <value>true</value>
    <description>Whether to enable to optimization to trying a smaller subset of data for simple LIMIT first.</description>
  </property>
  <property>
    <name>hive.limit.row.max.size</name>
    <value>100000</value>
    <description>When trying a smaller subset of data for simple LIMIT, how much size we need to guarantee each row to have at least.</description>
  </property>
  <property>
    <name>hive.limit.optimize.limit.file</name>
    <value>10</value>
    <description>When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.</description>
  </property>
  <property>
    <name>hive.limit.optimize.fetch.max</name>
    <value>50000</value>
    <description>
      Maximum number of rows allowed for a smaller subset of data for simple LIMIT, if it is a fetch query. 
      Insert queries are not restricted by this limit.
    </description>
  </property>

You can also refer to these properties from hive docs

score 0 · Answer 3 · edited Nov 21 '19 at 06:01

0

Can you try

hive.mapred.mode = nonstrict

and

hive.map.aggr = false

apart from these you can try sort by and distribute by.

edited Nov 21 '19 at 06:01

Gaurav Jeswani

4,410
6
26
47

answered Nov 21 '19 at 05:38

Sasra S

21
1

score 0 · Answer 4 · answered Nov 21 '19 at 07:42

Since, it's a mapreduce way. It needs to wait before actually returning the results(i.e. need to commit the results first).

If you need the same query results faster, Please note same query, then cache the results for faster retrieval next time.

set hive.query.results.cache.enabled=true

Or use the spark to fetch the results faster. Spark does most of the calculations in RAM. Also to make it more faster cache the query results.

 val SPARK:SparkSession = SparkSession
    .builder
    .appName("Test")
    .master("local[2]")
    .config("spark.sql.warehouse",new File("spark-warehouse").getAbsolutePath)
    .enableHiveSupport()
    .getOrCreate()

val SQL_CONTEXT:SQLContext = SPARK.sqlContext

val res = SQL_CONTEXT.sql("select * from table where condition_a and condition_b limit 10").cache()

How to make hive return rows faster? (say, limit 10)

4 Answers4