Sampling Issue with hive

Question

"all_members" is a table in hive with 10m rows and 1 column: "membership_nbr". I want to sample 3000 rows. This is what I have done:

hive>create table sample_members as select * from all_members limit 1;
hive>insert overwrite table sample_members select membership_nbr from all_members tablesample(3000 rows);
hive>select count(*) from sample_members;

OK 45000

The result wont change if I replace 3000 rows with 300 rows Do I do something wrong?

anand · Accepted Answer · 2015-10-01T06:55:06.643

Table Sampling using tablesample(3000 rows) wont fetch 3000 rows from entire table instead it will fetch 3000 rows from each input split.

So, your query might run 15 mappers. So, each mapper will fetch 3000 rows. Totally, 3000 * 15 = 45000 rows. Also, if you change the 3000 rows to 300 rows you will get 4500 rows as output after sampling.

So, as per your requirement you have to give tablesample(200 rows). As a result each mapper will fetch 200 rows. Finally, 15 mappers will fetch 3000 sampling rows.

Refer the below link for various types of sampling: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

Sampling Issue with hive

1 Answers1