
I've got a typical CREATE TABLE statement as follows:

CREATE EXTERNAL TABLE temp_url (
    MSISDN STRING,
    TIMESTAMP STRING,
    URL STRING,
    TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/20140811/';

Where /20140811/ is a directory with gigabytes worth of data inside.

Loading the data is not a problem. Querying anything on it, however, chokes Hive up and simply gives me a number of MapReduce errors.

So instead, I'd like to ask if there's a way to load only part of the data in /20140811/. I know I can select a few files from inside the folder, dump them into another folder, and use that, but it seems tedious, especially when I've got 20 or so of these /20140811/-style directories.

Is there something like this:

CREATE EXTERNAL TABLE temp_url (
    MSISDN STRING,
    TIMESTAMP STRING,
    URL STRING,
    TIER1 STRING
)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://mybucket/input/project_blah/Half_of_20140811/';

I'm also open to non-Hive answers. Perhaps there's a way in s3cmd to quickly grab a certain amount of the data inside /20140811/ and dump it into /20140811_halved/ or something.
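(The manual approach I'm trying to avoid looks roughly like this — listing the objects under the prefix, keeping the first N, and copying them to a new prefix with s3cmd. The paths are the ones above; the count of 100 files is an arbitrary example.)

```shell
# List objects under the day's prefix; s3cmd ls prints date, time,
# size, and the s3:// URL, so awk keeps the fourth field (the URL).
s3cmd ls s3://mybucket/input/project_blah/20140811/ \
  | awk '{print $4}' \
  | head -n 100 \
  | while read -r key; do
      # Server-side copy of each selected object into a new prefix.
      s3cmd cp "$key" "s3://mybucket/input/project_blah/20140811_halved/"
    done
```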

Thanks.

zack_falcon
  • Do your files have a naming pattern? In that case, you can use wildcards to filter out/move files. – Abhishek Pathak Sep 29 '14 at 07:04
  • I would suggest actually debugging the errors when you try to query the table instead of working around them. Hive is designed to query terabytes of data. Perhaps you meant to use `s3n://` instead of `s3://`? – Joe K Sep 30 '14 at 16:30

2 Answers


I would suggest the following as a workaround:

  1. Create a temp table with the same structure (using LIKE).
  2. insert into NEW_TABLE select * from OLD_TABLE limit 1000;

You can add as many filter conditions as you need to filter out data before loading.
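Spelled out in HiveQL, those two steps might look like this (table and column names are from the question; the table name temp_url_sample and the LIMIT value are just examples — and note that whether LIKE produces a managed or external table can depend on the Hive version):

```sql
-- Step 1: new table with the same schema as temp_url.
CREATE TABLE temp_url_sample LIKE temp_url;

-- Step 2: copy only a slice of the data into it; add WHERE
-- conditions here to narrow the load further.
INSERT INTO TABLE temp_url_sample
SELECT * FROM temp_url
LIMIT 1000;
```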

Hope this helps you.

K S Nidhin
  • This is probably the simplest answer that will work, though I was looking to avoid any `SELECT WHERE LIMIT` on the initial table mostly because that'll take up time. – zack_falcon Sep 29 '14 at 10:28
  • If you have Impala synced to the same Hive metastore, this process might take less than a minute. – K S Nidhin Sep 29 '14 at 10:35

Since you say that you have 20 or so of these /20140811/ directories, why not create an external table with partitions pointing at those directories and run your queries against a single partition?
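For example (a sketch using the schema from the question; the partition column name dt is an assumption, and each day's directory is attached as its own partition):

```sql
-- Partitioned external table over the parent directory.
CREATE EXTERNAL TABLE temp_url_part (
    MSISDN STRING,
    TIMESTAMP STRING,
    URL STRING,
    TIER1 STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/input/project_blah/';

-- Point one partition at one day's directory; repeat per directory.
ALTER TABLE temp_url_part ADD PARTITION (dt='20140811')
LOCATION 's3://mybucket/input/project_blah/20140811/';

-- A query that filters on dt only reads that partition's files.
SELECT COUNT(*) FROM temp_url_part WHERE dt = '20140811';
```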

Amar
  • You mean, create a table on /project_blah/, and then add /20140811/ as partition? Wouldn't that be just the same as using the entire contents of /20140811/? If so, that's what I'm avoiding. – zack_falcon Sep 29 '14 at 09:03
  • could you tell me your directory structure for the folder over which you are creating the hive table ? – Amar Sep 29 '14 at 09:04
  • /20140811/? It's just a folder that has files in it. Said files are named something like 0000cee31wfa00_part01 – also files that Hive spat out. – zack_falcon Sep 29 '14 at 09:39
  • OK, so is it possible for you to create a folder alongside 20140811 (for example, 20140811_2), put one of the many files from the original 20140811 folder into this new folder, and then create partitions over the two directories (20140811 and 20140811_2)? When querying, you can then restrict the search specifically to 20140811_2. This would prevent the queries from operating on the entire data set and they would run faster. – Amar Sep 29 '14 at 09:55