0

I have a hive table with three columns, delimited by spaces

hive (database)> describe formatted my_table;
# col_name data_type comment
field1 string 
field2 string 
field3 string
... ...
Storage Desc Params: 
input.regex (\\S+)\\s+(\\S+)\\s+(.*)

// mount dataset to table
hive (database)> ALTER TABLE lsr_staging_test SET LOCATION '/tmp/old_data.txt';
hive (database)> select * from my_table limit 2;
DataX DataX DataX
DataX DataX DataX

I want to add two new columns to the hive table schema, so I did:

// add two columns
hive (database)> alter table my_table add columns(field4 string,field5 string);
// add two more (\\S+)\\s+ for the new columns
hive (database)> alter table my_table SET SERDEPROPERTIES ("input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)");

But seems it doesn't work with the old dataset (the old data file with three columns):

hive (database)> select * from my_table limit 2;
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL

I expected something like this:

DataX DataX DataX NULL NULL
DataX DataX DataX NULL NULL

Could someone help me understand why all the fields show NULL after adding the columns to schema? Thanks!

dtolnay
  • 9,621
  • 5
  • 41
  • 62
  • `(.*)` is a greedy match. I don't think you should be capturing that, plus your regex won't match if there's only 3 columns. Please share your input file. Did you try a new file with 5 columns? Also, why use regex when you can use TSV serde – OneCricketeer Oct 12 '22 at 12:25
  • A new file with 5 columns works well. I just expect the old file with 3 columns should work as well (show data of three columns and show NULLs for the extra columns). The file is a txt file, which contains "ls -l" of the current directory, columns are separated by tabs (spaces). – Nickswaggy Oct 12 '22 at 16:14
  • Have you tried defining the table as `STORED AS TEXT FIELDS TERMINATED BY '\t'` instead? The problem with the regex is that the **whole line** needs to match, so if the text changes to have 5 matches, then it's either all-matched, or none match, and you get all NULL for the groups that exist in the pattern. Also, files and folders can have spaces, so the regex isn't "proper" for `ls -l` output – OneCricketeer Oct 12 '22 at 16:46
  • sorry they are not actually separated by tabs, separated by one or more spaces instead. So "STORED AS TEXT FIELDS TERMINATED BY '\t'" won't work. The old regex works well and let's assume that there's no empty space in the file names. From what you said, does that mean there's no way to change the regex to accommodate the old data file but we have to change the data source to 5 columns? – Nickswaggy Oct 12 '22 at 17:47
  • I'm not really sure. All I know is that NULL typically happens if the regex cannot match at all. This page has some details that might help - https://community.cloudera.com/t5/Community-Articles/Using-Regular-Expressions-to-Extract-Fields-for-Hive-Tables/ta-p/247562 – OneCricketeer Oct 12 '22 at 18:07
  • Yeah seems like you are right. Thank you! – Nickswaggy Oct 12 '22 at 21:53
  • 1
    After using this regex it works: "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)\\s*(.*)" – Nickswaggy Oct 12 '22 at 22:25
  • So now you have 10 columns? Seems like you'd still have the same problem? But feel free to answer your own post below rather than as a comment – OneCricketeer Oct 13 '22 at 13:32
  • Yea.. the real dataset has 10 columns... I use 3 columns as an example. But you are right that this won't work for file name with spaces. But I'm not sure how to adjust the 8th group in regex to make it work for all file names, because you use spaces to separate the fields while filename can have spaces.. Any ideas? – Nickswaggy Oct 24 '22 at 05:34
  • `(.*)` group will capture spaces just fine – OneCricketeer Oct 24 '22 at 17:18

0 Answers0