First off, apologies if this is poorly worded; I've tried to work it out myself, but I'm not clear on where it's going wrong.
I'm trying to query data in Impala which has been exported from another system.
Up till now it's been exported as a pipe-delimited text file, which I've been able to import fine by creating the table with the right delimiter set up, copying in the file and then running a REFRESH statement.
We've had some issues where some fields contain line-break characters; this makes it look like there are more rows than there really are, and the extra rows don't necessarily fit the table definition I've created.
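To illustrate the line-break problem, here is a minimal Python sketch. The record and its field values are made up for illustration and are not from the real export; it only assumes the reader splits the file on newlines before it looks at the pipe delimiter.

```python
# Hypothetical three-column record; the middle field contains a line break.
record = ["1001", "first line\nsecond line", "2019-01-17"]

# Export as pipe-delimited text, one record per line.
exported = "|".join(record) + "\n"

# A line-based text reader splits on newlines first, then on the delimiter.
rows = [line.split("|") for line in exported.splitlines()]

print(len(rows))               # 2 rows, although only 1 record was written
print([len(r) for r in rows])  # [2, 2] columns, but the table expects 3
```

So one record turns into two rows, and neither has the right number of columns.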
The suggestion was made that we could use Parquet format instead and this would cope with the internal line-breaks fine.
I've received data and it looks a bit like this (I changed the username):
-rw-r--r--+ 1 UserName Domain Users 20M Jan 17 10:15 part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 156K Jan 17 10:15 .part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 14M Jan 17 10:15 part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 110K Jan 17 10:15 .part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 0 Jan 17 10:15 _SUCCESS
-rw-r--r--+ 1 UserName Domain Users 8 Jan 17 10:15 ._SUCCESS.crc
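As a rough sketch of how the listing above splits into Parquet data files and everything else: this assumes the common Hadoop convention that names starting with `.` or `_` are checksum or marker files rather than data. `is_data_file` is a hypothetical helper, and whether Impala actually needs the .crc files is exactly the part I'm unsure about.

```python
# File names copied from the listing above.
names = [
    "part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet",
    ".part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc",
    "part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet",
    ".part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc",
    "_SUCCESS",
    "._SUCCESS.crc",
]


def is_data_file(name):
    """Assumption: names starting with '.' or '_' are sidecar or marker files."""
    return not name.startswith((".", "_")) and name.endswith(".parquet")


data_files = [n for n in names if is_data_file(n)]
sidecars = [n for n in names if not is_data_file(n)]

print(data_files)  # the two part-0000N ... .snappy.parquet files
print(sidecars)    # the .crc files and the _SUCCESS marker
```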
If I create a table stored as Parquet through Impala and then do an hdfs dfs -ls on that, I get something like the following:
-rwxrwx--x+ 3 hive hive 2103 2019-01-23 10:00 /filepath/testtable/594eb1cd032d99ad-5c13d29e00000000_1799839777_data.0.parq
drwxrwx--x+ - hive hive 0 2019-01-23 10:00 /filepath/testtable/_impala_insert_staging
Which is obviously a bit different to what I've received...
How do I create the table in Impala so that it can accept what I've received? Also, do I just need the .parquet files in there, or do I also need to put in the .parquet.crc files?
Or is what I've received not fit for purpose?
I've tried looking at the Impala documentation for this bit but I don't think that's covering it.
Is it something that I need to do with a SerDe?
I tried specifying the compression_codec as snappy, but this gave the same results.
Any help would be appreciated.