
Following is the gist of my problem.

Env: Hadoop 2 (CDH 5.1), Database: Oracle 11g

Scenario: I'm sqooping fact and dimension tables from the database into HDFS. Initially I had challenges handling nulls (handled using --null-string and --null-non-string, which were set to \N as per the recommendation). Everything was fine as long as the Hive table was built with string fields, even for dates and numerics.
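
For example, with those options set to \N, a check like the following (placeholder table and column names) can be used to confirm the Oracle NULLs survive the import and show up as real NULLs in Hive:

SELECT COUNT(*)
FROM stg_fact_sales          -- hypothetical table built on the imported files
WHERE discount_amt IS NULL;  -- rows that were NULL in Oracle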

Solution so far: Based on a recommendation, I moved to importing in Avro format. I've built the Hive table on the Avro data and I'm able to query the tables. Now I need to create Hive joins and convert all the fields to their required types, e.g. dates to date/timestamp, numerics to int/bigint, etc. After the Sqoop import, the generated Avro schema had converted all the date fields to long, and the Hive table shows bigint for those columns.
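
For illustration, the kind of join I need looks roughly like the following (fact_sales, dim_date and their columns are just placeholder names); the date columns used in the join come out of the Avro import as bigint:

SELECT d.cal_dte, SUM(f.sale_amt)
FROM fact_sales f
JOIN dim_date d
  ON f.txn_dte = d.cal_dte   -- both date columns are bigint after the Avro import
GROUP BY d.cal_dte;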

I'm confused about how Sqoop handles nulls and about how those are to be handled in Hive/HDFS MR, etc.

Could anybody suggest any practice that has been adopted that could be leveraged?

Thanks Venkatesh

venBigData

3 Answers


It was a problem for me too when I imported schemas from Parquet tables, as Parquet stores timestamps as bigint. So I guess the underlying problem is that Parquet does not have a separate data type to store timestamps. I don't use Avro very often, but I think the same is true for Avro. So if you sqoop an Oracle date/timestamp into a set of Parquet/Avro files, then the storage type (bigint) is how it is stored, not how you want to access it (as timestamp/date).

That time is stored as the number of milliseconds since the UNIX epoch (Jan 1st, 1970). Hive/Spark/Impala have a from_unixtime() function that takes a number of seconds, so the solution is to convert those millisecond values to second resolution:

SELECT .. 
, from_unixtime(cast(bigint_column/1000 as bigint))

So you will see timestamps like:

 1999-04-14 06:00:00 
 1999-04-15 06:00:00

Notice the 6-hour shift. In my case the original Oracle data type was DATE without any time part (00:00:00), but I got the time shifted by 6 hours because of my timezone (MST). So to get the exact dates:

SELECT .. 
, from_unixtime(cast(bigint_column/1000 - 6*3600 as bigint))

which resulted in:

 1999-04-14 00:00:00 
 1999-04-15 00:00:00
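
To avoid repeating that arithmetic in every query, you could wrap the conversion in a view (just a sketch; imported_table, bigint_column and the 6-hour MST offset are the same placeholders/assumptions as above):

CREATE VIEW imported_table_ts AS
SELECT t.*,
       from_unixtime(cast(t.bigint_column/1000 - 6*3600 as bigint)) AS event_ts
FROM imported_table t;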

PS. See "Data Type Considerations for Parquet Tables", http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_parquet.html#parquet_data_types_unique_1 :

INT96 -> TIMESTAMP

Tagar

Thanks, Gergely. The approach we followed to overcome this issue was to import the date fields as String type when sqooping into HDFS. This was achieved using:

sqoop --options-file $OPTION_FILE_NAME \
  --table $TABLE_NAME \
  --map-column-java DAY_END_DTE=String \
  --target-dir $TARGET_DIR \
  --as-avrodatafile

This causes the timestamp information to be sqooped as a string in 'yyyy-mm-dd hh:mm:ss.f' format, which can then be cast into a date field.
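
Once the column arrives as a string in that format, the conversion in Hive is a plain cast (a sketch; the table name below is a placeholder):

SELECT cast(DAY_END_DTE as timestamp)          AS day_end_ts,
       to_date(cast(DAY_END_DTE as timestamp)) AS day_end_dt
FROM my_avro_table;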

venBigData

It is not a solution, it is a workaround:

You can convert the imported data to timestamp with this command:

select cast(long_column as TIMESTAMP) from imported_table;

BR, Gergely

gszecsenyi