I have a script on a Linux server (Cloudera) that uses unixODBC/isql to pull data from a Netezza server. It generates a text file on the local filesystem and then pushes it to HDFS.
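# Reset any inherited ODBC/library settings, then point the environment at the Netezza ODBC config and client libraries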
export ODBCHOME=;
export ODBCINI=;
export LIBPATH=;
export LD_LIBRARY_PATH=;
NZ_ODBC_INI_PATH=/home/myuser/sql/;
export NZ_ODBC_INI_PATH;
export ODBCINI=/home/myuser/sql/odbc.ini;
export NZ_HOME=/usr/local/nz;
export PATH=$PATH:$NZ_HOME/bin64;
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$NZ_HOME/lib64;
export NZ_HOST=myntzserver
EXTRACT_PATH=/home/myuser/trgt_path/;
export EXTRACT_PATH;
cd $EXTRACT_PATH;
hive_char=`printf "\035\n"`
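# Unload the source table to a local delimited text file via a Netezza external table (REMOTESOURCE 'ODBC')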
isql nettezaServer1 tsg1dev \$\$infadmin11 -v<<EOF
drop table test_1;
CREATE EXTERNAL TABLE test_1 ( GRP BIGINT, ID BIGINT, DATA CHARACTER VARYING(10) ) USING (DATAOBJECT ("$EXTRACT_PATH/test_1.txt") REMOTESOURCE 'ODBC' delimiter 2);
insert into test_1 SELECT GRP,ID,DATA FROM SCHEMA1..test_table;
EOF
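# Push the finished extract from the local filesystem into HDFS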
hdfs dfs -put /home/myuser/trgt_path/test_1.txt /user/projects/my_dir/
I am wondering whether there is a way to either connect directly to HDFS or pipe the data to HDFS as it comes in. In other words, I want to skip creating the text file on the local filesystem because of space constraints (the source table is hundreds of GB).
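One idea I have (untested) is to make the extract path a named pipe and have hdfs dfs -put stream from it while the external-table unload writes into it, so the data never gets staged as a full file on local disk. I don't know whether Netezza's REMOTESOURCE 'ODBC' unload will actually write into a FIFO, so treat this as a rough sketch rather than something I have working:
# Rough, untested sketch: replace the local file with a FIFO and stream it into HDFS.
# Assumes "hdfs dfs -put -" reads from stdin and that the Netezza external table
# is willing to write into a named pipe -- I have not verified the latter.
mkfifo "$EXTRACT_PATH/test_1.txt"
# Reader: stream the pipe into HDFS in the background
hdfs dfs -put - /user/projects/my_dir/test_1.txt < "$EXTRACT_PATH/test_1.txt" &
PUT_PID=$!
# Writer: the same external-table unload as above, now writing into the FIFO
isql nettezaServer1 tsg1dev \$\$infadmin11 -v<<EOF
drop table test_1;
CREATE EXTERNAL TABLE test_1 ( GRP BIGINT, ID BIGINT, DATA CHARACTER VARYING(10) ) USING (DATAOBJECT ("$EXTRACT_PATH/test_1.txt") REMOTESOURCE 'ODBC' delimiter 2);
insert into test_1 SELECT GRP,ID,DATA FROM SCHEMA1..test_table;
EOF
# Wait for the HDFS write to finish, then clean up the FIFO
wait $PUT_PID
rm "$EXTRACT_PATH/test_1.txt"
The intent is that data only ever passes through the pipe buffer instead of being written out as a multi-hundred-GB file first.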
The reason I am not using sqoop is that it takes too long for the project requirements. The script above creates the text file in 2-3 hours, whereas sqoop takes 7-8 hours on average to load the data. The sqoop expert on my team tried to fine-tune the command to bring the time down, but had no luck.
To summarize, is it possible to skip the local filesystem and load this data directly to HDFS?