I'm trying to load a CSV file with about 140 billion lines, stored on HDFS, using the Apache Phoenix bulk load tool.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/etc/hbase/conf:/etc/hadoop/conf
export HBASE_CONF_PATH=/etc/hbase/conf:/etc/hadoop/conf
hadoop jar phoenix-4.7.0.2.6.5.0-292-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -Ddfs.umaskmode=000 -zovh-mnode0,ovh-mnode1,ovh-mnode2:2181:/hbase-secure --table <mytable> --input /user/hbase/<directory for sqoop import> -d \§ -e \\
But it fails with an error at about 80% of the map tasks, always saying the error occurs at starting line 1:
Error: java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: (startline 1) EOF reached before encapsulated token finished
How can I identify the line in my CSV that causes the error? The log files returned by the yarn logs command are too general.
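One thing I'm considering is streaming a part file out of HDFS and simulating the tokenizer locally, something like the rough sketch below (delimiter §, enclosure " and escape \ are hard-coded to match my options; the part file name and the script name are just placeholders):

#!/usr/bin/env python3
# Rough sketch: flag records where a quoted token is opened but never closed
# on the same physical line (the situation behind
# "EOF reached before encapsulated token finished").
# Usage (file name is a placeholder):
#   hdfs dfs -cat /user/hbase/<directory for sqoop import>/part-m-00000 | python3 find_bad_lines.py
import sys

QUOTE = '"'     # --enclosed-by
ESCAPE = '\\'   # --escaped-by

for lineno, line in enumerate(sys.stdin, start=1):
    in_quotes = False
    escaped = False
    for ch in line.rstrip('\n'):
        if escaped:
            escaped = False              # this character is escaped, ignore it
        elif ch == ESCAPE:
            escaped = True
        elif ch == QUOTE:
            in_quotes = not in_quotes
    if in_quotes or escaped:
        # enclosure still open, or the line ends on a bare escape character:
        # either way the CSV parser would read past the end of the record
        print(f"suspicious record at line {lineno}: {line.rstrip()}")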
My CSV was generated by a Sqoop import from a SQL table, with the following options:
-D mapreduce.job.queuename=default --fields-terminated-by \§ --escaped-by \\ --enclosed-by '\"' --null-string 'null' --null-non-string '0'
My custom Sqoop query applies a cleaning function to each varchar field, to make sure my CSV cannot break the parser. Example on the "lang" field:
replace(replace(replace(replace(ifnull(lang,"null"),"\""," "),"\n"," "),"\r"," "),"§"," ")
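To double-check what that chain actually covers, I mirrored it in a small Python snippet (my own offline test, not part of the pipeline). It confirms that the enclosure, newlines and the delimiter are neutralised, but the escape character \ is left untouched:

# Offline mirror of the replace() chain above, just to see which characters it handles.
def clean(value):
    if value is None:               # ifnull(lang, "null")
        value = "null"
    for ch in ('"', '\n', '\r', '§'):
        value = value.replace(ch, ' ')
    return value

# Enclosure, newlines and delimiter are all replaced with spaces...
print(repr(clean('he said "hi"\nnext§field')))
# ...but the escape character is not, so a value ending in '\' goes through unchanged.
print(repr(clean('ends with a backslash \\')))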
Options "-g" on bulkloadtool to skip errors does not work (known bug).