
We have a small Pivotal Hadoop/HAWQ cluster. We have created an external table on it that points to files in Hadoop.

Given Environment:

Product Version: (HAWQ 1.3.0.2 build 14421) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2

What I tried:

When we try to read from the external table, e.g.

test=# select count(*) from EXT_TAB;

we get the following error:

ERROR: data line too long. likely due to invalid csv data (seg0 slice1 SEG0.HOSTNAME.COM:40000 pid=447247)
DETAIL: External table trcd_stg0, line 12059 of pxf://hostname/tmp/def_rcd/?profile=HdfsTextSimple: "2012-08-06 00:00:00.0^2012-08-06 00:00:00.0^6552^2016-01-09 03:15:43.427^0005567^COMPLAINTS ..."

Additional Information:

The DDL of the external table is:

CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0
(
    "DispDt" DATE,
    "InvoiceDt" DATE,
    "ID" INTEGER,
    time timestamp without time zone,
    "Customer" CHAR(7),
    "CustomerName" CHARACTER VARYING(30),
    "MasterAccount" CHAR(7),
    "MasterAccName" CHAR(30),
    "SalesOrder" CHAR(6),
    "SalesOrderLine" NUMERIC(4, 0),
    "OrderStatus" CHAR(200),
    "MStockCode" CHAR(30),
    "MStockDes" CHARACTER VARYING(500),
    "MWarehouse" CHAR(200),
    "MOrderQty" NUMERIC(10, 3),
    "MShipQty" NUMERIC(10, 3),
    "MBackOrderQty" NUMERIC(10, 3),
    "MUnitCost" NUMERIC(15, 5),
    "MPrice" NUMERIC(15, 5),
    "MProductClass" CHAR(200),
    "Salesperson" CHAR(200),
    "CustomerPoNumber" CHAR(30),
    "OrderDate" DATE,
    "ReqShipDate" DATE,
    "DispatchesMade" CHAR(1),
    "NumDispatches" NUMERIC(4, 0),
    "OrderValue" NUMERIC(26, 8),
    "BOValue" NUMERIC(26, 8),
    "OrdQtyInEaches" NUMERIC(21, 9),
    "BOQtyInEaches" NUMERIC(21, 9),
    "DispQty" NUMERIC(38, 3),
    "DispQtyInEaches" NUMERIC(38, 9),
    "CustomerClass" CHAR(200),
    "MLineShipDate" DATE
)
LOCATION (
    'pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple'
)
FORMAT 'CSV' (delimiter '^' null '' escape '"' quote '"')
ENCODING 'UTF8';

Any help would be much appreciated.

NEO

2 Answers


Based on the source code: https://github.com/apache/incubator-hawq/blob/e48a07b0d8a5c8d41d2d4aaaa70254867b11ee11/src/backend/commands/copy.c

the error is raised when cstate->line_buf.len >= gp_max_csv_line_length. According to http://hawq.docs.pivotal.io/docs-hawq/guc_config-gp_max_csv_line_length.html

the default maximum CSV line length is 1048576 bytes (1 MB). Have you checked the length of the lines in your CSV file and tried increasing this setting?
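
As a minimal check-and-retry sketch, assuming the parameter can be changed at the session level on your build (the 4 MB value below is only an example):

-- Show the current limit (the documented default is 1048576 bytes).
SHOW gp_max_csv_line_length;

-- Raise it for this session and retry the scan; 4194304 bytes (4 MB) is just an example value.
SET gp_max_csv_line_length = 4194304;
SELECT count(*) FROM sysprocompanyb.trcd_stg0;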

Goden Yao
  • Thanks for the reply. We made the corresponding change, raising the default with SET gp_max_csv_line_length TO '4194304' (4194304 bytes), but it still throws the same error. – NEO Jan 11 '16 at 20:48

Check whether line 12059 has the expected number of delimited fields. If some lines get grouped together during parsing, the merged line can exceed the maximum line length; this typically happens because of bad data. You can count the '^'-delimited fields of a suspect line with, for example:

echo "$LINE" | awk -F'^' '{total += NF} END {print total}'
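
If you would rather do that check from inside HAWQ, a sketch along these lines may help. It assumes your build accepts delimiter 'OFF' for a single-column TEXT-format external table; trcd_stg0_raw is a hypothetical name. Reading the file as plain TEXT avoids CSV quote handling, so physical lines cannot be merged while you inspect them.

-- Hypothetical one-column table over the same location, read as plain TEXT.
CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0_raw (rawline TEXT)
LOCATION ('pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter 'OFF')
ENCODING 'UTF8';

-- The target table has 34 columns, so a well-formed row should contain
-- exactly 33 '^' delimiters. Flag rows that do not, or that carry stray
-- double quotes (the quote/escape character in the CSV definition).
SELECT length(rawline) AS line_len,
       length(rawline) - length(replace(rawline, '^', '')) AS delim_count,
       rawline
FROM sysprocompanyb.trcd_stg0_raw
WHERE length(rawline) - length(replace(rawline, '^', '')) <> 33
   OR position('"' in rawline) > 0;

Rows flagged here, especially any with unbalanced quotes, are the kind of bad data that can make the CSV parser glue several physical lines into one logical row that then exceeds gp_max_csv_line_length.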

dlynch