
We have a small Pivotal Hadoop/HAWQ cluster. We have created an external table on it that points to files in Hadoop.

Given Environment:

Product Version: (HAWQ 1.3.0.2 build 14421) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2

What I tried:

When we try to read from the external table, e.g.

test=# select count(*) from EXT_TAB;

we get the following error:

ERROR: data line too long. likely due to invalid csv data (seg0 slice1 SEG0.HOSTNAME.COM:40000 pid=447247)
DETAIL: External table trcd_stg0, line 12059 of pxf://hostname/tmp/def_rcd/?profile=HdfsTextSimple: "2012-08-06 00:00:00.0^2012-08-06 00:00:00.0^6552^2016-01-09 03:15:43.427^0005567^COMPLAINTS ..."

Additional Information:

The DDL of the external table is:

CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0
(
    "DispDt" DATE,
    "InvoiceDt" DATE,
    "ID" INTEGER,
    time timestamp without time zone,
    "Customer" CHAR(7),
    "CustomerName" CHARACTER VARYING(30),
    "MasterAccount" CHAR(7),
    "MasterAccName" CHAR(30),
    "SalesOrder" CHAR(6),
    "SalesOrderLine" NUMERIC(4, 0),
    "OrderStatus" CHAR(200),
    "MStockCode" CHAR(30),
    "MStockDes" CHARACTER VARYING(500),
    "MWarehouse" CHAR(200),
    "MOrderQty" NUMERIC(10, 3),
    "MShipQty" NUMERIC(10, 3),
    "MBackOrderQty" NUMERIC(10, 3),
    "MUnitCost" NUMERIC(15, 5),
    "MPrice" NUMERIC(15, 5),
    "MProductClass" CHAR(200),
    "Salesperson" CHAR(200),
    "CustomerPoNumber" CHAR(30),
    "OrderDate" DATE,
    "ReqShipDate" DATE,
    "DispatchesMade" CHAR(1),
    "NumDispatches" NUMERIC(4, 0),
    "OrderValue" NUMERIC(26, 8),
    "BOValue" NUMERIC(26, 8),
    "OrdQtyInEaches" NUMERIC(21, 9),
    "BOQtyInEaches" NUMERIC(21, 9),
    "DispQty" NUMERIC(38, 3),
    "DispQtyInEaches" NUMERIC(38, 9),
    "CustomerClass" CHAR(200),
    "MLineShipDate" DATE
)
LOCATION (
    'pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple'
)
FORMAT 'CSV' (delimiter '^' null '' escape '"' quote '"')
ENCODING 'UTF8';

Any help would be much appreciated.

NEO

2 Answers


Based on the source code: https://github.com/apache/incubator-hawq/blob/e48a07b0d8a5c8d41d2d4aaaa70254867b11ee11/src/backend/commands/copy.c

the error is raised when cstate->line_buf.len >= gp_max_csv_line_length. According to http://hawq.docs.pivotal.io/docs-hawq/guc_config-gp_max_csv_line_length.html

the default maximum CSV line length is 1048576 bytes (1 MB). Have you checked the length of the lines in your CSV file and tried increasing this setting?
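
As a minimal check-and-retry sketch, assuming the parameter can be changed at the session level on your build (the 4 MB value below is only an example):

-- Show the current limit (the documented default is 1048576 bytes).
SHOW gp_max_csv_line_length;

-- Raise it for this session and retry the scan; 4194304 bytes (4 MB) is just an example value.
SET gp_max_csv_line_length = 4194304;
SELECT count(*) FROM sysprocompanyb.trcd_stg0;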

Goden Yao
  • Thanks for the reply. We made the corresponding change, raising the default with SET gp_max_csv_line_length TO '4194304' (4194304 bytes), but it still throws the same error. – NEO Jan 11 '16 at 20:48

Check whether line 12059 has the expected number of delimited fields. If some lines get grouped together during parsing, the merged line can exceed the maximum line length; this typically happens because of bad data. You can count the '^'-delimited fields of a suspect line with, for example:

echo "$LINE" | awk -F'^' '{total += NF} END {print total}'
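
If you would rather do that check from inside HAWQ, a sketch along these lines may help. It assumes your build accepts delimiter 'OFF' for a single-column TEXT-format external table; trcd_stg0_raw is a hypothetical name. Reading the file as plain TEXT avoids CSV quote handling, so physical lines cannot be merged while you inspect them.

-- Hypothetical one-column table over the same location, read as plain TEXT.
CREATE READABLE EXTERNAL TABLE sysprocompanyb.trcd_stg0_raw (rawline TEXT)
LOCATION ('pxf://HOSTNAME-HA/tmp/def_rcd/?profile=HdfsTextSimple')
FORMAT 'TEXT' (delimiter 'OFF')
ENCODING 'UTF8';

-- The target table has 34 columns, so a well-formed row should contain
-- exactly 33 '^' delimiters. Flag rows that do not, or that carry stray
-- double quotes (the quote/escape character in the CSV definition).
SELECT length(rawline) AS line_len,
       length(rawline) - length(replace(rawline, '^', '')) AS delim_count,
       rawline
FROM sysprocompanyb.trcd_stg0_raw
WHERE length(rawline) - length(replace(rawline, '^', '')) <> 33
   OR position('"' in rawline) > 0;

Rows flagged here, especially any with unbalanced quotes, are the kind of bad data that can make the CSV parser glue several physical lines into one logical row that then exceeds gp_max_csv_line_length.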

dlynch