
In SQL Data Warehouse (editors, please don't change this; it is the actual name, see here) I have a JobCandidate_ext external table that looks like this:

CREATE EXTERNAL TABLE [HumanResources].[JobCandidate_ext](
    [JobCandidateID] int,
    [BusinessEntityID] int,
    [Resume] Varchar(8000),
    [ModifiedDate] Datetime
)
WITH (
    LOCATION='/[HumanResources].[JobCandidate]/data.txt',
    DATA_SOURCE=AzureStorage,
    FILE_FORMAT=TextFile)
GO

The [Resume] column was an XML type in SQL Server, but in SQL Data Warehouse XML types should be converted to varchar(8000), as described here.
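For example, the conversion can be done on the SQL Server side when generating the export. This is only a sketch against the AdventureWorks source table; note that CONVERT to varchar(8000) would silently truncate any resume longer than 8000 characters:

```sql
-- Sketch: cast the XML column to varchar(8000) before exporting
-- to the flat file, so the external table column types line up.
SELECT  JobCandidateID,
        BusinessEntityID,
        CONVERT(varchar(8000), [Resume]) AS [Resume],
        ModifiedDate
FROM    HumanResources.JobCandidate;
```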

I am using a flat file data.txt to export the data to a blob and then create an external table from it.

The [Resume] column contains carriage returns (as expected from an XML document), so when you run SELECT * FROM [HumanResources].[JobCandidate_ext] you get an error. In this case:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 2 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 0, Expected data type: INT, Offending value: some text .... (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

I know that I cannot configure a row delimiter when creating external tables as described here.

The row delimiter must be UTF-8 and supported by Hadoop’s LineRecordReader. The row delimiter must be either '\r', '\n', or '\r\n'. These are not user-configurable.

And if you try to wrap each column field in quotes, selecting rows from the external table fails with a "No closing string delimiter" error:

Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/[HumanResources].[JobCandidate]/data.txt)Column ordinal: 2, Expected data type: VARCHAR(8000) collate SQL_Latin1_General_CP1_CI_AS, Offending value: 'ShaiBassli (Tokenization failed), Error: No closing string delimiter.

Is there a way to get around this issue?

  • Unfortunately I don't believe there is a way around this issue currently other than ensuring your string delimiter and row delimiter are characters that never appear in the data. Please vote for: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-line-ends-within-qualified-text-f – GregGalloway Mar 22 '16 at 00:47

1 Answer


Today, PolyBase does not allow row or field delimiters inside fields; that is, it does not let you escape these characters. As Greg pointed out, you can vote for this functionality here: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10600132-polybase-allow-line-ends-within-qualified-text-f

To work around this limitation, you can either pre-process the data (using sed or tr, for example) to replace the unwanted characters before PolyBase reads it, or switch to another PolyBase-supported file format (RCFile, ORC, or Parquet) to avoid dealing with row and field delimiters entirely.
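A minimal sketch of the pre-processing option with tr. The file names are hypothetical, and it assumes the embedded breaks inside [Resume] are bare carriage returns while '\n' terminates rows:

```shell
# Sample export row: the Resume field contains an embedded '\r';
# '\n' is the real row terminator. File names are hypothetical.
printf '1|<Resume>line one\rline two</Resume>|2016-03-22\n' > data.txt

# Replace every carriage return with a space so '\n' remains the only
# newline character PolyBase will see:
tr '\r' ' ' < data.txt > data_clean.txt

cat data_clean.txt   # 1|<Resume>line one line two</Resume>|2016-03-22
```

If the fields contain full '\r\n' pairs you would need a different replacement (for example sed collapsing in-field newlines), since deleting only '\r' would still leave a stray '\n' inside the field.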