I have data that I extracted from SQL Server using BCP; the file is an ASCII CSV.
Dates are in the 2016-03-03T23:00:00 format.

When running the extract, I get:

Additional information:

{"diagnosticCode":195887127,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_COLUMN_CONVERSION_INVALID_ERROR","message":"Invalid character when attempting to convert column data.","description":"HEX: \"223022\" Invalid character when converting input record.\nPosition: line 1, column 21.","resolution":"Check the input for errors or use \"silent\" switch to ignore over(under)-sized rows in the input.\nConsider that ignoring \"invalid\" rows may influence job results and that types have to be nullable for conversion errors to be ignored.","helpLink":"","details":"============================================================================================\nHEX:5432333B35313B34362D323031362E30332E30335432333B30303B30302D302D352D323031362E30332E30335432333B35313B34392F3536372D302D323031362E30332E3033\n ^\nTEXT:T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03\n

How do you handle dates properly on extraction? It's unclear to me why it is splitting in the middle of a datetime column.

A sample row looks like this:

50CA2FBB-95C3-4216-A729-999BE2DB491A,2016-03-03T23:51:49.567,1001464881,1001464795,1001464795,00000000-0000-0000-0000-000000000000,00000000-0000-0000-0000-000000000000,100 ,100 , ,12643,bCAwvRnNVwrKDXKxZkVed2Z1zHY=,o2lsnhueDApmvSbm31mh3aetYnc=,2016-03-03T23:50:46,2016-03-03T23:00:00,2016-03-03T23:51:46,2016-03-03T23:00:00,0,5,2016-03-03T23:51:49.567,0,2016-03-03T00:00:00,2016-03-03T23:59:59,00000000-0000-0000-0000-000000000000

The EXTRACT statement is:
@res =
EXTRACT 
        LicenseId Guid,
        EntryDate DateTime,
        UltimateId long,
        SiteId string,
        VirtualId string,
        ProjectId Guid,
        DocumentId Guid,
        MasterId string,
        ProductId string,
        FeatureString string,
        VersionId long,
        ComputerSid string,
        UserSid string,
        AppStartTime DateTime,
        StartHour DateTime,
        AppStopTime DateTime,
        StopHour DateTime,
        GmtDelta int,
        RecordedGmtDelta int,
        LastUpdated DateTime,
        Processed bool,
        StartDate DateTime,
        EndDate DateTime,
        ImsId Guid
FROM @dataFile
USING Extractors.Csv();

1 Answer


The default encoding of the built-in extractors is Encoding.UTF8. So most likely, the three-byte sequence you see is being interpreted as UTF-8 and not ASCII.

If your BCP output really only contains code points in the ASCII range (0-127), and not ANSI 8-bit characters, you can specify Extractors.Csv(encoding: Encoding.[ASCII]) (note the [] around ASCII to escape it from the reserved-keyword rule).
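For illustration, here is a minimal sketch that passes that encoding to the built-in extractor. The file paths and the three-column schema are placeholders, not your actual schema; only the USING clause matters here:

@rows =
    EXTRACT Id Guid,
            EntryDate DateTime,
            Processed bool
    FROM "/input/sample.csv"
    USING Extractors.Csv(encoding: Encoding.[ASCII]);  // input contains only code points 0-127

OUTPUT @rows
TO "/output/sample.csv"
USING Outputters.Csv();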

If your data does contain ANSI-range characters, however, you have to BCP out as UTF-16 (I don't think BCP supports UTF-8) or convert the BCP output to UTF-8.
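If you go the UTF-16 route, the built-in extractor can read it directly by specifying Encoding.Unicode. This is again a sketch with the same placeholder schema, assuming the export is little-endian UTF-16 (use Encoding.BigEndianUnicode otherwise):

@rows =
    EXTRACT Id Guid,
            EntryDate DateTime,
            Processed bool
    FROM "/input/sample_utf16.csv"
    USING Extractors.Csv(encoding: Encoding.Unicode);  // UTF-16 (little-endian) input

If you instead convert the BCP output to UTF-8, the default Extractors.Csv() works without the encoding parameter.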

Note that if the file is larger than 250 MB, we currently have a bug around record boundary detection when uploading the file if it is UTF-16 encoded. Until we have that bug fixed, I suggest you upload the file with UTF-8 encoding in that case.

Also, if you need full ANSI code page support, please vote for the UserVoice item at https://feedback.azure.com/forums/327234-data-lake/suggestions/13077555-add-ansi-code-page-support-for-built-in-extractors and specify the code page you need supported (e.g., Windows-1254 or ISO-Latin-1).

Michael Rys
  • How does this handle situations where the file is UTF-8, but there is still a Latin-1 character in a record somewhere? U-SQL is still going to fail it for one record out of 1 billion. – Fastidious Mar 15 '18 at 20:26
  • Right now the built-in extractors will fail. You can write a custom extractor that handles it if you so choose. Also, feel free to request an option to ignore such a value in some form (please specify) at http://aka.ms/adlfeedback – Michael Rys Mar 19 '18 at 19:05
  • Yeah, writing a custom extractor is not a good option. I will certainly suggest something similar to how PolyBase handles it, which is just rejecting the row and/or converting it. Right now, I believe that if the file is zipped, PolyBase will extract and re-encode it. In this case, we have to re-encode it to get the U-SQL job to fire. – Fastidious Mar 21 '18 at 01:59