When running my extract, I got this error:

Found invalid character-encoding for UTF-8 encoding in input. The input file may contain corrupted data, or the specified input encoding in the extractor does not match the actual file encoding. See the DETAILS section for a hexadecimal dump of the file segment containing the invalid character-encoding.

I am not able to read UTF-8 character data with the U-SQL script below.

  @cgadmdomain =
    EXTRACT row_id string,
            orgarea_name string,
            last_changed_time string,
            start_date string,
            stop_date string,
            domain_name string,
            gui_description string,
            media string,
            direction string,
            distribution string,
            threshold1 string,
            threshold2 string
    FROM @cgadmdomainInPath
    USING Extractors.Text(delimiter: ';');

The file contains the value "Test Kö CB" in the media column. If I remove this record, the script runs fine. Please let me know if I need to add anything to the extractor parameters.

SHR
Hari CR

1 Answer

Are you sure that the file is encoded in UTF-8 and not some other encoding? What byte sequence do you see if you open the file with a byte-level (hex) editor?

Depending on that, you may have to set it to the appropriate Windows-125x encoding or Unicode.
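
To make the byte-level check concrete, here is a minimal Python sketch (not part of the original answer). It assumes the file might be Windows-1252 and uses the "Test Kö CB" value from the question; the helper names `hex_dump` and `is_valid_utf8` are made up for illustration. It shows which bytes to look for around the "ö" in each encoding, and why a Windows-1252 file would be rejected as UTF-8.

```python
# The value reported in the question's media column ("\u00f6" is "ö").
SAMPLE = "Test K\u00f6 CB"


def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = SAMPLE.encode("utf-8")     # "ö" becomes the two bytes C3 B6
cp1252_bytes = SAMPLE.encode("cp1252")  # "ö" becomes the single byte F6

print("UTF-8:       ", hex_dump(utf8_bytes))
print("Windows-1252:", hex_dump(cp1252_bytes))
print("Windows-1252 bytes valid as UTF-8?", is_valid_utf8(cp1252_bytes))
```

To check a real file, you could read it with `open(path, "rb").read()` (where `path` is a placeholder for a local copy) and pass the bytes to the same two helpers. A lone `F6` where the "ö" should be would suggest a Windows-125x code page rather than UTF-8.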

If your data is, for example, encoded with Windows-1252, you can extract it with the following statement (note that we currently support only the Windows-125x encodings in addition to the Unicode encodings):

  @data = 
    EXTRACT ...
    FROM ... 
    USING Extractors.Csv(encoding:System.Text.Encoding.GetEncoding("Windows-1252"));
Michael Rys
  • The sample data is copied from Blob storage to Azure Data Lake Store, and during the copy activity it is automatically encoded as UTF-8. When the U-SQL activity runs, the input is automatically read as UTF-8. – Hari CR Apr 10 '18 at 06:44
  • Something must have gone wrong with the automatic encoding. Feel free to send me the document to check (if that is possible) or use a tool like Notepad++ to further investigate what the actual encoding is. – Michael Rys Apr 10 '18 at 16:08
  • Thanks! Your response was very helpful. I used Notepad++ to check the encoding, and the input files are encoded as ANSI. Is there a way to extract ANSI-encoded files? – Hari CR Apr 11 '18 at 07:04