
When I run:

datasetTabular = Dataset.get_by_name(ws, "<Redacted>")
datasetTabular.to_pandas_dataframe()

The following error is returned. What can I do to get past this?

ExecutionError                            Traceback (most recent call last) File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:101, in _try_execute(action, operation, dataset_info, **kwargs)
    100     else:
--> 101         return action()
    102 except Exception as e:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\tabular_dataset.py:169, in TabularDataset.to_pandas_dataframe.<locals>.<lambda>()
    168 dataflow = get_dataflow_for_execution(self._dataflow, 'to_pandas_dataframe', 'TabularDataset')
--> 169 df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
    170                                                        out_of_range_datetime=out_of_range_datetime),
    171                   'to_pandas_dataframe',
    172                   None if self.id is None else {'id': self.id, 'name': self.name, 'version': self.version})
    173 fine_grain_timestamp = self._properties.get(_DATASET_PROP_TIMESTAMP_FINE, None)

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_loggerfactory.py:213, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    212 try:
--> 213     return func(*args, **kwargs)
    214 except Exception as e:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\dataflow.py:697, in Dataflow.to_pandas_dataframe(self, extended_types, nulls_as_nan, on_error, out_of_range_datetime)
    696 with tracer.start_as_current_span('Dataflow.to_pandas_dataframe', trace.get_current_span()) as span:
--> 697     return get_dataframe_reader().to_pandas_dataframe(self,
    698                                                       extended_types,
    699                                                       nulls_as_nan,
    700                                                       on_error,
    701                                                       out_of_range_datetime,
    702                                                       to_dprep_span_context(span.get_context()))

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_dataframereader.py:386, in _DataFrameReader.to_pandas_dataframe(self, dataflow, extended_types, nulls_as_nan, on_error, out_of_range_datetime, span_context)
    384     if have_pyarrow() and not extended_types and not inconsistent_schema:
    385         # if arrow is supported, and we didn't get inconsistent schema, and extended typed were not asked for - fallback to feather
--> 386         return clex_feather_to_pandas()
    387 except _InconsistentSchemaError as e:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_dataframereader.py:298, in
_DataFrameReader.to_pandas_dataframe.<locals>.clex_feather_to_pandas()
    297 activity_data = dataflow_to_execute._dataflow_to_anonymous_activity_data(dataflow_to_execute)
--> 298 dataflow._engine_api.execute_anonymous_activity(
    299     ExecuteAnonymousActivityMessageArguments(anonymous_activity=activity_data, span_context=span_context))
    301 try:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\_aml_helper.py:38, in update_aml_env_vars.<locals>.decorator.<locals>.wrapper(op_code, message, cancellation_token)
     37     engine_api_func().update_environment_variable(changed)
---> 38 return send_message_func(op_code, message, cancellation_token)

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\engineapi\api.py:160, in EngineAPI.execute_anonymous_activity(self, message_args, cancellation_token)
    158 @update_aml_env_vars(get_engine_api)
    159 def execute_anonymous_activity(self, message_args: typedefinitions.ExecuteAnonymousActivityMessageArguments, cancellation_token: CancellationToken = None) -> None:
--> 160     response = self._message_channel.send_message('Engine.ExecuteActivity', message_args, cancellation_token)
    161     return response

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\engineapi\engine.py:291, in MultiThreadMessageChannel.send_message(self, op_code, message, cancellation_token)
    290     cancel_on_error()
--> 291     raise_engine_error(response['error'])
    292 else:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\dataprep\api\errorhandlers.py:10, in raise_engine_error(error_response)
      9 if 'ScriptExecution' in error_code:
---> 10     raise ExecutionError(error_response)
     11 if 'Validation' in error_code:

ExecutionError:  Error Code: ScriptExecution.StreamAccess.Validation Validation Error Code: InvalidEncoding Validation Target: TextFile Failed Step: 78059bb0-278f-4c7f-9c21-01a0cccf7b96 Error Message: ScriptExecutionException was caused by StreamAccessException.   StreamAccessException was caused by ValidationException.
    Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: Unable to translate bytes [8B] at index 1 from specified code page to Unicode.
      Unable to translate bytes [8B] at index 1 from specified code page to Unicode. | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c

During handling of the above exception, another exception occurred:

UserErrorException                        Traceback (most recent call last) Input In [34], in <module>
      1 # preview the first 3 rows of the dataset
      2 #datasetTabular.take(3)
----> 3 datasetTabular.take(3).to_pandas_dataframe()

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\_loggerfactory.py:132, in track.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    130 with _LoggerFactory.track_activity(logger, func.__name__, activity_type, custom_dimensions) as al:
    131     try:
--> 132         return func(*args, **kwargs)
    133     except Exception as e:
    134         if hasattr(al, 'activity_info') and hasattr(e, 'error_code'):

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\tabular_dataset.py:169, in TabularDataset.to_pandas_dataframe(self, on_error, out_of_range_datetime)
    158 """Load all records from the dataset into a pandas DataFrame.
    159 
    160 :param on_error: How to handle any error values in the dataset, such as those produced by an error while    (...)
    166 :rtype: pandas.DataFrame
    167 """
    168 dataflow = get_dataflow_for_execution(self._dataflow, 'to_pandas_dataframe', 'TabularDataset')
--> 169 df = _try_execute(lambda: dataflow.to_pandas_dataframe(on_error=on_error,
    170                                                        out_of_range_datetime=out_of_range_datetime),
    171                   'to_pandas_dataframe',
    172                   None if self.id is None else {'id': self.id, 'name': self.name, 'version': self.version})
    173 fine_grain_timestamp = self._properties.get(_DATASET_PROP_TIMESTAMP_FINE, None)
    175 if fine_grain_timestamp is not None and df.empty is False:

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:104, in _try_execute(action, operation, dataset_info, **kwargs)
    102 except Exception as e:
    103     message, is_dprep_exception = _construct_message_and_check_exception_type(e, dataset_info, operation)
--> 104     _dataprep_error_handler(e, message, is_dprep_exception)

File C:\ProgramData\Anaconda3_2\envs\amlds\lib\site-packages\azureml\data\dataset_error_handling.py:154, in _dataprep_error_handler(e, message, is_dprep_exception)
    152     for item in user_exception_list:
    153         if _contains(item, getattr(e, 'error_code', 'Unexpected')):
--> 154             raise UserErrorException(message, inner_exception=e)
    156 raise AzureMLException(message, inner_exception=e)

UserErrorException: UserErrorException:     Message: Execution failed with error message: ScriptExecutionException was caused by StreamAccessException.   StreamAccessException was caused by ValidationException.
    Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: [REDACTED]
      Failed due to inner exception of type: DecoderFallbackException | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c ErrorCode: ScriptExecution.StreamAccess.Validation  InnerException  Error Code: ScriptExecution.StreamAccess.Validation Validation Error Code: InvalidEncoding Validation Target: TextFile Failed Step: 78059bb0-278f-4c7f-9c21-01a0cccf7b96 Error Message: ScriptExecutionException was caused by StreamAccessException.   StreamAccessException was caused by ValidationException.
    Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: Unable to translate bytes [8B] at index 1 from specified code page to Unicode.
      Unable to translate bytes [8B] at index 1 from specified code page to Unicode. | session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c  ErrorResponse  {
    "error": {
        "code": "UserError",
        "message": "Execution failed with error message: ScriptExecutionException was caused by StreamAccessException.\r\n  StreamAccessException was caused by ValidationException.\r\n    Unable to read file using Unicode (UTF-8). Attempted read range 0:777. Lines read in the range 0. Decoding error: [REDACTED]\r\n      Failed due to inner exception of type: DecoderFallbackException\r\n| session_id=295acf7e-4af9-42f1-b04a-79f3c5a0f98c ErrorCode: ScriptExecution.StreamAccess.Validation"
    } }

Susan
  • You can refer to [Error when using to_pandas_dataframe method on input datatsets ot a Run](https://github.com/Azure/MachineLearningNotebooks/issues/1436), and [TabularDataset Class](https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py#to-pandas-dataframe-on-error--null---out-of-range-datetime--null--) – Ecstasy Apr 19 '22 at 04:25

2 Answers


This kind of error usually happens when the input file is not in an encoding the reader supports.

Unable to read file using Unicode (UTF-8) -> this is the key point in the error

str_value = raw_data.decode('utf-8')

Using a code block like the one above, convert the raw input to a UTF-8 string and then perform the operation.
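One hedged observation, in case it helps with the decode step: the error reports byte `[8B]` at index 1, and files that begin with the two bytes `1F 8B` are gzip-compressed, so the file may need decompressing before it can be decoded as UTF-8. A minimal sketch (it assumes you can get at the raw bytes, e.g. by downloading via a FileDataset; nothing here is specific to your dataset):

```python
import gzip

def load_text(raw_data: bytes) -> str:
    # Files starting with the gzip magic number 1F 8B are compressed;
    # decompress them first, then decode the result as UTF-8.
    if raw_data[:2] == b"\x1f\x8b":
        raw_data = gzip.decompress(raw_data)
    return raw_data.decode("utf-8")
```

If the file turns out not to be compressed but simply in a different encoding, trying `raw_data.decode("utf-8-sig")` or `raw_data.decode("latin-1")` as fallbacks may also be worth a look.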

Sairam Tadepalli
  • Thanks for your reply. How do I get the raw data out of the TabularDataset? – Susan Apr 23 '22 at 01:30
  • By default, the data we extract from the data source using a from_* command is the raw data. EDA helps us refine the data, and we call that pre-processing. A TabularDataset is formed from CSV, TSV, SQL and Parquet files; in general we call those the raw data. In the code above, raw_data is the input variable holding the dataset from the source. – Sairam Tadepalli Apr 23 '22 at 11:53
  • I'm currently using a solution involving FileDataset. I'd be interested to know if there's a solution that doesn't involve downloading the files first though. I don't see from_* available as an exposed method on Dataset or TabularDataset. If there's a way to do this using non-internal methods, would you mind sharing the code? If not, I'll keep doing what I'm doing. – Susan Apr 25 '22 at 20:39
  • Can you please specify the extension of the dataset you are using? – Sairam Tadepalli Apr 26 '22 at 05:04
  • The extension of the files in the set is .json. – Susan May 14 '22 at 18:50
  • Specifically, each file can contain a list of JSON blobs, one per line. – Susan May 16 '22 at 21:46
  • Please check the linked page, which contains some examples of the conversion. https://thispointer.com/convert-json-to-a-pandas-dataframe/ – Sairam Tadepalli May 18 '22 at 00:26
  • Thank you, but I'm able to load the data with a FileDataset. That requires downloading the data first. What I was hoping to do was load the data from a TabularDataset, so I don't first have to download all the files. I have a workaround, but I was hoping to save on COGS and processing time. If that can't be done, I'll just live with the workaround. Either way, thank you so much for all your help. – Susan May 19 '22 at 02:37

Since you're working with a collection of .json files, I'd suggest using a FileDataset (if you want to work with the JSON files directly), as you're currently doing.

If you'd prefer working with the data in tabular form, then I'd suggest doing some preprocessing to flatten the JSON files into a pandas dataframe before saving it as a dataset on AzureML. Then use the register_pandas_dataframe method from the DatasetFactory class to save this dataframe. This will ensure that when you fetch the Dataset from Azure, the to_pandas_dataframe() method will work. Just be aware that some datatypes, such as numpy arrays, are not supported when using the register_pandas_dataframe() method.
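A minimal sketch of the flattening step described above (the file name, datastore, and dataset name in the comments are illustrative placeholders, not taken from your setup):

```python
import json

def flatten_json_lines(lines) -> list:
    # Each non-empty line is one JSON object ("JSON lines" layout),
    # so the whole stream becomes a list of dicts suitable for a DataFrame.
    return [json.loads(line) for line in lines if line.strip()]

# Then, roughly, with pandas and azureml-core (hypothetical names):
# import pandas as pd
# from azureml.core import Dataset
# df = pd.DataFrame(flatten_json_lines(open("part-0.jsonl")))
# Dataset.Tabular.register_pandas_dataframe(df, datastore, "my-flattened-dataset")
```

Flattening up front like this means AzureML only ever sees a clean tabular frame, rather than being asked to infer structure from the raw JSON.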

The issue with creating a tabular dataset from JSON files and then converting it to a pandas dataframe once you've begun working with it (in a run or notebook) is that you're expecting Azure to handle the flattening/processing.

Alternatively, you can also look at the from_json_lines_files method on Dataset.Tabular, since it might suit your use case better.

byronV999