Problem with non-utf8 column preparations : Found unknown categories ['FÃ¨s-MeknÃ¨s'] during transform

Question

I tried to prepare input and output data for a feature selection problem, but ran into a problem with certain columns that do not seem to be Unicode:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-89-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-86-e63e5d5fad63> in prepare_inputs(X_train, X_test)
      3     oe.fit(X_train)
      4     X_train_enc = oe.transform(X_train)
----> 5     X_test_enc = oe.transform(X_test)
      6     return X_train_enc, X_test_enc
      7 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in transform(self, X)
    812 
    813         """
--> 814         X_int, _ = self._transform(X)
    815         return X_int.astype(self.dtype, copy=False)
    816 

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py in _transform(self, X, handle_unknown)
    105                     msg = ("Found unknown categories {0} in column {1}"
    106                            " during transform".format(diff, i))
--> 107                     raise ValueError(msg)
    108                 else:
    109                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['FÃ¨s-MeknÃ¨s'] in column 4 during transform

Here's an excerpt from the columns:

    Do you agree    Gender  Age     City          Urban/Rural  Output
0   Yes             Female  25-34   Madrid        Urban        Will buy
1   No              Male    18-25   FÃ¨s-MeknÃ¨s  Rural        Won't
2   ...             ...     ...     ...           ...          Undecided
....

FÃ¨s-MeknÃ¨s should be Fès-Meknès.

Here is the code I use to load the data:

import pandas as pd
import psycopg2

def load_dataset():
    connection = psycopg2.connect(user="user",
                                  password="passwd",
                                  host="host",
                                  port="5432",
                                  database="database")

    sql = "select * from capi limit 10;"
    # load the table
    df = pd.read_sql_query(sql, connection)
    # retrieve numpy array
    dataset = df.values

    # split into input (X) and output (y) variables
    cols = df.iloc[:, 5:].columns.array
    filtered_cols = ['TL_Segment']
    cols = [col for col in cols if col not in filtered_cols]

    X = df.loc[:, cols]   # independent columns
    X = X.astype(str)
    y = df['TL_Segment']  # target column
    return X.values, y.values

Using the right encoding by running: `print conn.encoding`

I tried adding `connection.set_client_encoding('UTF8')` before querying, but I still have the same issue.

Not taking into account badly encoded rows

I tried skipping these rows with try/except:

def prepare_inputs(X_train, X_test):
    oe = OrdinalEncoder()
    oe.fit(X_train)
    try:
        X_train_enc = oe.transform(X_train)
        try:  # nested so that at least one of the two returned values is set
            X_test_enc = oe.transform(X_test)
        except ValueError as e:
            print(e)
    except ValueError as e:
        print(e)
    return X_train_enc, X_test_enc

But I still get the following:

Found unknown categories ['FÃ¨s-MeknÃ¨s'] in column 4 during transform

---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-126-78f2cf157d88> in <module>
      1 # prepare input data
----> 2 X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
      3 # prepare output
      4 y_train_enc, y_test_enc = prepare_targets(y_train, y_test)

<ipython-input-124-2376647ab46e> in prepare_inputs(X_train, X_test)
     10     except ValueError as e:
     11         print(e)
---> 12     return X_train_enc, X_test_enc
     13 

UnboundLocalError: local variable 'X_test_enc' referenced before assignment
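
The `UnboundLocalError` follows from the control flow: `oe.transform(X_test)` raises before `X_test_enc` is assigned, the `except` branch only prints, and the `return` then reads a name that was never bound. The following is a minimal sketch of a variant in which both names are always bound. `toy_transform`, `encode_or_none` and `prepare_inputs_safe` are hypothetical names, and `toy_transform` is only a stand-in for the encoder, not scikit-learn's API. Note that this only silences the failure; it does not repair the encoding mismatch itself.

```python
def toy_transform(rows, known=("Madrid", "Fès-Meknès")):
    """Hypothetical stand-in for an encoder: raises ValueError on unseen categories."""
    unknown = [r for r in rows if r not in known]
    if unknown:
        raise ValueError(f"Found unknown categories {unknown} during transform")
    return [known.index(r) for r in rows]


def encode_or_none(rows):
    """Return the encoded rows, or None when the transform rejects them."""
    try:
        return toy_transform(rows)
    except ValueError as exc:
        print(exc)
        return None


def prepare_inputs_safe(train_rows, test_rows):
    # Both values are bound on every path, so the return cannot raise UnboundLocalError.
    return encode_or_none(train_rows), encode_or_none(test_rows)


# The test row carries the garbled spelling, so it is rejected and comes back as None.
train_enc, test_enc = prepare_inputs_safe(["Madrid"], ["F\u00c3\u00a8s-Mekn\u00c3\u00a8s"])
print(train_enc, test_enc)
```

The caller still has to decide what to do with a `None` result, which is why fixing the encoding at the source is the more useful route.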

On the contrary, it looks like the UTF8 file was loaded as if it were Latin1. `Ã¨` is how the two bytes used to represent `è` in UTF8 appear when each byte is read as an individual character. Whatever method you use to read the file, you need to specify the encoding, eg `encoding='utf-8'`. Post your code — Panagiotis Kanavos, Feb 05 '20 at 11:02
`Ã` is a giveaway - that's how the first byte in a UTF8 byte sequence for characters in the Latin1 range appears when it is read as Latin1 — Panagiotis Kanavos, Feb 05 '20 at 11:04
Which database are you using? What does the connection string look like? For some reason, some fields are loaded as non-Unicode text even though the contents are UTF8. This could mean a connection string setting is missing, or someone tried to store Unicode in an ASCII field, by using `varchar` instead of `nvarchar` for the field type — Panagiotis Kanavos, Feb 05 '20 at 11:14
@PanagiotisKanavos I've added the way I connect to my database, in case you think the connection string is missing parameters — Revolucion for Monica, Feb 05 '20 at 12:20
If you google for `psycopg2 UTF8` you'll see that psycopg2 doesn't use UTF8 by default, eg [Python psycopg2 not in utf-8](https://stackoverflow.com/questions/43583285/python-psycopg2-not-in-utf-8). Try `connection.set_client_encoding('UTF8')` before querying — Panagiotis Kanavos, Feb 05 '20 at 12:25
@PanagiotisKanavos Thank you for your insight! Just tried, but I still have the same error — Revolucion for Monica, Feb 05 '20 at 12:29
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/207281/discussion-between-iggypass-and-panagiotis-kanavos). — Revolucion for Monica, Feb 05 '20 at 14:02
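
The mechanism described in the comments above (UTF8 bytes read as Latin1) can be reproduced with the standard library alone. This is a minimal sketch, assuming the stored text really is UTF8 and was decoded as Latin1 somewhere on the way in; `repair_mojibake` is a hypothetical helper name, not part of psycopg2 or pandas.

```python
original = "Fès-Meknès"

# UTF-8 encodes "è" as two bytes (0xC3 0xA8); Latin-1 maps each byte to its own character.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)


def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the text unchanged if it does not fit."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake(garbled))
```

If only part of the data is garbled, a round trip like this could be applied per value before encoding the categories, although correcting the encoding where the data is read or stored is the cleaner fix.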
