I'm building a recommender system using AWS Personalize. The User-Personalization recipe takes three dataset inputs: interactions, user_metadata and item_metadata. I am having trouble importing user metadata that contains a boolean field. I created the following schema:

user_schema = {
  "type": "record",
  "name": "Users",
  "namespace": "com.amazonaws.personalize.schema",
  "fields": [
      {
          "name": "USER_ID",
          "type": "string"
      },
      {
          "name": "type",
          "type": [
            "null",
            "string"
          ],
          "categorical": True
      },
      {
          "name": "lang",
          "type": [
            "null",
            "string"
          ],
          "categorical": True
      },
      {
          "name": "is_active",
          "type": "boolean"
      }
  ],
  "version": "1.0"
}

The dataset CSV file content looks like this:

USER_ID,type,lang,is_active
1234@gmail.com ,,geo,True
01027061015@mail.ru ,facebook,eng,True
03dadahda@gmail.com ,facebook,geo,True
040168fadw@gmail.com ,facebook,geo,False

I uploaded this CSV file to an S3 bucket. When I try to create a dataset import job, it gives me the following exception:

InvalidInputException: An error occurred (InvalidInputException) when calling the CreateDatasetImportJob operation: Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.

I tested it, and the import works without the boolean field is_active. There are no NaN values in that column!

It would be nice to be able to test directly whether a pandas DataFrame or CSV file conforms to a given schema, and possibly get a more detailed error message.

Does anybody know how to format a boolean field to fix this issue?

1 Answer

I found a solution after many trials. The AWS Personalize documentation (https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html#dataset-requirements) says of the boolean type: values true and false must be lower case in your data.

I then tried several things, and one of them worked. It was still a hard way to find a solution, and it cost me hours.

Solution:

  1. Convert the column in the pandas DataFrame to string (object) format.
  2. Lowercase the True and False string values to get true and false.
  3. Store the pandas DataFrame as a CSV file.
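The three steps above can be sketched in pandas roughly as follows. This is a minimal illustration, assuming is_active was loaded as a pandas bool column; the sample rows and the in-memory buffer are stand-ins for your real DataFrame and output file.

```python
import io

import pandas as pd

# Stand-in data shaped like the question's file (illustrative rows only).
df = pd.DataFrame(
    {
        "USER_ID": ["1234@gmail.com", "040168fadw@gmail.com"],
        "type": [None, "facebook"],
        "lang": ["geo", "geo"],
        "is_active": [True, False],
    }
)

# Steps 1 and 2: cast the bool column to strings ("True"/"False"),
# then lowercase them to "true"/"false".
df["is_active"] = df["is_active"].astype(str).str.lower()

# Step 3: write the CSV without the pandas index column.
# A StringIO buffer is used here; pass a file path instead to write to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that astype(str) would also turn missing values into the string "nan", so this sketch assumes the column has no missing values, as in the question.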

This results in lowercase true and false values:

USER_ID,type,lang,is_active
1234@gmail.com ,,geo,true
01027061015@mail.ru ,facebook,eng,true
03dadahda@gmail.com ,facebook,geo,true
040168fadw@gmail.com ,facebook,geo,false

That's all! There is no need to change the "boolean" type in the schema to "string"! Hopefully they'll fix this soon, since I have contacted AWS technical support about the same issue.
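As for checking a file locally before uploading: I don't know of a built-in way, but a rough pre-check can be sketched with the standard library. This is not the validation Personalize itself runs; find_bad_rows is a hypothetical helper that only flags blank values in non-nullable fields and boolean values that are not exactly true or false.

```python
import csv
import io


def find_bad_rows(csv_text, schema):
    """Return (line_number, field_name, value) for values that look invalid.

    Hypothetical helper: covers only blank required fields and
    boolean casing, not the full set of Personalize rules.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for field in schema["fields"]:
            name, ftype = field["name"], field["type"]
            value = row.get(name)
            # A union type such as ["null", "string"] allows blanks.
            nullable = isinstance(ftype, list) and "null" in ftype
            if value is None or value == "":
                if not nullable:
                    problems.append((line_no, name, value))
            elif ftype == "boolean" and value not in ("true", "false"):
                problems.append((line_no, name, value))
    return problems


# Small illustrative schema and data (not the full schema from the question).
schema = {
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "type", "type": ["null", "string"]},
        {"name": "is_active", "type": "boolean"},
    ]
}
sample = (
    "USER_ID,type,is_active\n"
    "a@example.com,,True\n"
    "b@example.com,facebook,false\n"
)
print(find_bad_rows(sample, schema))  # [(2, 'is_active', 'True')]
```

A check like this would have pointed straight at the capitalized True values instead of the generic InvalidInputException.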