20

I have this CSV file:

reference,address
V7T452F4H9,"12410 W 62TH ST, AA D"

The following options are being used in the table definition

ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ( 
  'quoteChar'='\"', 
  'separatorChar'=',') 

but it still won't recognize the double quotes in the data, and that comma in the double quote fiel is messing up the data. When I run the Athena query, the result looks like this

reference     address
V7T452F4H9    "12410 W 62TH ST

How do I fix this issue?

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
ln9187
  • 730
  • 1
  • 7
  • 23
  • In case it saves someone else a bunch of searching, this is how I fixed it for my issue: https://stackoverflow.com/questions/57498330/specify-a-serde-serialization-lib-with-aws-glue-crawler Including the first comment to delete fields before re-running – Mike Karp Jul 11 '22 at 02:05

2 Answers2

12

I do this to solve:

1 - Create a Crawler that doesn't overwrite the target table properties, I used boto3 for this but it can be created in AWS console too, by doing this (change the xxx-var):

import boto3

client = boto3.client('glue')

response = client.create_crawler(
    Name='xxx-Crawler-Name',
    Role='xxx-Put-here-your-rol',
    DatabaseName='xxx-databaseName',
    Description='xxx-Crawler description if u need it',
    Targets={
        'S3Targets': [
            {
                'Path': 's3://xxx-Path-to-s3/',
                'Exclusions': [
                ]
            },
        ]
    },
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG'
    },
    Configuration='{ \
        "Version": 1.0, \
        "CrawlerOutput": { \
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable" \
            }, \
            "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } \
        }  \
    }'
)

# run the crawler
response = client.start_crawler(
    Name='xxx-Crawler-Name'
)

2 - Edit the serialization lib, I do this in AWS Console like say this post (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv-quotes) just change this:

org.apache.hadoop.hive.serde2.OpenCSVSerde

3 - Run Crawler again. Run the crawler as always do:

Using AWS Console

4 - That's it, your 2nd run should not change any data in the table, it's just for testing that it's works ¯\(ツ)/¯.

SwissCodeMen
  • 4,222
  • 8
  • 24
  • 34
Daniel I. Cruz
  • 1,235
  • 1
  • 12
  • 13
11

Look like you also need to add escapeChar. AWS Athena docs shows this example:

CREATE EXTERNAL TABLE myopencsvtable (
   col1 string,
   col2 string,
   col3 string,
   col4 string
)
ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '\"',
   'escapeChar' = '\\'
   )
STORED AS TEXTFILE
LOCATION 's3://location/of/csv/';
botchniaque
  • 4,698
  • 3
  • 35
  • 63