4

I am trying to convert CSV files stored in Azure Data Lake Store into Avro files with a predefined schema. Is there any example source code for this purpose?

emkay
    Is the question still relevant? If so, can you provide more details? 1. How should the CSV be converted to Avro: should each field type be inferred somehow, or can you say all field types are numbers or strings? And do you want a field per CSV column, or an Avro array for each row? 2. Which language do you want to use? Is C OK for that? – Eliyahu Machluf Sep 12 '19 at 09:59
  • If you are looking to work with a pre-created schema and use it to convert `csv` files into `Avro`, I think Apache does offer libraries for it. – Amit Singh Sep 12 '19 at 14:24

3 Answers

4

You can use Azure Data Lake Analytics for this. There is a sample Avro extractor at https://github.com/Azure/usql/blob/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats/Avro/AvroExtractor.cs. You can easily adapt the code into an outputter.

Another possibility is to fire up an HDInsight cluster on top of your Data Lake Store and use Pig, Hive, or Spark.

3

That's actually pretty straightforward to do with Azure Data Factory and Blob Storage. It should also be very cheap, because ADF bills per second of execution, so you only pay for the conversion time. No infrastructure is required.

If your CSV looks like this:

ID,Name,Surname
1,Adam,Marczak
2,Tom,Kowalski
3,John,Johnson

Upload it to Blob Storage into an input container, then in Azure Data Factory:

  1. Add a linked service for Blob Storage and select your storage account.
  2. Add a dataset of blob type, set it to CSV format, and point it at the input file.
  3. Add another dataset of blob type and select the Avro format for the output.
  4. Add a pipeline and drag-and-drop a Copy Data activity onto it.
  5. In the activity's source, select your CSV input dataset; in the sink, select your target Avro dataset.
  6. Publish and trigger the pipeline.

Once the run succeeds, the Avro file appears in the output blob container, and inspecting it shows the converted data.
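Under the hood, the portal clicks produce JSON resources. As a rough sketch (the dataset and pipeline names here are made up, and the exact property set depends on your ADF version), the Copy activity looks something like:

```json
{
  "name": "CsvToAvroPipeline",
  "properties": {
    "activities": [
      {
        "name": "CopyCsvToAvro",
        "type": "Copy",
        "inputs": [ { "referenceName": "CsvInput", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "AvroOutput", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink": { "type": "AvroSink" }
        }
      }
    ]
  }
}
```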

Full source code on GitHub: https://github.com/MarczakIO/azure-datafactory-csv-to-avro

If you want to learn about Data Factory, check out the ADF introduction video: https://youtu.be/EpDkxTHAhOs

And if you want to dynamically pass input and output paths to the blob files, check out the video on parametrization in ADF: https://youtu.be/pISBgwrdxPM

Adam Marczak
2

Python is always your best friend. You can use this sample code to convert CSV to Avro.

Install these dependencies:

pip install fastavro
pip install pandas

Then execute the following Python script:

from fastavro import writer, parse_schema
import pandas as pd

# Read CSV; cast every column to string so the values
# match the all-string Avro schema defined below
df = pd.read_csv('sample.csv').astype(str)

# Define the Avro schema: one string field per CSV column
schema = {
    'doc': 'Documentation',
    'name': 'Subject',
    'namespace': 'test',
    'type': 'record',
    'fields': [{'name': c, 'type': 'string'} for c in df.columns]
}
parsed_schema = parse_schema(schema)

# Write the Avro file, one record per CSV row
with open('sample.avro', 'wb') as out:
    writer(out, parsed_schema, df.to_dict('records'))

input: sample.csv

col1,col2,col3
a,b,c
d,e,f
g,h,i

output: sample.avro

(binary Avro content; the file header embeds the schema: {"type": "record", "name": "test.Subject", "fields": [{"name": "col1", "type": "string"}, {"name": "col2", "type": "string"}, {"name": "col3", "type": "string"}]})