I am new to Avro, and to Avro.jl in particular. I am having difficulty writing a record and then appending a second record in a way that lets me read both back later. I also need to be able to read the files in Python as well as Julia.
I chose Avro because it is row-based and I need to append one row (that is, one record) at a time while a simulation is running. CSV won't work because my data doesn't have fixed columns.
--
UPDATE: I am inserting here a simplified version of the problem I am having.
The following code tries to write multiple records to one file and then read them back. In Julia it only reads the first record. In Python, I can read all of the records. I have tried using Avro.writetable without much success.
import Avro,StructTypes,Tables,JSON3,Base
import Random
District=Dict{String,Int}
Districting=Dict{String,District}
Base.@kwdef struct Map
    name::String
    districting::Districting
    desc::String=""
    numLevels::Int=1
end
#
#
districting=Districting()
d=District()
t=()
#
keys=["P1","P2","P3","C1","B1"]
io=open("map1Test.avro","w")
#
rng = Random.MersenneTwister(1234);
#
for i=1:4
    for j=1:3
        for k in keys
            d[k]=convert(Int,floor(100*Random.rand(rng)))
        end
        districting[string("D",j)]=d
    end
    t=(name=string("map",i),levels=1,districting=districting)
    print("write : ",t,"\n")
    Avro.write(io,t)
    write(io,"\n")
end
close(io)
#
print("\n\n")
#
asc=Avro.schematype(typeof(t))
tType=typeof(t)
JSON3.write(asc)
io=open("mapAutoGen.avsc","w")
print("avsc: ",JSON3.write(asc),"\n")
print("type: ",tType,"\n")
write(io,JSON3.write(asc))
close(io)
#
print("\n\n")
rec=Any[]
io=open("map1Test.avro","r")
while eof(io)==false
    m=Avro.read(io,tType)
    print("reading :",m,"\n")
end
#
close(io)
The output is:
write : (name = "map1", levels = 1, districting = Dict("D2" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D3" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D1" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6)))
write : (name = "map2", levels = 1, districting = Dict("D2" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31), "D3" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31), "D1" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31)))
write : (name = "map3", levels = 1, districting = Dict("D2" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36), "D3" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36), "D1" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36)))
write : (name = "map4", levels = 1, districting = Dict("D2" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66), "D3" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66), "D1" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66)))
avsc: {"type":"record","name":"Record_8558689697622909467","fields":[{"name":"name","type":{"type":"string"},"order":"ascending"},{"name":"levels","type":{"type":"long"},"order":"ascending"},{"name":"districting","type":{"type":"map","values":{"type":"map","values":{"type":"long"}}},"order":"ascending"}]}
type: NamedTuple{(:name, :levels, :districting), Tuple{String, Int64, Dict{String, Dict{String, Int64}}}}
reading :(name = "map1", levels = 1, districting = Dict("D2" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D3" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D1" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6)))
However the following python code reads the file and does what I want:
import fastavro
parsed_schema=fastavro.schema.load_schema("mapAutoGen.avsc")
fp=open('map1Test.avro', 'rb')
while True:
    try:
        record=fastavro.schemaless_reader(fp, parsed_schema)
        print(record)
    except:
        break
fp.close()
It produces the following output:
{'name': 'map1', 'levels': 1, 'districting': {'D2': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}, 'D3': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}, 'D1': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}}}
{'name': '\x08map2', 'levels': 1, 'districting': {'D2': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}, 'D3': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}, 'D1': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}}}
{'name': '\x08map3', 'levels': 1, 'districting': {'D2': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}, 'D3': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}, 'D1': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}}}
{'name': '\x08map4', 'levels': 1, 'districting': {'D2': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}, 'D3': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}, 'D1': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}}}
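A possible reading of the '\x08' prefix on the names above, sketched with the standard library only. Avro's binary encoding stores a string as a zigzag varint length followed by the UTF-8 bytes, so the newline byte that write(io,"\n") puts between records may be getting consumed as a length of 5, which would pull the real length byte (0x08, meaning 4) into the string. The helper names below are made up for illustration and are not part of fastavro or Avro.jl.

```python
def zigzag_decode_long(data: bytes, pos: int = 0):
    """Decode one Avro long (zigzag varint) starting at pos.

    Returns (value, position of the next unread byte).
    """
    shift = 0
    raw = 0
    while True:
        byte = data[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def read_string(data: bytes, pos: int = 0):
    """Decode one Avro string: a long length, then that many UTF-8 bytes."""
    length, pos = zigzag_decode_long(data, pos)
    return data[pos:pos + length].decode("utf-8"), pos + length


clean = b"\x08map2"            # length byte 0x08 (= 4), then "map2"
with_newline = b"\n" + clean   # same bytes preceded by a "\n" separator

print(read_string(clean))         # ('map2', 5)
print(read_string(with_newline))  # ('\x08map2', 6)
```

If this reading is right, the records are already self-delimiting and the extra newline only shifts the decoding of every record after the first.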
--
I have made a few experiments below which I hope will explain the issues I am having.
Here is the summary:
- Experiment 1: I can write and read a single record in Julia, but can't read it in Python.
- Experiment 2: I can write two records at once in Julia and read them back as a single record in Julia. Python doesn't read the file.
- Experiment 3: Write two records in Julia (one after another) and then read them in Julia (all at once). No luck in Python.
- Using writetable, write two records at once and read them in both Julia and Python. In Julia they are read all at once; in Python 3, one at a time.
- Using writetable, make two writes to a file. Julia fails to read anything. Python manages to read the first record but not the second.
In summary, I want to be able to write a record which is a dictionary (or a tuple containing a dictionary), then close the file, reopen it and write another record. Then I want to be able to read the file in Julia or Python.
Experiment 1
import Avro,StructTypes,Tables,JSON3,Base
display(smap)
display(smap2)
output:
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RUTHERFORD\",\"05A\",\"371619611022065\"]" => 7
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]" => 1
io=open("map-1.avro","w")
Avro.write(io,smap)
close(io)
io=open("map-1.avro","r")
while eof(io)==false
    m=Avro.read(io,typeof(smap2))
    display(Dict(m))
end
close(io)
output
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RUTHERFORD\",\"05A\",\"371619611022065\"]" => 7
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
My attempts to read the file into python fail.
sch2=fastavro.schema.load_schema("rec2.avsc")
with open('map-1.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch2)
    for record in avro_reader:
        print(record)
        print("---------\n")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-65-c1108c8faa22> in <module>
1 sch2=fastavro.schema.load_schema("rec2.avsc")
2 with open('map-1.avro', 'rb') as fo:
----> 3 avro_reader = fastavro.reader(fo,sch2)
4 for record in avro_reader:
5 print(record)
*output from python*
fastavro/_read.pyx in fastavro._read.reader.__init__()
fastavro/_read.pyx in fastavro._read.file_reader.__init__()
fastavro/_read.pyx in fastavro._read._read_data()
fastavro/_read.pyx in fastavro._read.read_record()
fastavro/_read.pyx in fastavro._read._read_data()
fastavro/_read.pyx in fastavro._read.read_map()
fastavro/_read.pyx in fastavro._read._read_data()
fastavro/_read.pyx in fastavro._read.read_bytes()
ValueError: read length must be non-negative or -1
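One thing that may be relevant to this traceback: as far as I understand, fastavro.reader expects an Avro object container file, which per the Avro specification starts with the four magic bytes 'O', 'b', 'j', 1, whereas a file holding only a bare binary-encoded record has no such header. Below is a minimal standard-library sketch for telling the two apart. The function name and the temporary stand-in files are hypothetical, and the payloads are placeholders rather than real Avro data.

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def looks_like_container_file(path):
    """Return True if the file starts with the Avro container magic bytes."""
    with open(path, "rb") as fp:
        return fp.read(len(AVRO_MAGIC)) == AVRO_MAGIC


def _write_temp(payload):
    """Write payload to a temporary .avro file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".avro")
    with os.fdopen(fd, "wb") as fp:
        fp.write(payload)
    return path


# Stand-ins: one file that begins with the magic bytes, one that does not.
container_like = _write_temp(AVRO_MAGIC + b"placeholder header and blocks")
bare_record = _write_temp(b"\x08map1\x02\x00")

print(looks_like_container_file(container_like))
print(looks_like_container_file(bare_record))
```

Running the same check on map-1.avro and map-table.avro would show which of the two kinds of file each write call produced.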
Experiment 2
io=open("map-3.avro","w")
d=Dict("map1"=> smap,"map2"=>smap2)
Avro.write(io,d)
close(io)
io=open("map-3.avro","r")
while eof(io)==false
    m=Avro.read(io,typeof(d))
    display(Dict(m))
end
close(io)
Output 2
Dict{String, Dict{String, Int64}} with 2 entries:
"map2" => Dict("[\"ROWAN\",\"32\",\"371590519022049\"]"=>4, "[\"RANDOLPH\",\"…
"map1" => Dict("[\"ROWAN\",\"32\",\"371590519022049\"]"=>4, "[\"RUTHERFORD\",…
Experiment 3
io=open("map-4.avro","w")
Avro.write(io,smap)
Avro.write(io,smap2)
close(io)
io=open("map-4.avro","r")
while eof(io)==false
    m=Avro.read(io,typeof(smap))
    display(Dict(m))
end
close(io)
output
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RUTHERFORD\",\"05A\",\"371619611022065\"]" => 7
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
"[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]" => 1
Experiment 4
s=Avro.parseschema("rec.avsc")
d=Dict("map1"=>smap,"map2"=>smap2)
io=open("map-table.avro","w")
it=Avro.writetable(io,d,sch=s)
close(io)
io=open("map-table.avro","r")
while eof(io)==false
    m=Avro.readtable(io)
    display(Dict(m)["map1"])
    display(Dict(m)["map2"])
end
close(io)
output
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RUTHERFORD\",\"05A\",\"371619611022065\"]" => 7
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
Dict{String, Int64} with 25 entries:
"[\"ROWAN\",\"32\",\"371590519022049\"]" => 4
"[\"RANDOLPH\",\"AN\",\"371510303011002\"]" => 2
"[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]" => 1
python code
with open('map-table.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch)
    for record in avro_reader:
        print(record)
        print("---------\n")
python output
{'first': 'map2', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RANDOLPH","AN","371510303011002"]': 2, '["GRANVILLE","TYHO","370779706012012"]': 1, '["GUILFORD","G66"]': 3, '["MONTGOMERY","T2"]': 4, '["CLEVELAND","POLKVL","370459501022038"]': 9, '["DURHAM","28"]': 11}}
---------
{'first': 'map1', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RUTHERFORD","05A","371619611022065"]': 7, '["RANDOLPH","AN","371510303011002"]': 2, '["CLEVELAND","POLKVL","370459501022046"]': 9, '["GRANVILLE","TYHO","370779706012012"]':13}}
---------
Experiment 5
print("Writing \n")
io=open("map-table2.avro","w")
d=Dict("map1"=>smap)
it=Avro.writetable(io,d,sch=s)
d=Dict("map2"=>smap2)
it=Avro.writetable(io,d,sch=s)
close(io)
print("reading \n")
io=open("map-table2.avro","r")
while eof(io)==false
    print("-----\n")
    m=Avro.readtable(io)
    display(Dict(m))
end
close(io)
output
Writing
reading
ArgumentError: invalid Array dimensions
Stacktrace:
[1] Array
@ ./boot.jl:448 [inlined]
[2] readwithschema(#unused#::Type{Avro.Record{(:first, :second), Tuple{String,
Dict{String, Int64}}, 2}}, sch::Avro.RecordType, buf::Vector{UInt8}, pos::Int64, comp::Nothing)
@ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:176
[3] readtable(buf::Vector{UInt8}, pos::Int64, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:166
[4] #readtable#33
@ ~/.julia/packages/Avro/JEoRa/src/tables.jl:156 [inlined]
[5] readtable(io::IOStream)
@ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:156
[6] top-level scope
@ ./In[69]:13
[7] eval
@ ./boot.jl:360 [inlined]
[8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
@ Base ./loading.jl:1094
Python Code
with open('map-table2.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch)
    for record in avro_reader:
        print(record)
        print("---------\n")
{'first': 'map1', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RUTHERFORD","05A","371619611022065"]': 7, '["RANDOLPH","AN","371510303011002"]': 2, '["CLEVELAND","POLKVL","370459501022046"]': 9}
---------
Here are the two schema files:
rec2.avsc
{"type": "record",
"name": "Record_366361102733404708",
"fields": [
{"type": "map", "values": {"type": "long"},
"order": "ascending"}]}
rec.avsc
{"type": "record","name": "Record_366361102733404708","fields": [{"name": "first", "type": {"type": "string"}, "order": "ascending"}, {"name": "second", "type": {"type": "map", "values": {"type": "long"}}, "order": "ascending"}]}
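A small check on the two schemas above, in case it matters: the Avro specification lists "name" and "type" as required attributes of each record field, and the single field in rec2.avsc has no "name". The standard-library sketch below reports which fields lack which keys. The function name is hypothetical, and the schema texts are copied from the two files shown above.

```python
import json

REC2_AVSC = """
{"type": "record",
 "name": "Record_366361102733404708",
 "fields": [
   {"type": "map", "values": {"type": "long"},
    "order": "ascending"}]}
"""

REC_AVSC = """
{"type": "record", "name": "Record_366361102733404708", "fields": [
  {"name": "first", "type": {"type": "string"}, "order": "ascending"},
  {"name": "second", "type": {"type": "map", "values": {"type": "long"}},
   "order": "ascending"}]}
"""


def fields_missing_keys(schema_text, required=("name", "type")):
    """Return (field index, missing keys) pairs for a record schema."""
    schema = json.loads(schema_text)
    problems = []
    for index, field in enumerate(schema.get("fields", [])):
        missing = [key for key in required if key not in field]
        if missing:
            problems.append((index, missing))
    return problems


print(fields_missing_keys(REC2_AVSC))  # [(0, ['name'])]
print(fields_missing_keys(REC_AVSC))   # []
```

This only checks the top-level fields of a record schema; it does not validate nested types.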
Any help would be very welcome. I am willing to change the exact look of what I am writing if I am not following best practices.