
I am new to Avro, and to Avro.jl in particular. I am having difficulty writing a record and then appending a second record in a way that lets me read both back later. I also need to be able to read the files in Python as well as Julia.

I chose Avro because it is row-based and I need to append one row (that is, one record) at a time while a simulation is running. CSV won't work because my data doesn't have fixed columns.

--

UPDATE: I am inserting here a simplified version of the problem I am having.

The following code tries to write multiple records to one file and then read them back. In Julia it only reads the first record. In Python, I can read all of the records. I have also tried Avro.writetable, without much success.

import Avro,StructTypes,Tables,JSON3,Base
import Random
District=Dict{String,Int}
Districting=Dict{String,District}
Base.@kwdef struct Map
    name::String
    districting::Districting
    desc::String=""
    numLevels::Int=1
    
end
#
#
districting=Districting()
d=District()
t=()
#
keys=["P1","P2","P3","C1","B1"]
io=open("map1Test.avro","w")
#
rng = Random.MersenneTwister(1234);
#
for i=1:4
    for j=1:3
        for k in keys
            d[k]=convert(Int,floor(100*Random.rand(rng)))
        end
        districting[string("D",j)]=d
    end
    t=(name=string("map",i),levels=1,districting=districting)
    print("write : ",t,"\n")
    Avro.write(io,t)
    write(io,"\n")
end
close(io)
#
print("\n\n")
#
asc=Avro.schematype(typeof(t))
tType=typeof(t)
JSON3.write(asc)
io=open("mapAutoGen.avsc","w")
print("avsc: ",JSON3.write(asc),"\n")
print("type: ",tType,"\n")
write(io,JSON3.write(asc))
close(io)
#
print("\n\n")
rec=Any[]
io=open("map1Test.avro","r")
while eof(io)==false
    m=Avro.read(io,tType)
    print("reading :",m,"\n")
end
#
close(io)

The output is:

write : (name = "map1", levels = 1, districting = Dict("D2" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D3" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D1" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6)))
write : (name = "map2", levels = 1, districting = Dict("D2" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31), "D3" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31), "D1" => Dict("C1" => 12, "B1" => 37, "P2" => 9, "P1" => 3, "P3" => 31)))
write : (name = "map3", levels = 1, districting = Dict("D2" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36), "D3" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36), "D1" => Dict("C1" => 30, "B1" => 37, "P2" => 69, "P1" => 4, "P3" => 36)))
write : (name = "map4", levels = 1, districting = Dict("D2" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66), "D3" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66), "D1" => Dict("C1" => 75, "B1" => 3, "P2" => 61, "P1" => 28, "P3" => 66)))


avsc: {"type":"record","name":"Record_8558689697622909467","fields":[{"name":"name","type":{"type":"string"},"order":"ascending"},{"name":"levels","type":{"type":"long"},"order":"ascending"},{"name":"districting","type":{"type":"map","values":{"type":"map","values":{"type":"long"}}},"order":"ascending"}]}
type: NamedTuple{(:name, :levels, :districting), Tuple{String, Int64, Dict{String, Dict{String, Int64}}}}


reading :(name = "map1", levels = 1, districting = Dict("D2" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D3" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6), "D1" => Dict("C1" => 95, "B1" => 64, "P2" => 1, "P1" => 64, "P3" => 6)))

However, the following Python code reads the file and does what I want:

import fastavro
parsed_schema=fastavro.schema.load_schema("mapAutoGen.avsc")
fp=open('map1Test.avro', 'rb')
while True:
    try:
        record=fastavro.schemaless_reader(fp, parsed_schema)
        print(record)
    except:
        break

fp.close()  

It produces the following output:

{'name': 'map1', 'levels': 1, 'districting': {'D2': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}, 'D3': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}, 'D1': {'C1': 95, 'B1': 64, 'P2': 1, 'P1': 64, 'P3': 6}}}
{'name': '\x08map2', 'levels': 1, 'districting': {'D2': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}, 'D3': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}, 'D1': {'C1': 12, 'B1': 37, 'P2': 9, 'P1': 3, 'P3': 31}}}
{'name': '\x08map3', 'levels': 1, 'districting': {'D2': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}, 'D3': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}, 'D1': {'C1': 30, 'B1': 37, 'P2': 69, 'P1': 4, 'P3': 36}}}
{'name': '\x08map4', 'levels': 1, 'districting': {'D2': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}, 'D3': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}, 'D1': {'C1': 75, 'B1': 3, 'P2': 61, 'P1': 28, 'P3': 66}}}
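
One detail in the Python output above: every name after the first comes back as '\x08map2' rather than 'map2'. A possible explanation, sketched below using only the Python standard library, is the write(io,"\n") between records. In Avro's binary encoding a string is a zigzag varint length followed by its UTF-8 bytes, so the newline byte 0x0A would decode as length 5 and the real length byte 0x08 would be swallowed into the string. The helper names decode_long and decode_string are hypothetical, and the byte stream is a hand-built simplification that contains only the name fields.

```python
import io


def decode_long(buf):
    """Decode one Avro long: base-128 varint, then zigzag."""
    z, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        z |= (byte & 0x7F) << shift
        if byte < 0x80:          # high bit clear: last byte of the varint
            return (z >> 1) ^ -(z & 1)
        shift += 7


def decode_string(buf):
    """Decode one Avro string: long length, then that many UTF-8 bytes."""
    length = decode_long(buf)
    return buf.read(length).decode("utf-8")


# 0x08 is the zigzag encoding of 4, the length of "map1" and "map2".
# The b"\n" in the middle stands in for the separator written between records.
stream = io.BytesIO(b"\x08map1" + b"\n" + b"\x08map2")

first = decode_string(stream)    # 'map1'
second = decode_string(stream)   # '\x08map2': 0x0A decodes to length 5

print(repr(first), repr(second))
```

If this reading is right, dropping the newline write should make the Python side return clean names, since Avro datums are self-delimiting and need no separator.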

--

I have made a few experiments below which I hope will explain the issues I am having.

Here is the summary:

  1. Exp 1: I can write and read a single record in Julia, but can't read it in Python.
  2. Exp 2: I can write two records at once in Julia and read them back as a single record in Julia. Python doesn't read the file.
  3. Exp 3: I can write two records in Julia (one after another) and then read them in Julia (all at once). No luck in Python.
  4. Exp 4: Using writetable, I can write two records at once and read them in both Julia and Python. In Julia they are read all at once; in Python 3, one at a time.
  5. Exp 5: Using writetable, I make two writes to one file. Julia fails to read anything. Python manages to read the first record but not the second.

In summary, I want to be able to write a record which is a dictionary (or a tuple containing a dictionary), close the file, reopen it, and write another record. Then I want to be able to read the file in Julia or Python.
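
To make the goal concrete, here is a standard-library Python sketch of that append-then-read cycle at the byte level. It assumes the file holds bare Avro datums concatenated with no header and no separators, and it hand-rolls the encoding for the record layout from the update (name: string, levels: long, districting: map of map of long). All function names are hypothetical; this is not the Avro.jl or fastavro API.

```python
import io
import os
import tempfile


def encode_long(n):
    z = (n << 1) ^ (n >> 63)              # zigzag
    out = bytearray()
    while z > 0x7F:                       # base-128 varint, low group first
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf):
    z, shift = 0, 0
    while True:
        byte = buf.read(1)[0]
        z |= (byte & 0x7F) << shift
        if byte < 0x80:
            return (z >> 1) ^ -(z & 1)
        shift += 7


def encode_string(s):
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def decode_string(buf):
    return buf.read(decode_long(buf)).decode("utf-8")


def encode_map(m, encode_value):
    out = b""
    if m:                                 # one block holding every entry
        out += encode_long(len(m))
        for key, value in m.items():
            out += encode_string(key) + encode_value(value)
    return out + encode_long(0)           # a zero count ends the map


def decode_map(buf, decode_value):
    result = {}
    while True:
        count = decode_long(buf)
        if count == 0:
            return result
        if count < 0:                     # negative count: a byte size follows
            count = -count
            decode_long(buf)
        for _ in range(count):
            key = decode_string(buf)
            result[key] = decode_value(buf)


def encode_record(rec):
    # Record fields are written in schema order with no framing.
    return (encode_string(rec["name"]) + encode_long(rec["levels"])
            + encode_map(rec["districting"], lambda d: encode_map(d, encode_long)))


def decode_record(buf):
    return {
        "name": decode_string(buf),
        "levels": decode_long(buf),
        "districting": decode_map(buf, lambda b: decode_map(b, decode_long)),
    }


def append_record(path, rec):
    with open(path, "ab") as f:           # reopen in append mode for each record
        f.write(encode_record(rec))


def read_all(path):
    with open(path, "rb") as f:
        data = f.read()
    buf = io.BytesIO(data)
    records = []
    while buf.tell() < len(data):         # keep decoding until the bytes run out
        records.append(decode_record(buf))
    return records


path = os.path.join(tempfile.mkdtemp(), "maps.avro")
written = [
    {"name": f"map{i}", "levels": 1, "districting": {"D1": {"P1": 10 * i, "C1": -i}}}
    for i in range(1, 4)
]
for rec in written:
    append_record(path, rec)
recovered = read_all(path)
print(recovered == written)
```

The point of the sketch is that each datum is self-delimiting, so a reader that knows the schema can keep decoding until the file ends.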

Experiment 1

import Avro,StructTypes,Tables,JSON3,Base
display(smap)
display(smap2)

output:

Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RUTHERFORD\",\"05A\",\"371619611022065\"]"   => 7
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9

Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]"   => 1


io=open("map-1.avro","w")
Avro.write(io,smap)
close(io)
io=open("map-1.avro","r")
while eof(io)==false
  m=Avro.read(io,typeof(smap2))
  display(Dict(m))
end
close(io)

output

Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RUTHERFORD\",\"05A\",\"371619611022065\"]"   => 7
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9

My attempts to read the file into python fail.

sch2=fastavro.schema.load_schema("rec2.avsc")
with open('map-1.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch2)
    for record in avro_reader:
        print(record)
        print("---------\n")

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-c1108c8faa22> in <module>
      1 sch2=fastavro.schema.load_schema("rec2.avsc")
      2 with open('map-1.avro', 'rb') as fo:
----> 3     avro_reader = fastavro.reader(fo,sch2)
      4     for record in avro_reader:
      5         print(record)

*output from python*

fastavro/_read.pyx in fastavro._read.reader.__init__()

fastavro/_read.pyx in fastavro._read.file_reader.__init__()

fastavro/_read.pyx in fastavro._read._read_data()

fastavro/_read.pyx in fastavro._read.read_record()

fastavro/_read.pyx in fastavro._read._read_data()

fastavro/_read.pyx in fastavro._read.read_map()

fastavro/_read.pyx in fastavro._read._read_data()

fastavro/_read.pyx in fastavro._read.read_bytes()

ValueError: read length must be non-negative or -1
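
A possible reading of this error, offered as a guess rather than a diagnosis: fastavro.reader expects an Avro Object Container File, which starts with the four magic bytes Obj\x01 followed by a header that carries the schema. The file here appears to hold a bare datum instead, since the schemaless_reader call in the update works on a file written the same way. A quick standard-library check, using the hypothetical helper name is_avro_container and two throwaway demo files:

```python
import os
import tempfile

AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro Object Container File


def is_avro_container(path):
    """Return True if the file begins with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC


tmp = tempfile.mkdtemp()

# A bare map datum {"P1": 10}: count 1, key "P1", value 10, end-of-map marker.
raw_path = os.path.join(tmp, "raw.avro")
with open(raw_path, "wb") as f:
    f.write(b"\x02\x04P1\x14\x00")

# A stand-in for a container file: only the magic bytes matter for this check.
container_path = os.path.join(tmp, "container.avro")
with open(container_path, "wb") as f:
    f.write(AVRO_MAGIC + b"rest of header and data blocks would follow")

raw_is_container = is_avro_container(raw_path)
container_is_container = is_avro_container(container_path)
print(raw_is_container, container_is_container)
```

If map-1.avro fails this check, a schemaless read with a schema that matches the written datum would be the thing to try on the Python side.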

Experiment 2

io=open("map-3.avro","w")
d=Dict("map1"=> smap,"map2"=>smap2)
Avro.write(io,d)
close(io)
io=open("map-3.avro","r")
while eof(io)==false
   m=Avro.read(io,typeof(d))
   display(Dict(m))
end
close(io)

Output 2

 Dict{String, Dict{String, Int64}} with 2 entries:
  "map2" => Dict("[\"ROWAN\",\"32\",\"371590519022049\"]"=>4, "[\"RANDOLPH\",\"…
  "map1" => Dict("[\"ROWAN\",\"32\",\"371590519022049\"]"=>4, "[\"RUTHERFORD\",…

Experiment 3

io=open("map-4.avro","w")
Avro.write(io,smap)
Avro.write(io,smap2)
close(io)
io=open("map-4.avro","r")
while eof(io)==false
    m=Avro.read(io,typeof(smap))
    display(Dict(m))
end
close(io)

output

Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RUTHERFORD\",\"05A\",\"371619611022065\"]"   => 7
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
  "[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]"   => 1

Experiment 4

s=Avro.parseschema("rec.avsc")
d=Dict("map1"=>smap,"map2"=>smap2)
io=open("map-table.avro","w")
it=Avro.writetable(io,d,sch=s)
close(io)
io=open("map-table.avro","r")
while eof(io)==false
   m=Avro.readtable(io)
   display(Dict(m)["map1"])
   display(Dict(m)["map2"])
end
close(io)

output

Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RUTHERFORD\",\"05A\",\"371619611022065\"]"   => 7
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"CLEVELAND\",\"POLKVL\",\"370459501022046\"]" => 9
 
Dict{String, Int64} with 25 entries:
  "[\"ROWAN\",\"32\",\"371590519022049\"]"         => 4
  "[\"RANDOLPH\",\"AN\",\"371510303011002\"]"      => 2
  "[\"GRANVILLE\",\"TYHO\",\"370779706012012\"]"   => 1
 

python code

with open('map-table.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch)
    for record in avro_reader:
        print(record)
        print("---------\n")
    

python output

{'first': 'map2', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RANDOLPH","AN","371510303011002"]': 2, '["GRANVILLE","TYHO","370779706012012"]': 1, '["GUILFORD","G66"]': 3, '["MONTGOMERY","T2"]': 4, '["CLEVELAND","POLKVL","370459501022038"]': 9, '["DURHAM","28"]': 11}}
---------

{'first': 'map1', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RUTHERFORD","05A","371619611022065"]': 7, '["RANDOLPH","AN","371510303011002"]': 2, '["CLEVELAND","POLKVL","370459501022046"]': 9, '["GRANVILLE","TYHO","370779706012012"]':13}}
---------

Experiment 5

print("Writing \n")
io=open("map-table2.avro","w")
d=Dict("map1"=>smap)
it=Avro.writetable(io,d,sch=s)
d=Dict("map2"=>smap2)
it=Avro.writetable(io,d,sch=s)
close(io)
print("reading \n")
io=open("map-table2.avro","r")
while eof(io)==false   
     print("-----\n")
     m=Avro.readtable(io)
     display(Dict(m))
end
close(io)

output

Writing reading

ArgumentError: invalid Array dimensions

Stacktrace:
 [1] Array
   @ ./boot.jl:448 [inlined]
 [2] readwithschema(#unused#::Type{Avro.Record{(:first, :second), Tuple{String, Dict{String, Int64}}, 2}}, sch::Avro.RecordType, buf::Vector{UInt8}, pos::Int64, comp::Nothing)
   @ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:176
 [3] readtable(buf::Vector{UInt8}, pos::Int64, kw::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:166
 [4] #readtable#33
   @ ~/.julia/packages/Avro/JEoRa/src/tables.jl:156 [inlined]
 [5] readtable(io::IOStream)
   @ Avro ~/.julia/packages/Avro/JEoRa/src/tables.jl:156
 [6] top-level scope
   @ ./In[69]:13
 [7] eval
   @ ./boot.jl:360 [inlined]
 [8] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base ./loading.jl:1094

Python Code

with open('map-table2.avro', 'rb') as fo:
    avro_reader = fastavro.reader(fo,sch)
    for record in avro_reader:
        print(record)
        print("---------\n")

python output

{'first': 'map1', 'second': {'["ROWAN","32","371590519022049"]': 4, '["RUTHERFORD","05A","371619611022065"]': 7, '["RANDOLPH","AN","371510303011002"]': 2, '["CLEVELAND","POLKVL","370459501022046"]': 9}
---------

Here are the two schema files:

rec2.avsc

{"type": "record",
"name": "Record_366361102733404708",
"fields": [
 {"type": "map", "values": {"type": "long"},
  "order": "ascending"}]}

rec.avsc

{"type": "record","name": "Record_366361102733404708","fields": [{"name": "first",  "type": {"type": "string"},  "order": "ascending"}, {"name": "second",  "type": {"type": "map", "values": {"type": "long"}},   "order": "ascending"}]}

Any help would be very welcome. I am willing to change the exact look of what I am writing if I am not following best practices.

  • Hi Jonathan! I'd like to have a go at fixing this, but it seems very difficult without any data. Would you be able to share some of the data with me? Enough that I can reproduce your results. Cheers! – Mark Jun 18 '23 at 05:41
