I have a curl command whose output I want to load using BSON. For performance reasons, I want to read the curl output directly into memory, without saving it to a file. I also want to close curl as soon as possible, so I want to read all the data from curl first and only then pass it to BSON. We had problems when curl stayed open during parsing, because the download finished faster than the subsequent parsing.

I know the following works, but it keeps curl open for too long, which causes problems when we run many of these in parallel and the server we download from is busy.

using BSON
cmd = `curl <some data>`
BSON.load(open(cmd))

To close cmd ASAP, I have this:

# created IOBuffer to wrap bytes
import BSON.load
function BSON.load(bytes::Vector{UInt8})
    io = IOBuffer()
    write(io, bytes)
    seekstart(io)
    BSON.load(io)
end
cmd = `curl <some data>`
BSON.load(read(cmd))

which works, but I consider it very ugly. I'm also not sure whether it carries a performance penalty.

Is there a more elegant way to do this? Can I read(cmd) into some IO structure, which could be then passed to BSON.load?

I realized that exactly the same problem holds for Serialization.deserialize. My solution for deserialization is the same, but I welcome any improvements.

Matěj Račinský
  • When you say "but it keeps curl open for too long, which causes problems when we do this in parallel many times at once and the server where we download from is a bit busy" do you mean that BSON.load is slower than the download so it blocks the download process? – StefanKarpinski Sep 09 '20 at 16:16
  • Yes, exactly. But mostly it happened during JSON.parse, where parsing json took longer than downloading via curl. We have object storage which sits in same data center as the server with Julia, so it's pretty fast. – Matěj Račinský Sep 10 '20 at 07:13

1 Answer

It's a little unclear what your question means when you say that it "keeps curl open for too long", but here are two different ways to do this:

julia> using BSON

julia> url = "https://raw.githubusercontent.com/JuliaIO/BSON.jl/master/test/test.bson"
"https://raw.githubusercontent.com/JuliaIO/BSON.jl/master/test/test.bson"

julia> open(BSON.load, `curl -s $url`)
Dict{Symbol,Any} with 2 entries:
  :a => Complex{Int64}[1+2im, 3+4im]
  :b => "Hello, World!"

julia> BSON.load(IOBuffer(read(`curl -s $url`)))
Dict{Symbol,Any} with 2 entries:
  :a => Complex{Int64}[1+2im, 3+4im]
  :b => "Hello, World!"

The first version is similar to your first version, but it closes the curl process automatically as soon as BSON.load finishes. The second version reads the entire output of the curl call into a byte vector, so curl has already exited before parsing starts, then wraps that vector in an IOBuffer and calls BSON.load on it.
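Since the question mentions that the same issue comes up with Serialization.deserialize, the second pattern can be wrapped in a small helper. This is only a sketch: load_from_cmd is a hypothetical name, not part of BSON or Base, and the cat call on a temporary file merely stands in for the real curl command so that the example runs without network access (it assumes a Unix-like system where cat is available).

```julia
using Serialization

# Hypothetical helper (not part of any library): run `cmd` to completion,
# then hand its output to `parser` as an in-memory IO object.
# `read(cmd)` returns the command's standard output as a Vector{UInt8}
# and only returns once the process has finished.
function load_from_cmd(parser, cmd::Base.AbstractCmd)
    bytes = read(cmd)
    # IOBuffer(bytes) wraps the existing vector for reading; no write + seekstart needed.
    return parser(IOBuffer(bytes))
end

# Stand-in for the curl call: serialize a value to a temporary file and `cat` it.
path, io = mktemp()
serialize(io, Dict(:a => [1, 2, 3], :b => "Hello, World!"))
close(io)

result = load_from_cmd(deserialize, `cat $path`)
```

With BSON, the same helper would presumably be called as load_from_cmd(BSON.load, cmd) with the curl command from above. Passing the parser as an argument avoids adding a new method to BSON.load for Vector{UInt8}, which would extend a function from a package you don't own.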

StefanKarpinski
  • Oh, the `open(BSON.load, `curl -s $url`)` closes immediately? That's great. Just to be sure, is it somewhere in the docs? I had difficulty understanding when the process is kept open until the whole buffer is read and when it's closed immediately. – Matěj Račinský Sep 10 '20 at 07:15
  • Well, it closes immediately after `BSON.load` completes. In general `open(f, path)` calls `f` on the open file handle and then closes afterwards whether `f` returns or errors. It's the first documented method of `open` if you do `?open`. – StefanKarpinski Sep 10 '20 at 21:36
  • Yeah, we had problems with curl crashing when we had some consecutive parsing after it, like that, and sometimes connection was interrupted before the parsing finished. – Matěj Račinský Sep 11 '20 at 09:06
  • You probably want the second one in that case, since it reads all the data into a vector at once and then wraps that vector in an IO interface to pass it to the parsing code. – StefanKarpinski Sep 11 '20 at 15:39
  • Yes, exactly, I used the second solution and it solved most of these obscure problems. – Matěj Račinský Sep 11 '20 at 21:09