
There is a large (does not fit in memory) .json file with the following content:

[{
    "doc_number": "xxx",
    "other": "data"
}, {
    "doc_number": "yyy",
    "other": "data"
}, {
    "doc_number": "zzz",
    "other": "data"
}]

I would like to read it as fast as possible using as little memory as possible. In other languages I usually create a lazy sequence of the file and read only when necessary. I was wondering if Erlang has an idiomatic way of achieving that.

Istvan
  • There are quite a few libraries for JSON, each taking a different approach based on the tradeoffs selected. From what you say you want "low memory use, fast translation", you might want Jiffy -- but use Erlang's own facilities to guarantee you seek toward and *only* read the absolute minimum amount of data in at a time. The most important thing you'll hear (and the flood of advice you'd get if you asked on the ML, btw) is to not assume you've got a performance problem that mandates approach X over approach Y *until you've measured a few approaches* and actually verify the performance profile. – zxq9 Dec 09 '15 at 09:59

1 Answer


jsx can be used as an incremental parser, but for your data format you have to write your own callback module:

-module(jsx_increment).

-export([parse_file/1]).

-export([init/1, handle_event/2]).

parse_file(FN) ->
    {ok, File} = file:open(FN, [read, raw, binary]),
    read(File, jsx:decoder(?MODULE, [], [stream, return_tail])),
    file:close(File).

read(File, JSX) ->
    {ok, Data} = file:read(File, 8), %% tiny chunk for demo; a premature eof crashes here with badmatch
    case JSX(Data) of
        {incomplete, F} ->
            read(File, F);
        {with_tail, _, Tail} ->
            Tail =/= <<>> andalso io:format("Surplus content: ~s~n", [Tail])
    end.

init(_) ->
    start.

handle_event(start_array, start) ->
    [];
handle_event(_, start) ->
    error(expect_array);
handle_event(start_object, L) ->
    [start_object|L];
handle_event(start_array, L) ->
    [start_array|L];
handle_event(end_object, L) ->
    check_out(collect_object(L));
handle_event(end_array, []) ->
    stop;
handle_event(end_array, L) ->
    check_out(collect_array(L));
handle_event(E, L) ->
    check_out([event(E)|L]).

check_out([X]) ->
    io:format("Collected object: ~p~n", [X]),
    [];
check_out(L) -> L.

event({_, X}) -> X;
event(X) -> X.

collect_object(L) ->
    collect_object(L, #{}).

collect_object([start_object|T], M) ->
    [M|T];
collect_object([V, K|T], M) ->
    collect_object(T, M#{K => V}).

collect_array(L) ->
    collect_array(L, []).

collect_array([start_array|T], L) ->
    [L|T];
collect_array([H|T], L) ->
    collect_array(T, [H|L]).
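
If you literally want a lazy, pull-based sequence rather than side-effecting prints, a common Erlang idiom is to run the parser in its own process and let it block after each object until the consumer asks for the next one. A minimal sketch, assuming you change check_out/1 above to wait for a request instead of printing (the function and message names here are my own invention, not part of jsx):

```erlang
%% Hypothetical change to check_out/1: block until a consumer asks,
%% then hand over the object.
%%
%%   check_out([X]) ->
%%       receive {next, From} -> From ! {object, X} end,
%%       [];
%%   check_out(L) -> L.

%% Consumer side: spawn the parser and pull objects one at a time.
stream(FN) ->
    spawn_link(fun() -> jsx_increment:parse_file(FN) end).

next(Parser) ->
    Parser ! {next, self()},
    receive
        {object, Obj} -> {ok, Obj}
    after 1000 ->
        done  %% crude end-of-stream detection for this sketch
    end.
```

This way only one decoded object is in flight at a time, which is about as close to a lazy sequence as Erlang's process model gets.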

And your example:

1> io:put_chars(element(2, file:read_file("data.json"))).
[{
    "doc_number": "xxx",
    "other": "data"
}, {
    "doc_number": "yyy",
    "other": "data"
}, {
    "doc_number": "zzz",
    "other": "data"
}]
ok
2> jsx_increment:parse_file("data.json").
Collected object: #{<<"doc_number">> => <<"xxx">>,<<"other">> => <<"data">>}
Collected object: #{<<"doc_number">> => <<"yyy">>,<<"other">> => <<"data">>}
Collected object: #{<<"doc_number">> => <<"zzz">>,<<"other">> => <<"data">>}
ok

This is proof-of-concept code which you will have to adapt to your use case anyway: handle errors and so on. (The map syntax used here works only from OTP 18 onward. Use maps:put(K, V, M) on OTP 17 and proplists before OTP 17.)
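
For example, the read/2 loop above crashes with a badmatch on a truncated file. A hedged sketch of how you might handle eof and read errors explicitly instead (same names as the module above, untested):

```erlang
%% Hypothetical variant of read/2 with explicit eof/error handling.
read(File, JSX) ->
    case file:read(File, 8) of
        {ok, Data} ->
            case JSX(Data) of
                {incomplete, F} ->
                    read(File, F);
                {with_tail, _, Tail} ->
                    Tail =/= <<>> andalso
                        io:format("Surplus content: ~s~n", [Tail])
            end;
        eof ->
            error(unexpected_eof);       %% JSON ended mid-document
        {error, Reason} ->
            error({read_failed, Reason}) %% e.g. disk or permission error
    end.
```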

Hynek -Pichi- Vychodil