Tricky pattern matching of a binary string in Erlang

Question

I am using Erlang to send messages between an email server and SpamAssassin.

What I want to achieve is retrieving the tests run by SA to generate a report (I am writing a kind of mail-tester program).

When SpamAssassin answers (over raw TCP), it sends a binary string like this one:

<<"SPAMD/1.1 0 EX_OK\r\nContent-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\nReceived: from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100\r\nFrom: bibi <bibi@XXXXX.local>\r\nTo: <aZphki8N05@XXXXXXXX>\r\nSubject: i\r\nDate: Sat, 4 Jan 2020 18:24:36 +0100\r\nMessage-Id: <3b68dede-f1c3-4f04-62dc-f0b2de6e980a@PPPPPP.local>\r\nX-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal\r\nX-Spam-Flag: YES\r\nX-Spam-Level: ******\r\nX-Spam-Status: Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2\r\nMIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\"\r\n\r\n">>

These are the items I want to pick up:

BODY_SINGLE_WORD
DKIM_ADSP_NXDOMAIN
DOS_RCVD_IP_TWICE_C
HELO_MISC_IP
NO_FM_NAME_IP_HOSTN

I then want to serialize them like this: [<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,…]

But that is not easy: the terms have no regular delimiter, and some are followed by \r\n or \r\n\t.

I started with this expression (splitting the binary string on ','), but the result is incomplete:

Data3 = string:split(BinaryString, ",", all),
case lists:member(<<"HELO_MISC_IP">>, Data3) of
    true -> ok; % push the result in a database
    false -> ok
end;

I wish I could take another approach and loop over the binary through recursion (because it is a clean and nice way to loop), but it looks pointless to me in this scenario …

split(BinaryString, Idx, Acc) ->
case BinaryString of
    <<"tests=",_This:Idx/binary, Char, Tail/binary>> ->
                case lists:member(Char, BinaryString ) of
                    false ->
                        split(BinaryString, Idx+1, Acc);
                    true -> 
                           case Tail of
                                    <<Y/binary, _Tail/binary>> ->
                                    %doing something
                                    <<_Yop2/binary>> ->
                                    %doing somethin else
                           end
                 end;

The thing is, I don't see how to achieve this in an acceptable and clean way.

If anyone could give me a hand, that would be greatly appreciated.

Yours

score 5 · Accepted Answer · answered Jan 05 '20 at 14:42

One solution is to match the parts of the binary you're looking for:

Data = <<"SPAMD/1.1 0 EX_OK\r\nContent-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\nReceived: from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100\r\nFrom: bibi <bibi@XXXXX.local>\r\nTo: <aZphki8N05@XXXXXXXX>\r\nSubject: i\r\nDate: Sat, 4 Jan 2020 18:24:36 +0100\r\nMessage-Id: <3b68dede-f1c3-4f04-62dc-f0b2de6e980a@PPPPPP.local>\r\nX-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal\r\nX-Spam-Flag: YES\r\nX-Spam-Level: ******\r\nX-Spam-Status: Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2\r\nMIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\"\r\n\r\n">>,
Matches = binary:compile_pattern([<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]),
[binary:part(Data, PosLen) || PosLen <- binary:matches(Data, Matches)].

Executing the three lines above in an Erlang shell returns:

[<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>, <<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>, <<"NO_FM_NAME_IP_HOSTN">>]

This provides the desired result, but it might not be safe since it doesn't do anything to try to verify whether the input is valid or whether the matches occur on valid boundaries.
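One possible way to address the boundary concern is sketched below: a match is kept only when the bytes on either side of it could not belong to a longer test name. The module name bounded_match, the function bounded_matches/2, and the assumption that test names consist only of uppercase letters, digits, and underscores are illustrative and not part of the original answer.

```erlang
-module(bounded_match).
-export([bounded_matches/2]).

%% Return the occurrences of Names in Data that are not embedded in a
%% longer run of name characters.
bounded_matches(Data, Names) when is_binary(Data) ->
    Pattern = binary:compile_pattern(Names),
    [binary:part(Data, PosLen)
     || {Pos, Len} = PosLen <- binary:matches(Data, Pattern),
        boundary(Data, Pos - 1),
        boundary(Data, Pos + Len)].

%% An index outside the binary counts as a boundary.
boundary(Data, Idx) when Idx < 0; Idx >= byte_size(Data) ->
    true;
boundary(Data, Idx) ->
    not is_name_char(binary:at(Data, Idx)).

%% Assumption: test names use only A-Z, 0-9 and underscore.
is_name_char(C) ->
    (C >= $A andalso C =< $Z) orelse
    (C >= $0 andalso C =< $9) orelse
    C =:= $_.
```

For example, calling bounded_match:bounded_matches(<<"tests=HELO_MISC_IP,XHELO_MISC_IP_2 ">>, [<<"HELO_MISC_IP">>]) keeps the first occurrence and rejects the second, which is embedded in a longer name. This still does not validate the overall structure of the input.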

A potentially safer approach relies on the fact that the input binary resembles an HTTP result, and so it can be partially parsed with built-in Erlang decoders. The parse/1,2 functions below use erlang:decode_packet/3 to extract information from the input:

parse(Data) ->
    {ok, Line, Rest} = erlang:decode_packet(line, Data, []),
    parse(Line, Rest).
parse(<<"SPAMD/", _/binary>>, Data) ->
    parse(Data, []);
parse(<<>>, Hdrs) ->
    Result = [{Key,Value} || {http_header, _, Key, _, Value} <- Hdrs],
    process_results(Result);
parse(Data, Hdrs) ->
    case erlang:decode_packet(httph, Data, []) of
        {ok, http_eoh, Rest} ->
            parse(Rest, Hdrs);
        {ok, Hdr, Rest} ->
            parse(Rest, [Hdr|Hdrs]);
        Error ->
            Error
    end.

The parse/1 function initially decodes the first line of the input using the line decoder, passing the results to parse/2. The first clause of parse/2 matches the "SPAMD/" prefix of the initial line of the input data just to verify we're looking in the right place, then recursively invokes parse/2 passing the remaining Data and an empty accumulator list. The second and third clauses of parse/2 treat the data as HTTP headers. The second clause of parse/2 matches when the input data is exhausted; it maps the accumulated header list to a list of {Key,Value} pairs and passes it to a process_results/1 function, described below, to finish the data extraction. The third clause of parse/2 tries to decode the data via the httph HTTP header decoder, accumulating each matched header and ignoring any http_eoh end-of-headers markers that result from "\r\n" sequences embedded at odd places in the input.
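To see what a single decoding step produces, the following minimal sketch wraps one call to erlang:decode_packet/3 with the httph decoder. The module name httph_demo and the function first_header/1 are hypothetical helpers introduced only for illustration.

```erlang
-module(httph_demo).
-export([first_header/1]).

%% Decode one header from the front of Data and return its key and
%% value together with the undecoded remainder. Any other decoder
%% result (http_eoh, {more, _}, {error, _}) is returned unchanged.
first_header(Data) when is_binary(Data) ->
    case erlang:decode_packet(httph, Data, []) of
        {ok, {http_header, _, Key, _, Value}, Rest} ->
            {Key, Value, Rest};
        Other ->
            Other
    end.
```

Calling httph_demo:first_header(<<"Content-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\n">>) should yield the key 'Content-Length' with the value "728". The decoder returns header names it recognizes as atoms and other names as strings, which is why the keys in the result list are a mix of both.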

For the input data provided in the question, the parse/1,2 functions ultimately pass the following list of key-value pairs to process_results/1:

[{'Content-Type',"multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\""},{"Mime-Version","1.0"},{"X-Spam-Status","Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2"},{"X-Spam-Level","******"},{"X-Spam-Flag","YES"},{"X-Spam-Checker-Version","SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal"},{"Message-Id","<3b68dede-f1c3-4f04-62dc-f0b2de6e980a@PPPPPP.local>"},{'Date',"Sat, 4 Jan 2020 18:24:36 +0100"},{"Subject","i"},{"To","<aZphki8N05@XXXXXXXX>"},{'From',"bibi <bibi@XXXXX.local>"},{"Received","from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100"},{"Spam","True ; 6.3 / 5.0"},{'Content-Length',"728"}]

The process_results/1,2 functions first match the key of interest, which is "X-Spam-Status", and then extract the desired data from its value. The three clauses below implement process_results/1, which looks for that key and processes it, or returns {error, not_found} if no such key is seen. The second clause matches the desired key, splits its associated value on the space, comma, carriage return, newline, tab, and equal sign characters, and passes the split result along with an empty accumulator to process_results/2:

process_results([]) ->
    {error, not_found};
process_results([{"X-Spam-Status", V}|_]) ->
    process_results(string:lexemes(V, " ,\r\n\t="), []);
process_results([_|T]) ->
    process_results(T).

For the input data in the question, the list of strings passed to process_results/2 is

["Yes","score","6.3","required","5.0","tests","BODY_SINGLE_WORD","\r\n","DKIM_ADSP_NXDOMAIN","DOS_RCVD_IP_TWICE_C","HELO_MISC_IP","\r\n","NO_FM_NAME_IP_HOSTN","autolearn","no","autolearn_force","no","version","3.4.2"]
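The "\r\n" entries remain in this list because string:lexemes/2 works on grapheme clusters and treats "\r\n" as a single cluster, which the separate \r and \n separators do not match. They are harmless here, since the catch-all clause of process_results/2 skips them. If a cleaner token list is preferred, a small sketch (the module lexeme_demo and the function tokens/1 are hypothetical names) adds the two-character cluster to the separator list:

```erlang
-module(lexeme_demo).
-export([tokens/1]).

%% Split Value on the same separators as the answer, plus the "\r\n"
%% grapheme cluster, given as a nested list so that it is treated as
%% one separator.
tokens(Value) ->
    Separators = " ,\r\n\t=" ++ [[$\r, $\n]],
    string:lexemes(Value, Separators).
```

With this separator list, lexeme_demo:tokens("tests=A,\r\n\tB C") returns ["tests","A","B","C"].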

The clauses of process_results/2 below recursively walk this list of strings and accumulate the matched results. Each of the second through sixth clauses matches one of the values we seek, and each converts the matched string to a binary before accumulating it.

process_results([], Results) ->
    {ok, lists:reverse(Results)};
process_results([V="BODY_SINGLE_WORD"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="DKIM_ADSP_NXDOMAIN"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="DOS_RCVD_IP_TWICE_C"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="HELO_MISC_IP"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="NO_FM_NAME_IP_HOSTN"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([_|T], Results) ->
    process_results(T, Results).

The final clause ignores unwanted data. The first clause of process_results/2 is invoked when the list of strings is empty, and it just returns the reversed accumulator. For the input data in the question, the final result of process_results/2 is:

{ok, [<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]}
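The clauses above hard-code the five test names. If the names are not known in advance, one alternative is to take every token that follows "tests" until the first token that does not look like a test name. This is only a sketch under assumptions: the module spam_tests, the functions tests/1 and is_test_name/1, and the rule that a test name consists of uppercase letters, digits, and underscores are not part of the original answer.

```erlang
-module(spam_tests).
-export([tests/1]).

%% Tokens is the list of strings produced by string:lexemes/2 from the
%% "X-Spam-Status" header value.
tests(Tokens) ->
    case lists:dropwhile(fun(T) -> T =/= "tests" end, Tokens) of
        [] ->
            {error, not_found};
        ["tests" | Rest] ->
            %% Drop leftover "\r\n" tokens, then keep names until the
            %% first token that is not a test name (e.g. "autolearn").
            Candidates = [T || T <- Rest, T =/= "\r\n"],
            Names = lists:takewhile(fun is_test_name/1, Candidates),
            {ok, [list_to_binary(N) || N <- Names]}
    end.

%% Assumption: test names use only A-Z, 0-9 and underscore.
is_test_name(Token) ->
    lists:all(fun(C) ->
                      (C >= $A andalso C =< $Z) orelse
                      (C >= $0 andalso C =< $9) orelse
                      C =:= $_
              end, Token).
```

Given the list of strings shown earlier for the question's input, spam_tests:tests/1 returns the same {ok, ...} result as process_results/2, without listing the names in the code.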

@lambda79 please mark this answer as having solved your problem. — RichardC, Jan 09 '20 at 14:25
