One solution is to match the parts of the binary you're looking for:
Data = <<"SPAMD/1.1 0 EX_OK\r\nContent-length: 728\r\nSpam: True ; 6.3 / 5.0\r\n\r\nReceived: from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100\r\nFrom: bibi <bibi@XXXXX.local>\r\nTo: <aZphki8N05@XXXXXXXX>\r\nSubject: i\r\nDate: Sat, 4 Jan 2020 18:24:36 +0100\r\nMessage-Id: <3b68dede-f1c3-4f04-62dc-f0b2de6e980a@PPPPPP.local>\r\nX-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal\r\nX-Spam-Flag: YES\r\nX-Spam-Level: ******\r\nX-Spam-Status: Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2\r\nMIME-Version: 1.0\r\nContent-Type: multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\"\r\n\r\n">>,
Matches = binary:compile_pattern([<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]),
[binary:part(Data, PosLen) || PosLen <- binary:matches(Data, Matches)].
Executing the three lines above in an Erlang shell returns:
[<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>, <<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>, <<"NO_FM_NAME_IP_HOSTN">>]
This produces the desired result, but it is fragile: it does not verify that the input is valid or that the matches fall on token boundaries, so a test name that appears anywhere else in the message, such as in the Subject header, would also be returned.
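To make the boundary problem concrete, here is a minimal sketch of the same binary:matches/2 technique returning a hit from a header other than X-Spam-Status. The module name naive_match and the shortened two-name pattern are illustrative assumptions, not part of the original code:

```erlang
-module(naive_match).
-export([find/1]).

%% Return every occurrence of a known test name anywhere in Data,
%% regardless of which header (or body part) it appears in.
find(Data) ->
    Pattern = binary:compile_pattern([<<"BODY_SINGLE_WORD">>,
                                      <<"HELO_MISC_IP">>]),
    [binary:part(Data, PosLen) || PosLen <- binary:matches(Data, Pattern)].
```

Calling naive_match:find(<<"Subject: HELO_MISC_IP\r\nX-Spam-Status: No, score=0.0 tests=none\r\n">>) returns [<<"HELO_MISC_IP">>], even though the X-Spam-Status header lists no such test.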
A potentially safer approach relies on the fact that the input binary resembles an HTTP response, so it can be partially parsed with Erlang's built-in packet decoders. The parse/1,2 functions below use erlang:decode_packet/3 to extract information from the input:
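%% Decode the first line, then dispatch on its contents.
parse(Data) ->
    {ok, Line, Rest} = erlang:decode_packet(line, Data, []),
    parse(Line, Rest).

%% Clause 1: the status line starts with "SPAMD/", so start decoding headers.
parse(<<"SPAMD/", _/binary>>, Data) ->
    parse(Data, []);
%% Clause 2: input exhausted; reduce the headers to {Key,Value} pairs.
parse(<<>>, Hdrs) ->
    Result = [{Key,Value} || {http_header, _, Key, _, Value} <- Hdrs],
    process_results(Result);
%% Clause 3: decode one header, skipping end-of-headers markers.
parse(Data, Hdrs) ->
    case erlang:decode_packet(httph, Data, []) of
        {ok, http_eoh, Rest} ->
            parse(Rest, Hdrs);
        {ok, Hdr, Rest} ->
            parse(Rest, [Hdr|Hdrs]);
        Error ->
            %% {more, _} on incomplete input, or {error, Reason}
            Error
    end.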
The parse/1 function first decodes the initial line of the input using the line decoder and passes the results to parse/2. The first clause of parse/2 matches the "SPAMD/" prefix of the initial line of the input data just to verify we're looking in the right place, then recursively invokes parse/2, passing the remaining Data and an empty accumulator list. The second and third clauses of parse/2 treat the data as HTTP headers. The second clause of parse/2 matches when the input data is exhausted; it maps the accumulated header list to a list of {Key,Value} pairs and passes it to a process_results/1 function, described below, to finish the data extraction. The third clause of parse/2 tries to decode the data via the httph HTTP header decoder, accumulating each matched header and ignoring any http_eoh end-of-headers markers that result from "\r\n" sequences embedded at odd places in the input.
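As a rough illustration of what the two decoders return, the sketch below wraps each erlang:decode_packet/3 call in a small helper. The module and function names (decode_demo, first_line/1, header/1) are made up for this example:

```erlang
-module(decode_demo).
-export([first_line/1, header/1]).

%% Split off the first line; the line decoder keeps the trailing "\r\n".
first_line(Data) ->
    {ok, Line, Rest} = erlang:decode_packet(line, Data, []),
    {Line, Rest}.

%% Decode a single HTTP-style header (or an end-of-headers marker).
header(Data) ->
    erlang:decode_packet(httph, Data, []).
```

With these helpers, decode_demo:first_line(<<"SPAMD/1.1 0 EX_OK\r\nSpam: True\r\n\r\n">>) returns {<<"SPAMD/1.1 0 EX_OK\r\n">>, <<"Spam: True\r\n\r\n">>}, and decode_demo:header(<<"\r\n">>) returns {ok, http_eoh, <<>>}. A complete header such as <<"X-Spam-Flag: YES\r\n\r\n">> decodes to an {ok, {http_header, _, "X-Spam-Flag", _, "YES"}, <<"\r\n">>} tuple, which is the shape the list comprehension in parse/2 matches on.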
For the input data provided in the question, the parse/1,2 functions ultimately pass the following list of key-value pairs to process_results/1:
[{'Content-Type',"multipart/mixed; boundary=\"----------=_5E10CA56.0200B819\""},{"Mime-Version","1.0"},{"X-Spam-Status","Yes, score=6.3 required=5.0 tests=BODY_SINGLE_WORD,\r\n\tDKIM_ADSP_NXDOMAIN,DOS_RCVD_IP_TWICE_C,HELO_MISC_IP,\r\n\tNO_FM_NAME_IP_HOSTN autolearn=no autolearn_force=no version=3.4.2"},{"X-Spam-Level","******"},{"X-Spam-Flag","YES"},{"X-Spam-Checker-Version","SpamAssassin 3.4.2 (2018-09-13) on\r\n\tdebpub1.cs2cloud.internal"},{"Message-Id","<3b68dede-f1c3-4f04-62dc-f0b2de6e980a@PPPPPP.local>"},{'Date',"Sat, 4 Jan 2020 18:24:36 +0100"},{"Subject","i"},{"To","<aZphki8N05@XXXXXXXX>"},{'From',"bibi <bibi@XXXXX.local>"},{"Received","from localhost by debpub1.cs2cloud.internal\r\n\twith SpamAssassin (version 3.4.2);\r\n\tSat, 04 Jan 2020 18:24:37 +0100"},{"Spam","True ; 6.3 / 5.0"},{'Content-Length',"728"}]
The process_results/1,2 functions first match the key of interest, which is "X-Spam-Status", and then extract the desired data from its value. The three clauses below implement process_results/1 to look for that key and process it, or return {error, not_found} if no such key is seen. The second clause matches the desired key, splits its associated value with string:lexemes/2 using the space, comma, carriage return, newline, tab, and equal sign characters as separators, and passes the split result along with an empty accumulator to process_results/2:
process_results([]) ->
    {error, not_found};
process_results([{"X-Spam-Status", V}|_]) ->
    process_results(string:lexemes(V, " ,\r\n\t="), []);
process_results([_|T]) ->
    process_results(T).
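Since the non-atom header names are plain strings in the first element of each pair, the same lookup can also be sketched with lists:keyfind/3. The module name status_lookup and the helper status_value/1 are assumptions for illustration; unlike process_results/1, this helper returns the raw header value instead of passing it on for splitting:

```erlang
-module(status_lookup).
-export([status_value/1]).

%% Look up the X-Spam-Status value in a list of {Key, Value} pairs.
status_value(Pairs) ->
    case lists:keyfind("X-Spam-Status", 1, Pairs) of
        {"X-Spam-Status", Value} -> {ok, Value};
        false -> {error, not_found}
    end.
```

The hand-written clauses in the answer do the same walk explicitly, which keeps the lookup and the hand-off to process_results/2 in one place.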
For the input data in the question, the list of strings passed to process_results/2 is
["Yes","score","6.3","required","5.0","tests","BODY_SINGLE_WORD","\r\n","DKIM_ADSP_NXDOMAIN","DOS_RCVD_IP_TWICE_C","HELO_MISC_IP","\r\n","NO_FM_NAME_IP_HOSTN","autolearn","no","autolearn_force","no","version","3.4.2"]
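The "\r\n" elements in this list are worth a note: string:lexemes/2 treats a carriage return followed by a newline as a single grapheme cluster, which matches neither the \r nor the \n separator, so each CRLF pair comes through as its own lexeme. The sketch below contrasts that with passing the pair as one separator; the module and function names are illustrative only:

```erlang
-module(lexemes_demo).
-export([split_chars/1, split_crlf/1]).

%% Same separator list as process_results/1: \r and \n are separate entries.
split_chars(V) ->
    string:lexemes(V, " ,\r\n\t=").

%% Here the CRLF pair is one separator, written as a nested list.
split_crlf(V) ->
    string:lexemes(V, [$\s, $,, $\t, $=, [$\r, $\n]]).
```

For the input "tests=A,\r\n\tB", split_chars/1 returns ["tests","A","\r\n","B"] while split_crlf/1 returns ["tests","A","B"]. Either way works here, because the catch-all clause of process_results/2 discards anything that is not a wanted name.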
The clauses of process_results/2 below recursively walk the list of strings produced by string:lexemes/2 and accumulate the matched results. Each of the second through sixth clauses matches one of the values we seek and converts the matched string to a binary before accumulating it.
process_results([], Results) ->
    {ok, lists:reverse(Results)};
process_results([V="BODY_SINGLE_WORD"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="DKIM_ADSP_NXDOMAIN"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="DOS_RCVD_IP_TWICE_C"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="HELO_MISC_IP"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([V="NO_FM_NAME_IP_HOSTN"|T], Results) ->
    process_results(T, [list_to_binary(V)|Results]);
process_results([_|T], Results) ->
    process_results(T, Results).
The final clause ignores unwanted data. The first clause of process_results/2 is invoked when the list of strings is empty, and it returns the reversed accumulator wrapped in an ok tuple. For the input data in the question, the final result of process_results/2 is:
{ok, [<<"BODY_SINGLE_WORD">>,<<"DKIM_ADSP_NXDOMAIN">>,<<"DOS_RCVD_IP_TWICE_C">>,<<"HELO_MISC_IP">>,<<"NO_FM_NAME_IP_HOSTN">>]}
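Because the five matching clauses differ only in the literal they match, the same filtering could also be sketched with a list comprehension and lists:member/2. This is an alternative formulation under the assumption that the set of wanted names is fixed; the module name spam_tests, the function wanted/1, and the WANTED macro are hypothetical:

```erlang
-module(spam_tests).
-export([wanted/1]).

%% Test names to keep; everything else in the token list is dropped.
-define(WANTED, ["BODY_SINGLE_WORD", "DKIM_ADSP_NXDOMAIN",
                 "DOS_RCVD_IP_TWICE_C", "HELO_MISC_IP",
                 "NO_FM_NAME_IP_HOSTN"]).

%% Keep the wanted tokens, in input order, converted to binaries.
wanted(Tokens) ->
    {ok, [list_to_binary(T) || T <- Tokens, lists:member(T, ?WANTED)]}.
```

The comprehension preserves input order, so no lists:reverse/1 is needed; the explicit clauses in the answer have the advantage that each wanted name is checked by pattern matching in the function head.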