Most efficient way to search file for byte patterns in Elixir

Question

I am searching for id3 tags in a song file. A file can have id3v1, id3v1 extended tags (located at the end of the file) as well as id3v2 tags (usually located at the beginning). For the id3v1 tags, I can use File.read(song_file) and pull out the last 355 bytes (128 + 227 for the extended tag). However, for the id3v2 tags, I need to search through the file from the beginning looking for a 10 byte id3v2 pattern. I want to avoid any overhead from opening and closing the same file repeatedly as I search for the different tags, so I thought the best way would be to use File.stream!(song_file) and send the file stream to different functions to search for the different tags.

def parse(file_name) do
  file_stream = File.stream!(file_name, [], 1)
  id3v1_tags(file_stream)
  |> add_tags(id3v2_tags(file_stream))
end

def id3v1_tags(file_stream) do
  tags = Tags%{} #struct containing desired tags
  << id3_extended_tag :: binary-size(227), id3_tag :: binary-size(128) >> = Stream.take(file_stream, -355)
  id3_tag = to_string(id3_tag)
  if String.slice(id3_tag,0, 3) == "TAG" do
    Map.put(tags, :title, String.slice(id3_tag, 3, 30))
    Map.put(tags, :track_artist, String.slice(id3_tag, 33, 30))
    ...
  end
  if String.slice(id3_extended_tag, 0, 4) == "TAG+" do
    Map.put(tags, :title, tags.title <> String.slice(id3_extended_tag, 4, 60))
    Map.put(tags, :track_artist, tags.track_artist <> String.slice(id3_extended_tag, 64, 60))
    ...
  end
end

def id3v2_tags(file_stream) do
  search for pattern:
  <<0x49, 0x44, 0x33, version1, version2, flags, size1, size2, size3, size4>>
end

1) Am I saving any runtime by creating the File.stream! once and sending it to the different functions (I will be scanning tens of thousands of files, so saving a bit of time is important)? Or should I just use File.read for the id3v1 tags and File.stream! for the id3v2 tags?

2) I get an error in the line:

  << id3_extended_tag :: binary-size(227), id3_tag :: binary-size(128) >> = Stream.take(file_stream, -355)

because Stream.take(file_stream, -355) is a function, not a binary. How do I turn it into a binary that I can pattern match?

I don't have an answer for you but if I were you, I'd really avoid magic numbers. I'm assuming the -355 is the sum of 128 and 227? If I were you I'd set a module attribute called `@id_extended_tag_binsize` to 227 and `@id3_tag_binsize` to 128 and then use those attributes in the Stream.take like this: Stream.take(file_stream, -(`@id_extended_tag_binsize` + `@id3_tag_binsize`)) Less error prone and far easier to read and understand. — Onorio Catenacci, May 27 '15 at 17:18
First off, File.stream needs to be told to read bytes rather than lines using the syntax File.stream!(path, [:read ], :bytes ) . Next I don't think the -355 can be used with Stream.take since stream is not a list but a potential list that needs to be finalised with a Stream.to_list call. — GavinBrelstaff, May 27 '15 at 17:42
Thanks for the suggestions. I have changed my code accordingly. — Paul B, May 27 '15 at 20:21

score 8 · Accepted Answer · answered May 27 '15 at 18:00

I believe your implementation is being unnecessarily complex due to the reliance on stream. Make it work, make it pretty then make it fast (but only if necessary).

For simplicity, I would first load everything into memory. Just use File.read!/1. Then you can use the functions in :binary module to search for patterns (:binary.match/2), split it (:binary.split/2) or grab a certain part (:binary.part/3). There is no need to mix File.stream and File.read too, just read it once and pass that same binary around.

Also, very important, don't use the String module. String is meant to work UTF-8 encoded binaries. You want to use the :binary module for all byte level operation.

Finally, Stream.take/2 always returns functions as it is lazy. You want to use Enum.take/2 instead (it accepts streams as streams are also enumerables). Although, as I said, I would skip the stream stuff altogether.

I appreciate the answer. I will take your advice and use File.read!/1. I didn't know that about the String module. I thought that if the bytes were characters, then you could treat them as either binaries or strings. — Paul B, May 27 '15 at 20:26

Most efficient way to search file for byte patterns in Elixir

1 Answers1