
I'm currently very interested in learning Elixir, and my typical approach to learning a new language is to build simple programs with it.

So I decided to write a (very simple) grep-like program (or module), which is shown here:

defmodule LineMatch do
  def file(s, path) do
    case File.open(path, [:read]) do
      {:ok, fd} -> match s, fd
      {_, error} -> IO.puts "#{:file.format_error error}"
    end
  end
  defp match(s, fd) do
    case IO.read fd, :line do
      {:error, error} -> IO.puts("oh noez! Error: #{error}")
      line -> match(s, line, fd)
    end
  end
  defp match(s, line, fd) when line !== :eof do
    if String.contains?(line, s) do
      IO.write line
    end
    match(s, IO.read(fd, :line), fd)
  end
  defp match(_s, :eof, _fd), do: nil
end

This is most probably not the most elegant way to do it, and I'm also quite new to functional programming, so I didn't expect it to be super fast. But it is not only not fast, it is actually super slow. So slow that I probably made some very obvious mistake.

Can anyone tell me what it is and how to make it better?

I typically test the code with a separate .exs file like this:

case System.argv do
  [searchTerm, path] -> LineMatch.file(searchTerm, path)
  _ -> IO.puts "Usage: lineMatch searchTerm path"
end
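
Assuming the module and this script live together in a single .exs file, say line_match.exs (the file name is my choice), it can be run with:

elixir line_match.exs searchTerm path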
koehr

2 Answers


Rather than reading in the whole file as in lad2025's answer, you can get good performance by doing two things. First, use IO.binstream to build a stream of the file's lines as raw binaries (for performance): IO.stream reads as UTF-8, and so incurs the extra cost of that conversion while reading the file. If you need UTF-8 conversion, then it's going to be slow. Second, applying the filtering and mapping operations with the Stream.filter/2 and Stream.map/2 functions keeps the pipeline lazy, so you avoid iterating over the lines multiple times.

defmodule Grepper do
  def grep(path, str) do
    case File.open(path) do
      {:error, reason} -> IO.puts "Error grepping #{path}: #{reason}"
      {:ok, file} ->
        IO.binstream(file, :line)
        |> Stream.filter(&(String.contains?(&1, str)))
        |> Stream.map(&(IO.puts(IO.ANSI.green <> &1 <> IO.ANSI.reset)))
        |> Stream.run
    end
  end
end

I suspect the greatest issue with your code is the UTF-8 conversion, but it may also be that by "pulling" from the file line by line with IO.read, rather than having the lines "pushed" to your filtering/printing operations via IO.stream/IO.binstream, you are incurring some extra cost there. I'd have to look at Elixir's source to know for sure, but the above code was quite performant on my machine (I was searching a 143 KB file from the Olson timezone database; not sure how it will perform on very large files, as I don't have a good sample file handy).
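
For reference, a minimal timing sketch using Erlang's :timer.tc (the file name and search term are placeholders, not values from the thread):

# Time one run of Grepper.grep; "sample.txt" and "needle" are placeholders.
{usecs, _result} = :timer.tc(fn -> Grepper.grep("sample.txt", "needle") end)
IO.puts "grep took #{div(usecs, 1000)} ms"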

bitwalker
  • Thank you for your very detailed answer! I will definitely try your suggestion. I used http://www.generatedata.com/ to generate example data. I still wonder why it is THAT slow. Sure, UTF-8 conversion slows it down but I got output like one line per 500ms. – koehr Jul 14 '15 at 09:00
  • I really doubt it's the UTF-8 conversion. Dealing with files via File.read is very slow, since every I/O transaction is a message between processes in Elixir. It's almost always faster to read one large chunk and then parse that single binary. – Fred the Magic Wonder Dog Jul 16 '15 at 21:26
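
To make Fred's suggestion concrete, here is a minimal sketch of the read-one-large-chunk approach (the ChunkGrep module name and the plain newline split are my assumptions, not code from the thread):

defmodule ChunkGrep do
  # Read the whole file in a single I/O transaction, then filter in-process.
  def grep(path, str) do
    case File.read(path) do
      {:error, reason} -> IO.puts "Error reading #{path}: #{reason}"
      {:ok, contents} ->
        contents
        |> String.split("\n")
        |> Enum.filter(&String.contains?(&1, str))
        |> Enum.each(&IO.puts/1)
    end
  end
end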

Using File.stream! will be much more efficient. Try it:

defmodule Grepper do
  def grep(path, str) do
    File.stream!(path)
    |> Stream.filter(&(String.contains?(&1, str)))
    |> Stream.map(&(IO.puts(IO.ANSI.green <> &1 <> IO.ANSI.reset)))
    |> Stream.run
  end
end
Roman Smirnov
  • By far the quickest. :timer.tc says 2600 to 3000ms, while the binstream variant from bitwalker takes between 6400 and 6900ms. Now I'll play with utf-8 conversion to see if this is the source of that original slowness. – koehr Jul 14 '15 at 12:26
  • So using File.stream!(path, [:utf8]) instead didn't change the speed at all. I still don't get why what I did is so slow. I see that it is slower, yes. But slow as in 500ms between each line of output? – koehr Jul 14 '15 at 12:57
  • That's really really strange, as both `IO.stream` and `File.stream` do effectively the same thing (`IO.stream` is abstracted to work with any IO device, but the implementation of reading the stream is basically identical). Strange that there would be such a large difference between Roman's solution and mine, but I'll have to keep it in mind. `File.stream!` doesn't have a :utf8 mode though; you have to pass `:raw` to simulate my `binstream` implementation, as `File.stream!` does UTF-8 conversion by default from what I can see. – bitwalker Jul 14 '15 at 16:49
  • I tried the new implementations on my 900MB test data set and it is still remarkably slow. While my original implementation took 39 seconds, the File.stream!(_, [:raw]) version still takes 12 (without :raw, it takes just 1-2 seconds more, but IO.binstream is super slow too, with 38s). Just using the original grep on the command line takes half a second. – koehr Jul 15 '15 at 01:05
  • I think it is not correct to compare such a simple algorithm with the original grep. You could look inside the [grep source code](http://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c) and see that it implements a completely different algorithm. So, to get comparable results, we would have to reimplement the same algorithm in Elixir. – Roman Smirnov Jul 16 '15 at 13:44
  • File.stream! opens the file with :raw and :read_ahead; IO.binstream only reads as many bytes as needed to reach the next line. – Fred the Magic Wonder Dog Jul 16 '15 at 21:33
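
Following up on Fred's last point, one experiment worth trying (my own sketch, not from the thread) is to open the file with :read_ahead buffering before handing it to IO.binstream, which approximates part of what File.stream! gets for free; whether it closes the whole gap is untested here:

# Sketch: buffered reads for the binstream variant; path and term are placeholders.
{:ok, file} = File.open("sample.txt", [:read, :read_ahead])

IO.binstream(file, :line)
|> Stream.filter(&String.contains?(&1, "needle"))
|> Stream.each(&IO.write/1)
|> Stream.run

File.close(file)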