4

I am reading postcodes from a csv file, taking that data and caching it with ets.

The postcode file is quite large (95MB) as it contains about 1.8 million entries.

I am only caching the postcodes that are needed for look ups at the moment (about 200k) so the amount of data stored in ets should not be an issue. However no matter how small the number of inserts into ets is, the amount of memory taken up by the process is virtually unchanged. Doesn't seem to matter if I insert 1 row or all 1.8 million.

# Not listing all function defs so this is not too long.
# Comment if more info is needed.
defmodule PostcodeCache do
  use GenServer

  def cache_postcodes do
    "path_to_postcode.csv"
    |> File.read!()
    |> function_to_parse()
    |> function_to_filter()
    |> function_to_format()
    |> Enum.map(&(:ets.insert_new(:cache, &1)))
  end
end
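For context, the snippet assumes an ETS table named `:cache` already exists. A minimal sketch of creating it in the GenServer's `init/1` (the table options here are assumptions, since the original does not show them) might look like:

```elixir
defmodule PostcodeCache do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  @impl true
  def init(:ok) do
    # :named_table lets callers refer to the table as :cache;
    # the table is owned by (and dies with) this GenServer process.
    :ets.new(:cache, [:set, :named_table, :public])
    {:ok, %{}}
  end
end
```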

I am running this in the terminal with `iex -S mix` and then running `:observer.start()`. When I go to the processes tab, my PostcodeCache process memory is massive (over 600MB).

Even if I filter the file so I only end up storing 1 postcode in :ets it is still over 600MB.

RobStallion
  • Can you show the logic of function_to_parse, filter and format? I think it comes from that. – TheAnh Jan 09 '19 at 15:05
  • If you're doing the parsing from inside the gen_server itself you might want to call hibernate once the parsing is done, so that GC is executed, that might help it? – m3characters Jan 09 '19 at 15:19
  • @m3characters I haven't come across that before but I'll take a look and see if it can be used here . Thanks – RobStallion Jan 09 '19 at 15:34

1 Answer

6

I realised that my mistake was looking at the memory of the process and assuming it was all down to the cache.

Because this is a GenServer, it holds onto the entire contents of the csv file once it is read (`File.read!`), and it also appears to hold onto every intermediate transformation of that data.
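A minimal sketch of the underlying effect (the binary here is a stand-in for the csv contents, not the actual file): slices pattern-matched out of a large binary are sub-binaries that keep a reference to the whole original, and the process heap is not collected immediately, so a process can report far more memory than the data it actually kept.

```elixir
# Build a ~10MB binary standing in for the csv contents.
big = :binary.copy("x", 10_000_000)

# A pattern-matched slice is a sub-binary: it references `big`,
# so holding only `prefix` can still pin all 10MB in memory.
<<prefix::binary-size(8), _rest::binary>> = big

# :binary.copy/1 detaches the slice from the original binary,
# letting the large binary be garbage-collected.
standalone = :binary.copy(prefix)

# Forcing a collection (or hibernating the GenServer, as suggested
# in the comments above) frees garbage still on the process heap.
:erlang.garbage_collect(self())
```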

I solved this by changing the `File.read!` to a `File.stream!`, and by using `Enum.each` instead of mapping over the returned data.

In the each I check the postcode is what I want and if it is I then insert it into ets.

def cache_postcodes do
  "path_to_postcode.csv"
  |> File.stream!()
  |> Enum.each(fn line ->
    value_to_store = some_check_on_line(line)
    :ets.insert_new(:cache, value_to_store)
  end)
end

With this approach, my process's memory is now only about 2MB (not 632MB) and my ets memory is about 30MB, which is about what I would expect.

RobStallion
    This makes a little sense though. The correct approach would be to use [`Stream.each/2`](https://hexdocs.pm/elixir/master/Stream.html#each/2) and terminate the stream afterward with `Enum.to_list/1`. – Aleksei Matiushkin Jan 17 '19 at 16:37
  • Oh cool. Thanks for this. I'll have a look into this option and see if I can get this to work. Will comment and/or update the answer if I can get this to work in my example – RobStallion Jan 17 '19 at 16:42
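Following the last comment, a fully lazy variant of the answer's function (using `Stream.each/2`; `some_check_on_line/1` is the hypothetical helper from the answer above) could look like:

```elixir
def cache_postcodes do
  "path_to_postcode.csv"
  |> File.stream!()
  |> Stream.each(fn line ->
    value_to_store = some_check_on_line(line)
    :ets.insert_new(:cache, value_to_store)
  end)
  # Stream.run/1 forces the lazy stream for its side effects;
  # Enum.to_list/1 (as in the comment) also works, but builds a
  # throwaway list of :ok values.
  |> Stream.run()
end
```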