I am porting this Python code...
import numpy as np

with open(filename, 'r') as f:
    results = [np.array(line.strip().split(' ')[:-1], float)
               for line in filter(lambda l: l[0] != '#', f.readlines())]
...to Julia. I came up with:
results = [map(ss -> parse(Float64, ss), split(s, " ")[1:end-1])
for s in filter(s -> s[1] !== '#', readlines(filename))];
The main reason for this porting is a potential performance gain, so I timed the two snippets in a Jupyter notebook:
- Python: using %%timeit, I get
  12.8 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- Julia: @benchmark returns (among other things)
  mean time: 8.250 ms (2.62% GC)

So far so good; I do get a performance boost. However, when using @time, I get something along the lines of

0.103095 seconds (130.44 k allocations: 11.771 MiB, 91.58% compilation time)

From this thread I inferred that it was probably caused by my -> function declarations.
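For completeness, the timing setup on the Julia side looks roughly like this (FN is a placeholder for the path of one test file):

using BenchmarkTools

FN = "data/example.txt"  # placeholder: one of the ~30k data files

# Full statistics; this is where the ~8 ms mean comes from:
@benchmark [map(ss -> parse(Float64, ss), split(s, " ")[1:end-1])
            for s in filter(s -> s[1] !== '#', readlines(FN))]

# One-shot timing in global scope; this also pays for compiling the anonymous functions:
@time [map(ss -> parse(Float64, ss), split(s, " ")[1:end-1])
       for s in filter(s -> s[1] !== '#', readlines(FN))]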
Indeed, the -> declarations seem to be the culprit: if I replace my code with
filt = s -> s[1] !== '#';
pars = ss -> parse(Float64, ss);
res = [map(pars, split(s, " ")[1:end-1])
for s in filter(filt, readlines(filename))];
and time only the last line, I get a more encouraging
0.073007 seconds (60.58 k allocations: 7.988 MiB, 88.33% compilation time)
Hurray! However, this kind of defeats the purpose (at least as I understand it) of anonymous functions and might lead to a bunch of f1, f2, f3, ... definitions. Giving a name to my Python lambda outside the list comprehension does not seem to affect Python's runtime.
My question is: to get normal performance, should I systematically name my Julia functions? Note that this particular snippet is to be called in a loop over ~30k files. (Basically, what I am doing is reading files that are mixtures of space-separated floats and comment lines; each float line can have a different length, and I am not interested in the last element of the line. Any comments on my solutions are appreciated.)
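For context, the outer loop itself is nothing fancy; something along these lines, with filenames standing in for however the ~30k paths are collected:

filenames = readdir("data"; join=true)  # placeholder for the actual list of paths
all_results = map(filenames) do fn
    [map(pars, split(s, " ")[1:end-1])
     for s in filter(filt, readlines(fn))]
end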
(Side comment: wrapping s with strip, as in the variant shown just below, completely messes up @benchmark, adding 10 ms to the mean, but does not seem to affect @time. Any reason why?)
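That strip variant, with the same named helpers as above, reads:

res = [map(pars, split(strip(s), " ")[1:end-1])
       for s in filter(filt, readlines(filename))];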
Putting everything in a function, as suggested by DNF, fixes my "having to name my anonymous functions" problem. Using one of Vincent Yu's formulations:
function results(filename::String)::Vector{Vector{Float64}}
[[parse(Float64, s) for s in @view split(line, ' ')[1:end-1]]
for line in Iterators.filter(!startswith('#'), eachline(filename))]
end
@benchmark results(FN)
BenchmarkTools.Trial:
memory estimate: 3.74 MiB
allocs estimate: 1465
--------------
minimum time: 7.108 ms (0.00% GC)
median time: 7.458 ms (0.00% GC)
mean time: 7.580 ms (1.58% GC)
maximum time: 9.538 ms (14.84% GC)
--------------
samples: 659
evals/sample: 1
@time called on this function returns equivalent results after the first compilation run. I am happy with that.
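In other words, the usual pattern:

@time results(FN)   # first call: dominated by compilation of results
@time results(FN)   # second call: in line with the @benchmark numbers above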
However, my issue with strip persists:
function results_strip(filename::String)::Vector{Vector{Float64}}
[[parse(Float64, s) for s in @view split(strip(line), ' ')[1:end-1]]
for line in Iterators.filter(!startswith('#'), eachline(filename))]
end
@benchmark results_strip(FN)
BenchmarkTools.Trial:
memory estimate: 3.74 MiB
allocs estimate: 1465
--------------
minimum time: 15.155 ms (0.00% GC)
median time: 15.742 ms (0.00% GC)
mean time: 15.885 ms (0.75% GC)
maximum time: 19.089 ms (10.02% GC)
--------------
samples: 315
evals/sample: 1
The median time doubles. If I look at strip only:
function only_strip(filename::String)
[strip(line) for line in Iterators.filter(!startswith('#'), eachline(filename))]
end
@benchmark only_strip(FN)
BenchmarkTools.Trial:
memory estimate: 1.11 MiB
allocs estimate: 475
--------------
minimum time: 223.868 μs (0.00% GC)
median time: 258.227 μs (0.00% GC)
mean time: 325.389 μs (9.41% GC)
maximum time: 56.024 ms (75.09% GC)
--------------
samples: 10000
evals/sample: 1
The figures just do not add up. Could there be a type mismatch? Should I cast the results to something else?
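For what it's worth, this is the kind of check I would run to look for a type mismatch (purely diagnostic):

using InteractiveUtils  # for @code_warntype outside the REPL

line = first(Iterators.filter(!startswith('#'), eachline(FN)))
@show typeof(split(line, ' '))         # element type without strip
@show typeof(split(strip(line), ' '))  # does strip change what split returns?
@code_warntype results_strip(FN)       # any type instabilities in the strip version?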