tl;dr answer:
let panic_lambda =
    |x: Result<DataRowStruct, csv::Error>| { x.expect("bad data") };
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda).collect();
As a general comment, I have noticed that a lot of Rust code becomes really quite difficult to understand as soon as iterators and Results are combined.
For example, with the CSV package I previously saw this example, which by the standards of things I have seen elsewhere isn't too bad, but still isn't wonderful. (This is my friend's solution, by the way. I know him personally, and while it works, I think we can do better.)
let data_table: BTreeMap<IdType, DataRowStruct> = csv_reader
    .deserialize()
    .flatten()
    .map(|data_row: DataRowStruct| {
        assert!(... something);
        assert!(... something);
        (data_row.id, data_row)
    })
    .collect();

// elsewhere:
struct DataRowStruct {
    pub something: Type,
    ... others
}
Can you understand what this is trying to do? I can't - at least not easily.
The code in this example isn't readable or clear because it tries to do too many things at once.
Let's count them:
- load a file from disk, which happens to be a CSV format text file
- use the CSV Reader to deserialize the data into a predefined structure, defined by DataRowStruct
- call flatten(), whatever that does (?!) (see note)
- apply a lambda function (closure) using map, which does some stuff including assert! and some other stuff with parentheses
- we can't really tell what the lambda function (closure) does without reading it slowly and carefully inspecting where the brackets and parentheses are
- finally, call collect, and we know what that does (builds a collection from an iterator)
Note: flatten, I discovered after some searching, has the effect of silently skipping rows that fail to deserialize (blank lines and bad data). But this isn't at all obvious from the above code, because the usual purpose of flatten is to take things such as nested collections and "flatten" them into a single collection with no nesting.
That's something like 6 responsibilities, all in a single line of code. (A single line of code split over multiple actual lines of text in the IDE, but it's still a single line as seen by the compiler.)
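To see why flatten behaves that way here, consider a minimal standalone sketch (unrelated to the CSV types above): Result implements IntoIterator, yielding one item for an Ok and no items for an Err, so flatten() quietly discards every error.
fn main() {
    // Ok yields one item, Err yields none, so flatten() drops the errors silently.
    let rows: Vec<Result<i32, String>> = vec![Ok(1), Err("bad row".to_string()), Ok(3)];
    let kept: Vec<i32> = rows.into_iter().flatten().collect();
    assert_eq!(kept, vec![1, 3]); // the Err value has vanished without a trace
}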
By contrast, this is what I came up with after some thinking about it.
let panic_lambda =
    |x: Result<DataRowStruct, csv::Error>| { x.expect("bad data") };
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda).collect();
That has got to be clearer; it is perhaps even obvious what it does.
- By defining the lambda function (closure) separately, we take some "complex" (to read) behavior and package it behind a name, which is easy to read when it appears inside a call to .map().

We can see where the behavior (panic_lambda) is defined, and we can then see where it is used.
The actual series of function calls to create the final structure we want is much simplified. We now only have 3 things occurring in a single line:
- deserialization of data into a defined structure
- a call to map to handle error cases
- a call to collect to convert an iterator into a std::collection container
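One thing that is easy to miss: for this version to compile, DataRowStruct needs a couple of trait implementations. A minimal sketch is below; the field names are purely illustrative (the full struct isn't shown above), but the point is that deserialize() needs serde's Deserialize, and BTreeSet needs Ord.
use serde::Deserialize;

// Illustrative fields only; the real struct is not shown in full above.
// Deserialize is required by csv_reader.deserialize(), and Ord (plus the
// traits it implies) is required to store the rows in a BTreeSet.
#[derive(Debug, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
struct DataRowStruct {
    pub id: u64,
    pub something: String,
}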
Also note that I swapped out the BTreeMap for a BTreeSet. There are two reasons for this:
- When we read data from CSV, we get whole rows of data, not rows of data plus some additional id type which can be used as a key. This implies the correct container is a set, not a map.
- If we later want a map with a key, we can convert the set to a map with a key. But that is a separate responsibility and behavior which should be handled elsewhere, probably with a specific function to do it.
This also has the effect of making the code more readable. We eliminated these lines of code:
assert!(... something);
assert!(... something);
(data_row.id, data_row)
The purpose of those lines of code was to do the following:
- Perform an assertion to check data for consistency
- Create a pair (tuple) with a key and a value, which could be used to construct the previous BTreeMap
We can, of course, still perform these behaviors, but they are better suited to be placed in a function, elsewhere.
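As a sketch of what that function might look like (reusing the illustrative fields from the struct sketch above; IdType and the consistency check are placeholders, since the original assertions aren't shown):
use std::collections::{BTreeMap, BTreeSet};

type IdType = u64; // placeholder key type, matching the illustrative struct above

// Validate each row and build the keyed map as a separate, named step.
fn into_keyed_map(rows: BTreeSet<DataRowStruct>) -> BTreeMap<IdType, DataRowStruct> {
    rows.into_iter()
        .map(|row| {
            // stand-in for whatever the original assert!(...) checks were
            assert!(!row.something.is_empty(), "inconsistent row: {:?}", row);
            (row.id, row)
        })
        .collect()
}
The validation and the choice of key now live in one clearly named place, and the CSV-loading line stays short.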
Discussion: Is the resulting code slow?
There is a final consideration to be made here. The original code which I presented was in some sense "over-optimized" (or, to be generous, perhaps just "optimized").
By dealing with only a single container type, rather than using a separate BTreeSet as an intermediary, it is presumably faster. (I say presumably because I haven't measured the performance difference; without doing so, this discussion is really speculative and a bit pointless.)
However, premature optimization is the root of all evil:
- We don't yet know if this code is in a hot loop or a bottleneck for our particular engineering application
- Without knowing that, we shouldn't be writing code which is hard to read, understand or modify, just to save a few milliseconds of runtime, when this might be running alongside some other algorithm which takes hours to complete
This answer started while I was searching for ways to handle iterators which return Result types. For the particular context which was of interest to me, this came about from using the (currently) most popular Rust CSV library.
While attempting to find a solution, I encountered a lot of obscure code which was hard to understand and not at all intuitive.
In this answer I took a concrete example and explained two things:
- How to deal with this situation, and related situations, where the developer encounters a Result type returned from an iterator and wishes to collect it into a container
- How to write the code to handle this in a clear way which is easy(ier) to read and understand
Other developers will of course have slightly different requirements. For example, many will not wish to simply panic! when bad data is encountered. However, the basic ideas here can be extended to provide alternative error handling.
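One common alternative, sketched below under the assumption that the reader is a csv::Reader over a File (the original doesn't show its exact type), is to collect into a Result, so the first error is returned to the caller instead of aborting the program:
use std::collections::BTreeSet;
use std::fs::File;

// collect() on an iterator of Results stops at the first Err and returns it,
// so the caller decides what to do with bad data instead of panicking here.
fn load_table(csv_reader: &mut csv::Reader<File>) -> Result<BTreeSet<DataRowStruct>, csv::Error> {
    csv_reader.deserialize().collect()
}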
In the specific case I am working with, it makes no sense for the application to attempt to run if the input data is "bad" or unreadable. So it should just abort. Hence panic! is the correct behavior in this particular instance.
Bonus: One final advantage:
Say our file does contain bad data, and I didn't expect that. It would be convenient to be able to add more logging.
By defining the lambda separately, this is trivial:
let panic_lambda_debug =
    |x: Result<DataRowStruct, csv::Error>| {
        match x {
            Err(exception) => {
                println!("error: {}", exception);
                panic!("bad data");
            }
            Ok(value) => value,
        }
    };
We can swap between the two behaviors defined by panic_lambda and panic_lambda_debug very easily: we just change a single word in the source code.
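For example, to turn on the extra logging, only the closure name passed to .map() changes:
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda_debug).collect();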