tl;dr answer:
let panic_lambda =
    |x: Result<DataRowStruct, csv::Error>| { x.expect("bad data") };
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda).collect();
As a general comment, I have noticed that a lot of Rust code becomes really quite difficult to understand as soon as iterators and Results are combined.
For example, with the CSV package I previously saw this example, which by the standards of things I have seen elsewhere isn't too bad, but still isn't wonderful. (This is my friend's solution, by the way. I know him personally, and while it works, I think we can do better.)
let data_table: BTreeMap<IdType, DataRowStruct> = csv_reader
    .deserialize()
    .flatten()
    .map(|data_row: DataRowStruct| {
        assert!(... something);
        assert!(... something);
        (data_row.id, data_row)
    })
    .collect();

// elsewhere:
struct DataRowStruct {
    pub something: Type,
    ... others
}
Can you understand what this is trying to do? I can't - at least not easily.
The code in this example isn't readable or clear because it tries to do too many things at once.
Let's count them:
- load a file from disk, which happens to be a CSV format text file
- use the CSV Reader to deserialize the data into a predefined structure, defined by DataRowStruct
- call flatten(), whatever that does (?!) (see note)
- apply a lambda function (closure) using map, which does some stuff including assert! and some other stuff with parentheses
- we can't really tell what the lambda function (closure) does without reading it slowly and carefully inspecting where the brackets and parentheses are
- finally, call collect, and we know what that does (builds a collection from an iterator)
Note: flatten, I discovered after some searching, has the effect of silently skipping rows that fail to deserialize (blank lines and bad data). But this isn't at all obvious from the above code, because the usual purpose of flatten is to take things such as nested collections and "flatten" them into a single collection with no nesting.
That's something like 6 responsibilities, all in a single line of code. (A single line of code split over multiple actual lines of text in the IDE, but it's still a single line as seen by the compiler.)
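To see why flatten behaves that way here, consider a minimal standalone sketch (unrelated to the CSV types above): Result implements IntoIterator, yielding one item for an Ok and no items for an Err, so flatten() quietly discards every error.
fn main() {
    // Ok yields one item, Err yields none, so flatten() drops the errors silently.
    let rows: Vec<Result<i32, String>> = vec![Ok(1), Err("bad row".to_string()), Ok(3)];
    let kept: Vec<i32> = rows.into_iter().flatten().collect();
    assert_eq!(kept, vec![1, 3]); // the Err value has vanished without a trace
}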
By contrast, this is what I came up with after some thinking about it.
let panic_lambda =
    |x: Result<DataRowStruct, csv::Error>| { x.expect("bad data") };
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda).collect();
That has got to be clearer; it is perhaps even obvious what it does.
- By defining the lambda function (closure) separately, we take some "complex" (to read) behavior and package it behind a name, which is easy to read when it appears inside a call to .map().

We can see where the behavior (panic_lambda) is defined, and we can then see where it is used.
The actual series of function calls to create the final structure we want is much simplified. We now only have 3 things occurring in a single line:
- deserialization of data into a defined structure
- a call to map to handle error cases
- a call to collect to convert an iterator into a std::collection container
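One thing that is easy to miss: for this version to compile, DataRowStruct needs a couple of trait implementations. A minimal sketch is below; the field names are purely illustrative (the full struct isn't shown above), but the point is that deserialize() needs serde's Deserialize, and BTreeSet needs Ord.
use serde::Deserialize;

// Illustrative fields only; the real struct is not shown in full above.
// Deserialize is required by csv_reader.deserialize(), and Ord (plus the
// traits it implies) is required to store the rows in a BTreeSet.
#[derive(Debug, Deserialize, PartialEq, Eq, PartialOrd, Ord)]
struct DataRowStruct {
    pub id: u64,
    pub something: String,
}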
Also note that I swapped out the BTreeMap for a BTreeSet. There are two reasons for this:
- When we read data from CSV, we get whole rows of data, not rows of data plus some additional id type which can be used as a key. This implies the correct container is a set, not a map.
- If we later want a map with a key, we can convert the set to a map with a key. But that is a separate responsibility and behavior which should be handled elsewhere, probably with a specific function to do it.
This also has the effect of making the code more readable. We eliminated these lines of code:
assert!(... something);
assert!(... something);
(data_row.id, data_row)
The purpose of those lines of code was to do the following:
- Perform an assertion to check data for consistency
- Create a pair (tuple) with a key and a value, which could be used to construct the previous BTreeMap
We can, of course, still perform these behaviors, but they are better suited to be placed in a function, elsewhere.
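As a sketch of what that function might look like (reusing the illustrative fields from the struct sketch above; IdType and the consistency check are placeholders, since the original assertions aren't shown):
use std::collections::{BTreeMap, BTreeSet};

type IdType = u64; // placeholder key type, matching the illustrative struct above

// Validate each row and build the keyed map as a separate, named step.
fn into_keyed_map(rows: BTreeSet<DataRowStruct>) -> BTreeMap<IdType, DataRowStruct> {
    rows.into_iter()
        .map(|row| {
            // stand-in for whatever the original assert!(...) checks were
            assert!(!row.something.is_empty(), "inconsistent row: {:?}", row);
            (row.id, row)
        })
        .collect()
}
The validation and the choice of key now live in one clearly named place, and the CSV-loading line stays short.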
Discussion: Is the resulting code slow?
There is a final consideration to be made here. The original code which I presented was in some sense "over-optimized" (or, to be generous, perhaps just "optimized").
By dealing with only a single container type, rather than using a separate BTreeSet as an intermediary, it is presumably faster. (I say presumably because I haven't measured the performance difference; without doing so, this discussion is really speculative and a bit pointless.)
However, premature optimization is the root of all evil:
- We don't yet know if this code is in a hot loop or a bottleneck for our particular engineering application
- Without knowing that, we shouldn't be writing code which is hard to read, understand or modify, just to save a few milliseconds of runtime, when this might be running alongside some other algorithm which takes hours to complete
This answer started while I was searching for ways to handle iterators which return Result types. For the particular context which was of interest to me, this came about from using the (currently) most popular Rust CSV library.
While attempting to find a solution, I encountered a lot of obscure code which was hard to understand and not at all intuitive.
In this answer I took a concrete example and explained two things:
- How to deal with this situation, and related situations, where the developer encounters a Result type returned from an iterator and wishes to collect it into a container
- How to write the code to handle this in a clear way which is easy(ier) to read and understand
Other developers will of course have slightly different requirements. For example, many will not wish to simply panic! when bad data is encountered. However, the basic ideas here can be extended to provide alternative error handling.
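One common alternative, sketched below under the assumption that the reader is a csv::Reader over a File (the original doesn't show its exact type), is to collect into a Result, so the first error is returned to the caller instead of aborting the program:
use std::collections::BTreeSet;
use std::fs::File;

// collect() on an iterator of Results stops at the first Err and returns it,
// so the caller decides what to do with bad data instead of panicking here.
fn load_table(csv_reader: &mut csv::Reader<File>) -> Result<BTreeSet<DataRowStruct>, csv::Error> {
    csv_reader.deserialize().collect()
}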
In the specific case I am working with, it makes no sense for the application to attempt to run if the input data is "bad" or unreadable. So it should just abort. Hence panic! is the correct behavior in this particular instance.
Bonus: One final advantage:
Say our file does contain bad data, and I didn't expect that. It would be convenient to be able to add more logging.
By defining the lambda separately, this is trivial:
let panic_lambda_debug =
    |x: Result<DataRowStruct, csv::Error>| {
        match x {
            Err(exception) => {
                println!("error: {}", exception);
                panic!("bad data");
            }
            Ok(value) => value,
        }
    };
We can swap between the two behaviors defined by panic_lambda and panic_lambda_debug very easily: we just change a single word in the source code.
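For example, to turn on the extra logging, only the closure name passed to .map() changes:
let data_table: std::collections::BTreeSet<DataRowStruct> =
    csv_reader.deserialize().map(panic_lambda_debug).collect();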