
I have a method that takes, as a parameter, a List of URLs pointing to remote files that should be downloaded. The method returns a List of another type (called Attachment), which holds a property of Java File type. For this case I used the Java Stream API to iterate over the URLs and start each download inside a "map" function, which then returns the Attachment instance.

Now my question: am I abusing the Java Stream API for things it's not meant for, like putting long-running tasks into it? Should I only do small operations on the input data?

The only minus I see right now is that it is a bit harder to test.

private List<Attachment> download(List<URL> attachments) {
    return attachments.stream().map(attachmentUrl -> {
        try {
            Attachment attachment = new Attachment();
            File attachmentFile = new File(getFilename(attachment.getAttachmentId(), attachmentUrl));
            FileUtils.copyURLToFile(
                    attachmentUrl,
                    attachmentFile,
                    CONNECT_TIMEOUT,
                    READ_TIMEOUT);
            attachment.setAttachmentFile(attachmentFile);
            return attachment;
        } catch (IOException e) {
            e.printStackTrace();
            LOGGER.error(e.getLocalizedMessage());
        }
        return null;
    }).filter(Objects::nonNull).collect(Collectors.toList());
}
moh

  • The abuse here is not the long operation. It's just the lack of readability. Put the internal operation into a couple of methods, e.g. one that converts a URL to a file and handles the exception, and one that creates an attachment from it, and then use these as map operations. – RealSkeptic Oct 28 '19 at 17:06
  • "The only minus I see right now is that it is a bit harder to test." As opposed to an implementation of something like a for loop? Personally, I don't see how testing would be different between the two – Knox Oct 28 '19 at 17:10
  • Are there any reasons that you are not doing this asynchronously and parallel? Just asking because of performance, if you are having second-guesses because of "long runningness". – D. Kovács Oct 28 '19 at 17:19
  • @D.Kovács the call to this function runs in its own thread triggered by a Spring-Boot Kafka Listener. I actually used the stream() as parallelStream() but it got bad rating in the Code-Review – moh Oct 29 '19 at 07:22
  • 1
  • @moh As it should :) cf. this gist: https://gist.github.com/AFulgens/ba1fec3235cfda1269550fb8e9793db3 But if your concern really is that it is long running, then it may have merit to parallelize it and off-load the actual task into different threads. However, I agree with RealSkeptic: a plain old enhanced for-loop is much more readable in this scenario (and debuggable!) than streams. – D. Kovács Oct 29 '19 at 08:24
  • About parallelization: be careful. I'll update my answer below. – MyStackRunnethOver Oct 29 '19 at 17:09
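
For reference, a minimal sketch of the enhanced for-loop alternative suggested in the comments, reusing the question's own helpers and constants (Attachment, getFilename, FileUtils, CONNECT_TIMEOUT, READ_TIMEOUT, LOGGER); it is not the poster's actual code, and it keeps the same skip-on-failure behaviour as the stream version:

private List<Attachment> downloadWithLoop(List<URL> attachmentUrls) {
    List<Attachment> attachments = new ArrayList<>();
    for (URL attachmentUrl : attachmentUrls) {
        try {
            Attachment attachment = new Attachment();
            File attachmentFile = new File(getFilename(attachment.getAttachmentId(), attachmentUrl));
            FileUtils.copyURLToFile(attachmentUrl, attachmentFile, CONNECT_TIMEOUT, READ_TIMEOUT);
            attachment.setAttachmentFile(attachmentFile);
            attachments.add(attachment);
        } catch (IOException e) {
            // Failed downloads are skipped, mirroring filter(Objects::nonNull) in the stream version.
            LOGGER.error(e.getLocalizedMessage(), e);
        }
    }
    return attachments;
}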

2 Answers


Streams are a very elegant tool for dealing with data in a functional-programming manner: the same input results in the same output with no side effects, which makes your code less error prone and more readable. So there is no abuse in terms of usage, no matter the size of the input. You can use parallel streams if you expect to deal with a huge amount of data. However, your implementation could use a little cleanup: don't delegate all the business logic to a single map operation. Make it more granular and spread the logic across several mappers. You can declare a mapper like any variable, e.g. Function<URL, File> urlToFileMapper = url -> {...}, and plug it into the stream: attachments.stream().map(urlToFileMapper).map(anotherDeclaredMapper)...
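
A minimal sketch of that shape, reusing the question's constants and logger. The mapper names are illustrative, and getFilename is assumed here to have an overload that takes only the URL, which is not in the original code:

Function<URL, File> urlToFileMapper = url -> {
    try {
        // Hypothetical getFilename overload deriving a target name from the URL alone.
        File target = new File(getFilename(url));
        FileUtils.copyURLToFile(url, target, CONNECT_TIMEOUT, READ_TIMEOUT);
        return target;
    } catch (IOException e) {
        LOGGER.error(e.getLocalizedMessage(), e);
        return null; // filtered out below
    }
};

Function<File, Attachment> fileToAttachmentMapper = file -> {
    Attachment attachment = new Attachment();
    attachment.setAttachmentFile(file);
    return attachment;
};

List<Attachment> result = attachments.stream()
        .map(urlToFileMapper)
        .filter(Objects::nonNull)
        .map(fileToAttachmentMapper)
        .collect(Collectors.toList());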

hussein.ag
  • I have the constraint of 25 attachments that can be downloaded. Do you consider this as worth using parallel streams? My original implementation used it, but it was rejected in Code-Review. – moh Oct 29 '19 at 07:27

I think it might be helpful to think about map and other functional constructs (filter, reduce, etc.) not so much as functions, but rather as syntax. stream().map() is syntax that performs the functional equivalent of a for loop. Asking "am I abusing this syntax because of what I use it to execute?" is then less meaningful: for loops do not care how long the tasks they run on each iteration take, and neither does map. It is agnostic to the operation it applies, so the only question is whether you are using the syntax appropriately, i.e. to loop over a collection, mapping from something to something.

In this context, where map is syntax, your desired operations are perfectly alright. However, your implementation could be cleaned up a bit.

attachmentUrl -> {
    try {
        Attachment attachment = new Attachment();
        File attachmentFile = new File(getFilename(attachment.getAttachmentId(), attachmentUrl));
        FileUtils.copyURLToFile(
                attachmentUrl,
                attachmentFile,
                CONNECT_TIMEOUT,
                READ_TIMEOUT);
        attachment.setAttachmentFile(attachmentFile);
        return attachment;
    } catch (IOException e) {
        e.printStackTrace();
        LOGGER.error(e.getLocalizedMessage());
    }
    return null;
}

That lambda is a tad big for an inlined map operation. In general, I tend to be skeptical, though not always disapproving, of any map lambda that requires curly braces, i.e. that takes up more than one line. I'd suggest refactoring this lambda into a named function, or possibly a couple, which are either nested (map(this::A), where A then calls B) or used serially by your stream operations: map(this::A).map(this::B), as sketched below.
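
A rough sketch of the serial variant with named private methods and method references, mirroring the Function-based sketch above. The method names are illustrative, and getFilename is again assumed to have a URL-only overload:

private List<Attachment> download(List<URL> attachmentUrls) {
    return attachmentUrls.stream()
            .map(this::downloadToFile)   // URL -> File (null on failure)
            .filter(Objects::nonNull)
            .map(this::toAttachment)     // File -> Attachment
            .collect(Collectors.toList());
}

private File downloadToFile(URL attachmentUrl) {
    try {
        // Hypothetical getFilename overload taking only the URL.
        File attachmentFile = new File(getFilename(attachmentUrl));
        FileUtils.copyURLToFile(attachmentUrl, attachmentFile, CONNECT_TIMEOUT, READ_TIMEOUT);
        return attachmentFile;
    } catch (IOException e) {
        LOGGER.error(e.getLocalizedMessage(), e);
        return null;
    }
}

private Attachment toAttachment(File attachmentFile) {
    Attachment attachment = new Attachment();
    attachment.setAttachmentFile(attachmentFile);
    return attachment;
}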


[EDIT:] Regarding parallelizing your stream: keep in mind that you're doing more than just CPU processing as part of this method - you seem to be doing network IO and file IO. If you execute in parallel, you'll be parallelizing not just your CPU utilization but also your network and disk usage. If network or disk is the dominant factor, not CPU, then parallelization will likely get you very little, and may make things worse. In general, more threads != faster network or disk read/write. You might find this question on parallel IO useful.
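
If parallelism does turn out to help despite the IO bound, one alternative to parallelStream() is an explicitly bounded pool, so the downloads don't compete with other work on the common fork-join pool. A sketch, reusing the hypothetical downloadToFile and toAttachment helpers from the earlier sketch (the pool size is an arbitrary example to tune):

private List<Attachment> downloadInParallel(List<URL> attachmentUrls) {
    ExecutorService downloadPool = Executors.newFixedThreadPool(4); // arbitrary; tune for your network/disk
    try {
        // Kick off all downloads on the dedicated pool...
        List<CompletableFuture<File>> futures = attachmentUrls.stream()
                .map(url -> CompletableFuture.supplyAsync(() -> downloadToFile(url), downloadPool))
                .collect(Collectors.toList());
        // ...then wait for them and build the result, skipping failures.
        return futures.stream()
                .map(CompletableFuture::join)
                .filter(Objects::nonNull)
                .map(this::toAttachment)
                .collect(Collectors.toList());
    } finally {
        downloadPool.shutdown();
    }
}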

MyStackRunnethOver