9

I read a flat file (for example, a .csv file with one line per user, e.g. UserId;Data1;Date2).

But how can I handle duplicate User items in the reader, where there is no list of previously read users?

stepBuilderFactory.get("createUserStep1")
.<User, User>chunk(1000)
.reader(flatFileItemReader) // FlatFileItemReader
.writer(itemWriter) // For example JDBC Writer
.build();
Aure77

3 Answers

24

Filtering is typically done with an ItemProcessor. If the ItemProcessor returns null, the item is filtered and not passed to the ItemWriter. Otherwise, it is. In your case, you could keep a list of previously seen users in the ItemProcessor. If the user hasn't been seen before, pass it on. If it has been seen before, return null. You can read more about filtering with an ItemProcessor in the documentation here: https://docs.spring.io/spring-batch/docs/current/reference/html/processor.html#filteringRecords

import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.item.ItemProcessor;

/**
 * This implementation assumes that there is enough room in memory to store every
 * distinct User seen so far.  Otherwise, you'd want to store them somewhere you can do a look-up on.
 */
public class UserFilterItemProcessor implements ItemProcessor<User, User> {

    // This assumes that User.equals() identifies the duplicates
    private final Set<User> seenUsers = new HashSet<User>();

    @Override
    public User process(User user) {
        // Returning null tells Spring Batch to filter the item out
        if (seenUsers.contains(user)) {
            return null;
        }
        seenUsers.add(user);
        return user;
    }
}
rochb
Michael Minella
  • After looking through my previous questions on Stack Overflow, I found my solution (the one you describe): http://stackoverflow.com/a/26318180/1121571 Is keeping a list in the item processor the best solution? A list is passed to the ItemWriter, so where is it stored internally, and how can I access it properly? – Aure77 Dec 05 '14 at 16:00
  • This would be a custom `ItemProcessor` implementation, so it's up to you where to store the previously seen users. – Michael Minella Dec 05 '14 at 23:12
  • How can I keep the last entry? – Clement Martino Oct 08 '15 at 07:58
  • How can I reset the Set? Won't it lead to memory leaks? @MichaelMinella – Blanca Hdez Nov 27 '17 at 07:53
  • @BlancaHdez You can use a listener or implement the `ItemStream` interface and reset the `Set` in the `close` method. – Michael Minella Nov 27 '17 at 15:49
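
The reset suggested in the last comment could look roughly like the sketch below. To keep it compilable on its own, the nested `ItemProcessor` and `ItemStream` interfaces are simplified stand-ins for the Spring Batch ones (the real `ItemStream` also declares `open(ExecutionContext)` and `update(ExecutionContext)`), and `User` is a hypothetical class keyed on `userId`. Treat it as an illustration of the idea, not as the framework's required wiring.

```java
import java.util.HashSet;
import java.util.Set;

public class ResettableFilterSketch {

    // Simplified stand-ins (assumption): in a real job, import the Spring Batch
    // ItemProcessor and ItemStream interfaces instead of declaring these.
    interface ItemProcessor<I, O> { O process(I item) throws Exception; }
    interface ItemStream { void close(); }

    // Hypothetical User whose identity is its userId.
    static final class User {
        private final String userId;
        User(String userId) { this.userId = userId; }
        @Override public boolean equals(Object o) {
            return o instanceof User && ((User) o).userId.equals(userId);
        }
        @Override public int hashCode() { return userId.hashCode(); }
    }

    static final class UserFilterItemProcessor implements ItemProcessor<User, User>, ItemStream {
        private final Set<User> seenUsers = new HashSet<>();

        @Override
        public User process(User user) {
            // Set.add returns false when an equal element is already present.
            return seenUsers.add(user) ? user : null;
        }

        @Override
        public void close() {
            // Drop the cache so the Set does not outlive the step or leak into a later run.
            seenUsers.clear();
        }

        int seenCount() { return seenUsers.size(); }
    }

    public static void main(String[] args) {
        UserFilterItemProcessor processor = new UserFilterItemProcessor();
        System.out.println("first kept: " + (processor.process(new User("u1")) != null));
        System.out.println("duplicate kept: " + (processor.process(new User("u1")) != null));
        processor.close(); // here called by hand; in a job the framework would call it
        System.out.println("cached after close: " + processor.seenCount());
    }
}
```

Whether `close` is actually invoked depends on the processor being registered as a stream with the step, so that part is worth checking against the Spring Batch version in use.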
6

As the documentation explains (http://docs.spring.io/spring-batch/trunk/reference/html/readersAndWriters.html#faultTolerant):

When a chunk is rolled back, items that have been cached during reading may be reprocessed. If a step is configured to be fault tolerant (uses skip or retry processing typically), any ItemProcessor used should be implemented in a way that is idempotent

This means that in Michael's example, the first time a user is processed, it is cached in the Set. If writing the item then fails and the step is fault tolerant, the processor is executed again for the same user, and this time the filter drops it as if it were a duplicate.

Improved code:

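import java.util.HashSet;
import java.util.Set;

import org.springframework.batch.item.ItemProcessor;

/**
 * This implementation assumes that there is enough room in memory to store every
 * distinct User seen so far.  Otherwise, you'd want to store them somewhere you can do a look-up on.
 * It also assumes that User carries a "processed" flag that is not part of equals().
 */
public class UserFilterItemProcessor implements ItemProcessor<User, User> {

    // This assumes that User.equals() identifies the duplicates
    private final Set<User> seenUsers = new HashSet<User>();

    @Override
    public User process(User user) {
        // A true duplicate is an equal User that this processor has not handled yet
        if (seenUsers.contains(user) && !user.hasBeenProcessed()) {
            return null;
        } else {
            // First sighting, or the same instance being reprocessed after a rollback
            seenUsers.add(user);
            user.setProcessed(true);
            return user;
        }
    }
}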
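
To see why the extra flag matters, the following standalone sketch simulates a retry by passing the same `User` instance through each filter twice. The `User` class with `hasBeenProcessed()` and `setProcessed()` is a hypothetical shape inferred from the code above, and no Spring Batch classes are involved; it only mimics what a fault-tolerant step does when it reprocesses a cached item.

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentFilterSketch {

    // Hypothetical User: equality uses userId only, never the processed flag.
    static final class User {
        private final String userId;
        private boolean processed;
        User(String userId) { this.userId = userId; }
        boolean hasBeenProcessed() { return processed; }
        void setProcessed(boolean processed) { this.processed = processed; }
        @Override public boolean equals(Object o) {
            return o instanceof User && ((User) o).userId.equals(userId);
        }
        @Override public int hashCode() { return userId.hashCode(); }
    }

    // Filter without the flag: a second pass over the same item drops it.
    static final class NaiveFilter {
        private final Set<User> seenUsers = new HashSet<>();
        User process(User user) {
            if (seenUsers.contains(user)) {
                return null;
            }
            seenUsers.add(user);
            return user;
        }
    }

    // Filter with the flag: a second pass over the same instance keeps it.
    static final class FlaggedFilter {
        private final Set<User> seenUsers = new HashSet<>();
        User process(User user) {
            if (seenUsers.contains(user) && !user.hasBeenProcessed()) {
                return null;
            }
            seenUsers.add(user);
            user.setProcessed(true);
            return user;
        }
    }

    public static void main(String[] args) {
        User cached = new User("u1");

        NaiveFilter naive = new NaiveFilter();
        naive.process(cached);
        System.out.println("naive retry kept item: " + (naive.process(cached) != null));

        FlaggedFilter flagged = new FlaggedFilter();
        flagged.process(cached);
        System.out.println("flagged retry kept item: " + (flagged.process(cached) != null));
        System.out.println("flagged real duplicate kept: " + (flagged.process(new User("u1")) != null));
    }
}
```

The flag only helps when the retried item is the same object that was processed the first time; if the step re-reads the line into a new `User`, it would still be treated as a duplicate.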
josee
0

You can override the equals() and hashCode() methods of User; then you can remove the "contains" condition.
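
One reading of that suggestion is sketched below. The `User` fields are hypothetical, assumed from the `UserId;Data1;Date2` layout in the question, and equality is based on `userId` alone. With `equals()` and `hashCode()` defined that way, `Set.add` already reports whether the user was new, so a separate `contains` check is not needed.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class UserEqualitySketch {

    // Hypothetical User: two users are duplicates when their userId matches,
    // regardless of the other columns.
    static final class User {
        private final String userId;
        private final String data1;

        User(String userId, String data1) {
            this.userId = userId;
            this.data1 = data1;
        }

        String getData1() { return data1; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof User)) return false;
            return Objects.equals(userId, ((User) o).userId);
        }

        @Override
        public int hashCode() {
            // Must use the same fields as equals() so equal users share a bucket.
            return Objects.hash(userId);
        }
    }

    // Returns the user on first sighting and null for a duplicate,
    // mirroring the null-means-filtered convention of an ItemProcessor.
    static User filter(Set<User> seenUsers, User user) {
        return seenUsers.add(user) ? user : null;
    }

    public static void main(String[] args) {
        Set<User> seen = new HashSet<>();
        System.out.println("first kept: " + (filter(seen, new User("42", "a")) != null));
        System.out.println("same id kept: " + (filter(seen, new User("42", "b")) != null));
        System.out.println("other id kept: " + (filter(seen, new User("43", "a")) != null));
    }
}
```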

Eric Aya