
A few days ago I struggled with a strange bug that occurred in my MapReduce task.

It finally turned out that Hadoop's ValueIterable class, which implements the Iterable interface, creates a single iterator instance and returns it on every call to iterator().

protected class ValueIterable implements Iterable<VALUEIN> {
  private ValueIterator iterator = new ValueIterator();
  @Override
  public Iterator<VALUEIN> iterator() {
    return iterator;
  } 
}

That means that once you have iterated over a ValueIterable, you cannot iterate over it again.
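To illustrate the effect outside Hadoop, here is a minimal sketch. SingleShotIterable and SingleShotDemo are hypothetical names made up for this example, not Hadoop classes; the only thing they share with ValueIterable is that iterator() hands out the same instance every time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the behaviour described above:
// one iterator is stored and returned on every call to iterator().
class SingleShotIterable<T> implements Iterable<T> {
    private final Iterator<T> iterator;

    SingleShotIterable(Iterator<T> source) {
        this.iterator = source;
    }

    @Override
    public Iterator<T> iterator() {
        return iterator; // same instance every time, never rewound
    }
}

class SingleShotDemo {
    // Collects whatever one for-each pass over the iterable yields.
    static <T> List<T> drain(Iterable<T> values) {
        List<T> seen = new ArrayList<>();
        for (T value : values) {
            seen.add(value);
        }
        return seen;
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
                new SingleShotIterable<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(drain(values)); // prints [1, 2, 3]
        System.out.println(drain(values)); // prints []
    }
}
```

The second loop does not fail; it simply sees an already exhausted iterator, which is why the bug is so easy to miss.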

I checked the Java documentation, and it does not seem to require Iterable to return a different iterator on every call (or is the requirement just missing?). Digging deeper, I found this answer, which says that having a single iterator violates the Iterator contract, since it cannot traverse the collection more than once.

  1. Who is correct here? Should Iterable return a new iterator each time? Why are the Java docs unclear?

  2. What would be the correct way for this Hadoop class to tell the client that a second traversal is impossible? For example, if it threw IllegalStateException, would that violate the Iterator#hasNext() method contract?

AdamSkywalker
  • ValueIterator has a reset method. What does it do? Maybe that's what you need. – dgabriel Dec 22 '15 at 09:39
  • @DenisGavrus maybe it does, but client MapReduce code only sees the Iterable interface, and casting that Iterable to an inner Hadoop class is not a good idea. Besides, the question is more about the specification; this Hadoop class is just a starting point. – AdamSkywalker Dec 22 '15 at 10:02

1 Answer


From here:

The Iterator you receive from that Iterable's iterator() method is special. The values may not all be in memory; Hadoop may be streaming them from disk. They aren't really backed by a Collection, so it's nontrivial to allow multiple iterations.

There is no actual defined contract that states that each Iterator returned by Iterable.iterator() should repeat the same sequence. This is only a custom because it is expected behaviour.

Hadoop - or any other library - is therefore free to depart from this convention.

The Java docs are unclear for exactly this reason: to give implementors of Iterable the wiggle room to do it any way they want.

As for how you should handle it: as the other answers at the link mention, retain a list of the items you have already iterated so you can repeat the iteration later. Be warned, though, that this may be a huge collection in a live Hadoop environment, so this approach may well break.
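A minimal sketch of that caching idea, in plain Java with no Hadoop dependency. ValueCache and copyAll are hypothetical names introduced for illustration. The copier argument is there on the assumption that the source may hand back the same mutable object on every next() call, which is commonly reported for Hadoop reducers; check that against your own version before relying on it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.UnaryOperator;

class ValueCache {
    // Drains a single-pass Iterable into a list, copying each element on the way.
    // The whole sequence ends up in memory, so this only suits small groups.
    static <T> List<T> copyAll(Iterable<T> values, UnaryOperator<T> copier) {
        List<T> cached = new ArrayList<>();
        for (T value : values) {
            cached.add(copier.apply(value));
        }
        return cached;
    }

    public static void main(String[] args) {
        // A single-shot Iterable: the lambda returns the same iterator every time.
        Iterator<StringBuilder> once =
                Arrays.asList(new StringBuilder("a"), new StringBuilder("b")).iterator();
        Iterable<StringBuilder> values = () -> once;

        List<StringBuilder> cached = copyAll(values, v -> new StringBuilder(v));

        // The cached list can be traversed as often as needed.
        for (int pass = 1; pass <= 2; pass++) {
            for (StringBuilder v : cached) {
                System.out.println("pass " + pass + ": " + v);
            }
        }
    }
}
```

If the values for one key cannot fit in memory, this sketch is the wrong tool, and restructuring the job so that a single pass is enough is likely the safer route.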

OldCurmudgeon
  • The real problem is not about fixing this particular issue; I've already done that. I am worried because nothing signalled that a second iteration is not possible, and I just got unexpected behavior. I'd like to know who is at fault here and how it could be fixed. – AdamSkywalker Dec 22 '15 at 10:54
  • @AdamSkywalker - I'd say the mistake was made either by the Hadoop documentation or by you for not reading it, i.e. either the docs are unclear or you didn't read them carefully enough. – OldCurmudgeon Dec 22 '15 at 11:00