The code below runs locally but not on the cluster. It hangs in the GroupReduceFunction and does not terminate even after hours (locally, the computation takes about 9 minutes on large data). The last message in the log is:
GroupReduce (GroupReduce at main(MyClass.java:80)) (1/1) (...) switched from DEPLOYING to RUNNING.
The code fragment:
DataSet<MyData1> myData1 = env.createInput(new UserDefinedFunctions.MyData1Set());
DataSet<MyData2> myData2 = DataSetUtils.sampleWithSize(myData1, false, 8, Long.MAX_VALUE)
    .reduceGroup(new GroupReduceFunction<MyData1, MyData2>() {
        @Override
        public void reduce(Iterable<MyData1> itrbl, Collector<MyData2> clctr) throws Exception {
            int id = 0;
            for (MyData1 myData1 : itrbl) {
                clctr.collect(new MyData2(id++, myData1));
            }
        }
    });
Any ideas how I could run this segment in parallel? Thanks in advance!
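
One possible direction, sketched here in plain Java rather than against the Flink API: the reducer above numbers all elements in a single task, but consecutive ids can also be assigned in two passes, first counting the elements per partition and then letting each partition number its own elements starting from the sum of the sizes before it. The names `Indexed`, `ParallelIdSketch`, and `assignIds` are hypothetical, and the nested lists are only a stand-in for the parallel slices of a DataSet. Flink's `DataSetUtils.zipWithIndex` appears to follow a similar count-then-offset idea and may be worth a look; this sketch makes no claim about why the job hangs on the cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-in for MyData2: an id paired with a payload.
final class Indexed<T> {
    final long id;
    final T value;

    Indexed(long id, T value) {
        this.id = id;
        this.value = value;
    }
}

final class ParallelIdSketch {
    // "partitions" is an assumed stand-in for the parallel slices of a DataSet.
    static <T> List<List<Indexed<T>>> assignIds(List<List<T>> partitions) {
        // Pass 1: the starting id of each partition is the total size
        // of all partitions that come before it.
        long[] offsets = new long[partitions.size()];
        long running = 0;
        for (int p = 0; p < partitions.size(); p++) {
            offsets[p] = running;
            running += partitions.get(p).size();
        }

        // Pass 2: every partition numbers its own elements independently,
        // so this step can run in parallel.
        return IntStream.range(0, partitions.size())
            .parallel()
            .mapToObj(p -> {
                List<T> part = partitions.get(p);
                List<Indexed<T>> out = new ArrayList<>(part.size());
                long id = offsets[p];
                for (T value : part) {
                    out.add(new Indexed<>(id++, value));
                }
                return out;
            })
            .collect(Collectors.toList());
    }
}
```

Only the first pass needs a global view, and it only looks at partition sizes, so the per-element work stays within each partition.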