
I have started trying out Apache Beam, using it to read and count an HBase table. Without the Count.globally step, the pipeline reads the rows fine, but when I add the count step, the process hangs and never exits.

Here is the very simple code:

Pipeline p = Pipeline.create(options);

p.apply("read", HBaseIO.read().withConfiguration(configuration).withTableId(HBASE_TABLE))
 .apply(ParDo.of(new DoFn<Result, String>() {
     @ProcessElement
     public void processElement(ProcessContext c) {
         Result result = c.element();
         String rowkey = Bytes.toString(result.getRow());
         System.out.println("row key: " + rowkey);
         c.output(rowkey);
     }
 }))
 .apply(Count.<String>globally())
 .apply("FormatResults", MapElements.via(new SimpleFunction<Long, String>() {
     @Override
     public String apply(Long element) {
         System.out.println("result: " + element.toString());
         return element.toString();
     }
 }));

p.run().waitUntilFinish();

With Count.globally in place, the process never finishes. When I comment it out, the process prints all the rows.

Any ideas?

David Wang

1 Answer


Which version of Beam are you using?

Thanks for reporting this issue. I tried to reproduce your case, and indeed there seems to be a problem with colliding versions of Guava that breaks transforms with HBaseIO. I sent a pull request to fix the shading of this; I will keep you updated once it is merged so you can test whether it works.

Thanks again.
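For context on what "fixing the shading" means here: a Guava version collision inside a library is typically resolved by relocating (shading) the conflicting packages at build time. Below is a hedged sketch of what such a maven-shade-plugin relocation looks like — the plugin and element names are the standard ones, but the exact relocation pattern and configuration in Beam's actual pom differ:

```xml
<!-- Illustrative sketch only: relocate the Guava classes bundled with the
     library so they cannot collide with the Guava version HBase pulls in.
     The shadedPattern below is a hypothetical package name. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.beam.repackaged.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With the relocation in place, the library's own copy of Guava lives under a renamed package, so HBase can load whatever Guava version it declares without classpath conflicts.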

iemejia
  • Thanks a lot. Please let me know when it is ready. We see Beam as the future of a new era of data processing. Thanks for the response – David Wang Apr 05 '17 at 01:49
  • Hi, my PR is already merged, so it should be OK now — can you try again? Btw, you need to use the latest Beam version 0.7.0-SNAPSHOT from Apache to get the jars with the fixes. Please tell me if it works for you. – iemejia Apr 05 '17 at 09:03
  • You should probably add this snapshot repository to your pom and be sure that it is included:

        <repository>
          <id>apache.snapshots</id>
          <name>Apache Development Snapshot Repository</name>
          <url>https://repository.apache.org/content/repositories/snapshots/</url>
          <releases><enabled>false</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>

    – iemejia Apr 05 '17 at 09:06
  • One extra thing: there is an open discussion on moving away from ByteString in the public APIs (HBaseIO uses it), so the code will probably evolve. I will keep you updated here when this happens. – iemejia Apr 05 '17 at 09:11
  • Hi, I tried the 0.7.0-SNAPSHOT build and still have the same problem, using the repository you mentioned. It still blocks on Count.globally; without the count, reading the HBase data is fine. I also pulled the source and built it myself, with the same result — the problem is still there. One other suggestion: could you add a Maven profile to build the Java SDK only? Right now it builds Java and Python together, which takes too long. Thanks a lot – David Wang Apr 06 '17 at 04:56
  • Hello, sorry for coming back to this so late. We are close to the first stable release of Beam, and I just wanted to confirm with you whether this is working correctly now. Note that the API of HBaseIO changed slightly to get rid of the external dependency on Protocol Buffers (as well as some parts of the SDK/runners) that I think were causing the problem you reported. Could you please test this and confirm whether it works correctly now? (If not, I will put it on my urgent TODO list.) Remember the new version is 2.0.0-SNAPSHOT. Thanks in advance. – iemejia May 12 '17 at 04:17
  • Hi, glad to hear from you. I tested Count.globally against 2.1.0-SNAPSHOT, and it works in local mode; I will try it against the Spark cluster next. Count.globally now works, but simple TextIO no longer does. I piped apply(TextIO.write().to("broker_log_linecount")) to write the counter result, and it throws the following exception: Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalStateException: Unable to find registrar for d ... Caused by: java.lang.IllegalStateException: Unable to find registrar for d – David Wang May 16 '17 at 04:27
  • Another suggestion: could the project add a Maven profile to build the Java SDK only, since we don't use Python at all? Also, are there any plans to support the Java 9 REPL, so code can be written and executed dynamically? – David Wang May 16 '17 at 04:33
  • I found the TextIO problem: TextIO doesn't work under the Windows file system. It works fine under Linux. – David Wang May 16 '17 at 08:58
  • Another comment: I had a hard time getting this simple program to use a Spark cluster. I ran mvn clean package -Pspark-runner, which creates the right jar, then used spark-submit. The job shows up in the Spark console, but the process blocks at: BlockManagerMasterEndpoint:54 - Registering block manager 192.168.2.100:31537 with 434.4 MB RAM, BlockManagerId, and cannot continue. Switching back to DirectRunner works. – David Wang May 16 '17 at 10:42
  • I tried the Flink runner, both local and on a single cluster; it works much better than the Spark runner. I think the Spark runner has a problem. – David Wang May 16 '17 at 16:13
  • Sorry again, somehow missed the notification on your answer. Good to know that HBaseIO works now. – iemejia Jun 05 '17 at 11:52
  • For the Python part you mention: yes, it is a bit annoying, but we have to support it, so the only way is to skip the module: mvn --projects '!sdks/python' clean compile test-compile – iemejia Jun 05 '17 at 11:52
  • For the TextIO issue, I believe there is an ongoing JIRA; apparently the problem is a misinterpretation of paths such as 'C:\', which is somehow resolved to a different file handler. – iemejia Jun 05 '17 at 11:54
  • I agree that running on Spark is more complex than it should be; we have to fix that too, but we were first trying to fix the submit process of the Spark runner to work on a cluster (right now you have to do this via spark-submit or it won't work). I will keep you posted if I see progress in that area. – iemejia Jun 05 '17 at 11:55
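Since the comments above note that the Spark runner currently only works when the job is launched via spark-submit, here is a hedged sketch of what such an invocation can look like. The class name, master URL, and jar name are hypothetical placeholders, not values from this thread; only --class and --master are standard spark-submit options, and --runner=SparkRunner is the standard Beam pipeline option:

```shell
# Illustrative sketch only: submit the bundled pipeline jar built with
# `mvn clean package -Pspark-runner` to a standalone Spark master.
# Class name, master URL, and jar path below are hypothetical.
spark-submit \
  --class com.example.HBaseRowCount \
  --master spark://192.168.2.100:7077 \
  target/my-pipeline-bundled.jar \
  --runner=SparkRunner
```

Everything after the jar path is passed to the pipeline's main method, which is how the Beam --runner option reaches PipelineOptions rather than Spark itself.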