
I am trying to bulk load data into HBase using the salted table approach described in this post: https://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/. I am able to insert data, but at random times I get:

ERROR mapreduce.LoadIncrementalHFiles: IOException during splitting java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File does not exist: /user//hfile/transactions/transaction_data/b1c6c47856104db0a1289c2b7234d1d7

I am using hbase-client 1.2.0 and hbase-server 1.2.0 as dependencies, and the HFileOutputFormat2 class to write my HFiles.
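
For reference, the dependency declarations look roughly like this (a minimal sketch, assuming an sbt build; the coordinates are the standard org.apache.hbase artifacts):

libraryDependencies ++= Seq(
  "org.apache.hbase" % "hbase-client" % "1.2.0",  // version taken from my setup above
  "org.apache.hbase" % "hbase-server" % "1.2.0"
)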

I also tried the blog author's patched HFileOutputFormat2 (https://github.com/gbif/maps/commit/ee4e0001486f3e8b37b034c5b05fc8c8d4e76ab9), but with it the HFiles fail to be written altogether.

Below is the portion of the code that writes the HFiles and bulk loads them into HBase:

dates.map { x =>
      val hbase_connection2 = ConnectionFactory.createConnection(hbaseconfig)
      val table2 = hbase_connection2.getTable(TableName.valueOf(hbase_table))
      val regionLoc2 = hbase_connection2.getRegionLocator(table2.getName)
      val admin2 = hbase_connection2.getAdmin

      val transactionsDF = sql(hive_query_2.replace("|columns|", hive_columns) + " " + period_clause.replace("|date|", date_format.format(x)))

      val max_records = transactionsDF.count()

      val max_page = math.ceil(max_records.toDouble/page_limit.toDouble).toInt

      val start_row = 0
      val end_row = page_limit.toInt

      val start_page = if(date_format.format(x).equals(bookmark_date)) {
        bookmarked_page
      }
      else{
        0
      }

      val pages = start_page to (if (max_records < page_limit.toInt) max_page - 1 else max_page)

      if (max_records > 0) {
        pages.map(page => {
          val sourceDF = transactionsDF
            .withColumn("tanggal", cnvrt_tanggal(transactionsDF.col("ctry_cd"), transactionsDF.col("ori_tanggal"), transactionsDF.col("ori_jam")))
            .withColumn("jam", cnvrt_jam(transactionsDF.col("ctry_cd"), transactionsDF.col("ori_tanggal"), transactionsDF.col("ori_jam")))
            .join(locations, transactionsDF.col("wsid") === locations.col("key"), "left_outer")
            .join(trandescdictionary, lower(transactionsDF.col("source_system")) === lower(trandescdictionary.col("FLAG_TRANS")) && lower(transactionsDF.col("trans_cd")) === lower(trandescdictionary.col("TRAN_CODE")), "left_outer")
            .filter(transactionsDF.col("rowid").between((start_row + (page * page_limit.toInt)).toString, ((end_row + (page * page_limit.toInt)) - 1).toString))
            .withColumn("uuid", timeUUID())
            .withColumn("created_dt", current_timestamp())

          val spp = new SaltPrefixPartitioner(hbase_regions)

          val saltedRDD = sourceDF.rdd.flatMap { r =>
            // salt the uuid row key and collect the 23 column values in the order of cols
            val values = (0 to 22).map(i => r.get(r.fieldIndex(cols(i).toLowerCase)))
            Seq((salt(r.getString(r.fieldIndex("uuid")), hbase_regions), values))
          }

          val partitionedRDD = saltedRDD.repartitionAndSortWithinPartitions(spp)

          val cells = partitionedRDD.flatMap { case (salted_key, values) =>
            val rowKey = Bytes.toBytes(salted_key)
            // one KeyValue per column, keeping null values as empty strings
            (0 to 22).map { i =>
              (new ImmutableBytesWritable(rowKey),
                new KeyValue(rowKey, hbase_colfamily.getBytes(), cols(i).getBytes(),
                  Bytes.toBytes(Option(values(i)).getOrElse("").toString)))
            }
          }

          val job = Job.getInstance(hbaseconfig, "Insert Transaction Data Row " + (start_row + (page * page_limit.toInt)).toString + " to " + ((end_row + (page * page_limit.toInt)) - 1).toString + " for " + x.toString)
          HFileOutputFormat2.configureIncrementalLoad(job, table2, regionLoc2)

          val conf = job.getConfiguration

          // delete any leftover output from a previous run, then write the HFiles
          if (fs.exists(path)) {
            fs.delete(path, true)
          }

          cells.saveAsNewAPIHadoopFile(
            path.toString,
            classOf[ImmutableBytesWritable],
            classOf[KeyValue],
            classOf[HFileOutputFormat2],
            conf
          )

          val bulk_loader = new LoadIncrementalHFiles(conf)
          bulk_loader.doBulkLoad(path, admin2, table2, regionLoc2)

          conf.clear()
          println("Done For " + x.toString + " pages " + (start_row + (page * page_limit.toInt)).toString + " to " + ((end_row + (page * page_limit.toInt)) - 1).toString)

          // persist the bookmark (date, page) so the job can resume from this point
          if (fs.exists(bookmark)) {
            fs.delete(bookmark, true)
          }
          Seq((date_format.format(x), page.toString)).toDF("Date", "Page")
            .write.format("com.databricks.spark.csv")
            .option("delimiter", "|")
            .save(bookmark_path)
          0
        })
      }
      hbase_connection2.close()
      0
}
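
The salt() and SaltPrefixPartitioner helpers are not shown above; they follow the modulus-salting idea from the linked blog post, roughly like the sketch below (the two-digit prefix format and hashCode-based salt here are illustrative, not my exact code):

import org.apache.spark.Partitioner

// illustrative sketch: prefix the row key with (hash mod number of regions),
// e.g. "07:<uuid>", so rows spread evenly across the pre-split regions
def salt(key: String, regions: Int): String =
  "%02d".format(Math.abs(key.hashCode) % regions) + ":" + key

// illustrative sketch: route each salted key to the partition named by its
// two-digit prefix, so Spark partitions line up with the HBase regions
class SaltPrefixPartitioner(regions: Int) extends Partitioner {
  override def numPartitions: Int = regions
  override def getPartition(key: Any): Int = key.toString.take(2).toInt
}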

I am really at my wits' end, as I cannot trace what is causing this error. I hope someone can give me some ideas on what could be causing this file-splitting error.

Kok-Lim Wong

2 Answers


I think you may be seeing this: https://issues.apache.org/jira/projects/HBASE/issues/HBASE-21183

It is something I saw sporadically, so we never solved it. How regularly do you see it please?

  • Quite often (70% of the time), but there were times the script was able to finish completely without issues. – Kok-Lim Wong Aug 10 '19 at 12:01
  • And you say you see socket timeouts too? – timrobertson100 Aug 12 '19 at 09:15
  • Yes, but it did not cause the split error, as I see some iterations do insert data into HBase even with the timeout issue. It is not shown in my code, but the code I pasted here is actually inside another map which iterates through 93 days. It will usually insert about 50+ days' worth before the split error occurs. – Kok-Lim Wong Aug 12 '19 at 13:48
  • I edited my original post to show the dates iteration. – Kok-Lim Wong Aug 12 '19 at 14:02
  • We found the issue to be resource consumption. There was another project which at times ate up most or all of the resources in the cluster. – Kok-Lim Wong Jul 14 '20 at 15:32

Per HBASE-3871, it seems that parallelizing and repartitioning your data is the solution for this.

See this code, which is the origin of the error:

private Pair<Multimap<ByteBuffer, LoadQueueItem>, Set<String>> groupOrSplitPhase(
      AsyncClusterConnection conn, TableName tableName, ExecutorService pool,
      Deque<LoadQueueItem> queue, List<Pair<byte[], byte[]>> startEndKeys) throws IOException {
    // <region start key, LQI> need synchronized only within this scope of this
    // phase because of the puts that happen in futures.
    Multimap<ByteBuffer, LoadQueueItem> rgs = HashMultimap.create();
    final Multimap<ByteBuffer, LoadQueueItem> regionGroups = Multimaps.synchronizedMultimap(rgs);
    Set<String> missingHFiles = new HashSet<>();
    Pair<Multimap<ByteBuffer, LoadQueueItem>, Set<String>> pair =
      new Pair<>(regionGroups, missingHFiles);

    // drain LQIs and figure out bulk load groups
    Set<Future<Pair<List<LoadQueueItem>, String>>> splittingFutures = new HashSet<>();
    while (!queue.isEmpty()) {
      final LoadQueueItem item = queue.remove();

      final Callable<Pair<List<LoadQueueItem>, String>> call =
        new Callable<Pair<List<LoadQueueItem>, String>>() {
          @Override
          public Pair<List<LoadQueueItem>, String> call() throws Exception {
            Pair<List<LoadQueueItem>, String> splits =
              groupOrSplit(conn, tableName, regionGroups, item, startEndKeys);
            return splits;
          }
        };
      splittingFutures.add(pool.submit(call));
    }
    // get all the results. All grouping and splitting must finish before
    // we can attempt the atomic loads.
    for (Future<Pair<List<LoadQueueItem>, String>> lqis : splittingFutures) {
      try {
        Pair<List<LoadQueueItem>, String> splits = lqis.get();
        if (splits != null) {
          if (splits.getFirst() != null) {
            queue.addAll(splits.getFirst());
          } else {
            missingHFiles.add(splits.getSecond());
          }
        }
      } catch (ExecutionException e1) {
        Throwable t = e1.getCause();
        if (t instanceof IOException) {
          LOG.error("IOException during splitting", e1);
          throw (IOException) t; // would have been thrown if not parallelized,
        }
        LOG.error("Unexpected execution exception during splitting", e1);
        throw new IllegalStateException(t);
      } catch (InterruptedException e1) {
        LOG.error("Unexpected interrupted exception during splitting", e1);
        throw (InterruptedIOException) new InterruptedIOException().initCause(e1);
      }
    }
    return pair;
  }
Ram Ghadiyaram
  • Hi Ram, thanks for the reply; I just want to find out more about what you mean. I did perform a repartitionAndSortWithinPartitions before writing the HFile. Are you saying I need to use parallelize() as well? – Kok-Lim Wong Aug 12 '19 at 06:20
  • The reason I ask is that I thought RDDs are already parallelized. – Kok-Lim Wong Aug 12 '19 at 07:34
  • When I logged HBASE-21183 there should have been no splitting. I had prepared 2GB well-balanced HFiles which loaded in immediately 59 times out of 60. Occasionally I saw this error. My suspicion is that on some HDFS read error (e.g. a dropped network connection) it fails incorrectly, and the bulk load needs to be hardened. From private comms, I also know the poster here has seen socket timeouts while the script runs. – timrobertson100 Aug 12 '19 at 07:54
  • @Kok-LimWong RDDs are parallelized, but the data may not be uniformly distributed; check that. That is what I mean. – Ram Ghadiyaram Aug 12 '19 at 14:57
  • @RamGhadiyaram thanks for clearing that up. So what should we do to make the data uniformly distributed? I believe this is already done when repartitionAndSortWithinPartitions() is called. – Kok-Lim Wong Aug 12 '19 at 15:40
  • I have doubts about repartitionAndSortWithinPartitions; please check repartition alone and see the result. Also print the number of partitions in both cases to understand what is happening (see the sketch after these comments). – Ram Ghadiyaram Aug 12 '19 at 15:56
  • If you are following the approach in the blog post you link above, @Kok-LimWong, then you should have well-balanced data across the HFiles (it uses a modulus salting approach). It might be worth checking their size on disk to make sure they are not unduly large and would require splitting. A good rule of thumb is to target 8GB regions, giving a little room for growth before they would split. – timrobertson100 Aug 13 '19 at 15:22
  • Hi Tim, thanks for the feedback. I did read in other articles that a salted table is used to keep the data well balanced. I am just confused why the splitting would happen, since the region size is set to 20GB and each day has only about 10k-20k rows of data, each row with an estimated size of 4590 bytes. – Kok-Lim Wong Aug 14 '19 at 02:12
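
To make the partition check Ram suggests concrete, here is a minimal diagnostic sketch against the saltedRDD and spp values from the question (the printed counts are only for spotting skew; this is an illustration, not a fix):

// compare the partitioning used in the question with a plain repartition
val sorted = saltedRDD.repartitionAndSortWithinPartitions(spp)
val plain  = saltedRDD.repartition(hbase_regions)

println("partitions (repartitionAndSortWithinPartitions): " + sorted.getNumPartitions)
println("partitions (repartition):                        " + plain.getNumPartitions)

// rows per partition reveal skew; a heavily skewed partition produces an
// oversized HFile that LoadIncrementalHFiles then has to split
sorted.mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx -> $n rows") }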