Exporting data from HBase to CSV using Phoenix and Spark results in data loss. I have 22 million rows in my HBase table, but when I export it to CSV only 19 million rows end up in the file; 3 million rows are missing.
I have tried caching the DataFrame before writing to CSV, but the result is still 19 million rows. I used coalesce because I need everything in a single CSV file.
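The single-file requirement is the only reason for coalesce(1). A fallback I could try instead is letting Spark write the part files normally and merging them afterwards. This is only a sketch: it assumes Hadoop 2.x (where FileUtil.copyMerge still exists; it was removed in Hadoop 3), uses placeholder paths, and df is the same DataFrame as in my Spark code below. The per-part headers would also have to be dropped before merging.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Write part files in parallel (no header option, so the merged file has no repeated headers),
// then merge them into a single file on HDFS.
df.write.mode("overwrite").csv("/tchiring/export_parts")

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("/tchiring/export_parts"),
  fs, new Path("/tchiring/export_test/full.csv"),
  false, conf, null)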
I also tried exporting with !record in the Phoenix sqlline client, but the data is huge and it takes forever:
- !outputformat csv
- !record data.csv
- select * from table;
- !record
- !quit
Is there any way I can export my HBase table without data loss? Help with editing my code, or any other suggestions, would be greatly appreciated.
My Spark Code in Scala:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ExportCSV {
  def main(args: Array[String]): Unit = {
    val table_name = "xyz"
    val phoenix_zk = "jdbc:phoenix:zkurl"

    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark = SparkSession.builder().master("local[*]")
      .config("spark.debug.maxToStringFields", "100")
      //.config("spark.driver.maxResultSize", "2g")
      .config("spark.yarn.executor.memoryOverhead", "4g")
      .appName("SparkPhoenix")
      .getOrCreate()

    // Read the Phoenix table over JDBC.
    val df = spark.read.format("jdbc").options(
      Map("driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
        "url" -> phoenix_zk,
        "dbtable" -> table_name)).load()

    print(df.count()) // 22 million rows in the DataFrame

    df.cache()

    print(df.count()) // 19 million rows after cache

    df.explain(extended = true)

    df.coalesce(1).write.mode("append").option("header", "true").csv("/tchiring/export_test")
  }
}
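For completeness, an alternative read path I could try is the phoenix-spark connector instead of plain JDBC. This is only a sketch: it assumes the phoenix-spark module is on the classpath, reuses the same spark session built above, and the zkUrl value is a placeholder for my actual ZooKeeper quorum (without the jdbc:phoenix: prefix).

// Read the same table through the phoenix-spark connector instead of JDBC.
val phoenixDf = spark.read
  .format("org.apache.phoenix.spark")
  .options(Map(
    "table" -> "xyz",   // same Phoenix table as above
    "zkUrl" -> "zkurl"  // ZooKeeper quorum, placeholder
  ))
  .load()

print(phoenixDf.count()) // should also show 22 million if the read is complete

phoenixDf.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tchiring/export_test_connector")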