Exporting data from HBase to CSV using Phoenix and Spark results in data loss. I have 22 million rows in my HBase table, but when I export it to CSV only 19 million rows end up in the file; 3 million rows are missing.
I have tried caching the DataFrame before writing to CSV, but the result is still 19 million rows. I used coalesce because I need everything in a single CSV file.
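The single-file requirement is the only reason for coalesce(1). A fallback I could try instead is letting Spark write the part files normally and merging them afterwards. This is only a sketch: it assumes Hadoop 2.x (where FileUtil.copyMerge still exists; it was removed in Hadoop 3), uses placeholder paths, and df is the same DataFrame as in my Spark code below. The per-part headers would also have to be dropped before merging.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Write part files in parallel (no header option, so the merged file has no repeated headers),
// then merge them into a single file on HDFS.
df.write.mode("overwrite").csv("/tchiring/export_parts")

val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("/tchiring/export_parts"),
  fs, new Path("/tchiring/export_test/full.csv"),
  false, conf, null)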
I also tried exporting with !record in the Phoenix sqlline client, but the data is huge and it takes forever:
- !outputformat csv
- !record data.csv
- select * from table;
- !record
- !quit
Is there any way I can export my HBase table without data loss? Help with editing my code, or any other suggestions, would be greatly appreciated.
My Spark Code in Scala:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object ExportCSV {
  def main(args: Array[String]): Unit = {
    val table_name = "xyz"
    val phoenix_zk = "jdbc:phoenix:zkurl"

    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("akka").setLevel(Level.ERROR)

    val spark = SparkSession.builder().master("local[*]")
      .config("spark.debug.maxToStringFields", "100")
      //.config("spark.driver.maxResultSize", "2g")
      .config("spark.yarn.executor.memoryOverhead", "4g")
      .appName("SparkPhoenix")
      .getOrCreate()

    // Read the Phoenix table over JDBC.
    val df = spark.read.format("jdbc").options(
      Map("driver" -> "org.apache.phoenix.jdbc.PhoenixDriver",
        "url" -> phoenix_zk,
        "dbtable" -> table_name)).load()

    print(df.count()) // 22 million rows in the DataFrame

    df.cache()

    print(df.count()) // 19 million rows after cache

    df.explain(extended = true)

    df.coalesce(1).write.mode("append").option("header", "true").csv("/tchiring/export_test")
  }
}
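For completeness, an alternative read path I could try is the phoenix-spark connector instead of plain JDBC. This is only a sketch: it assumes the phoenix-spark module is on the classpath, reuses the same spark session built above, and the zkUrl value is a placeholder for my actual ZooKeeper quorum (without the jdbc:phoenix: prefix).

// Read the same table through the phoenix-spark connector instead of JDBC.
val phoenixDf = spark.read
  .format("org.apache.phoenix.spark")
  .options(Map(
    "table" -> "xyz",   // same Phoenix table as above
    "zkUrl" -> "zkurl"  // ZooKeeper quorum, placeholder
  ))
  .load()

print(phoenixDf.count()) // should also show 22 million if the read is complete

phoenixDf.coalesce(1).write.mode("overwrite").option("header", "true").csv("/tchiring/export_test_connector")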