I have a PySpark dataframe with a unique uid field, and I wrote a mapPartitions function to count the number of times each user has been seen, keeping the running count in HBase. The mapPartitions code is:
def get_mark_num(p):
    hbase_util = HappyBaseUtil("localhost", 9090)
    for r in p:
        uid = str(r.uid)
        mark_num_d = hbase_util.get_row("test", row=uid, columns=["info:marks"])
        if len(mark_num_d) == 0:
            # first time we see this uid: write 1
            hbase_util.put("test", row=uid, data={"info:marks": str(1)})
            yield Row(uid=r.uid, name=r.name, mark_num=1)
        else:
            # uid already in HBase: increment the stored count
            mark_num = int(mark_num_d.get("info:marks")) + 1
            hbase_util.put("test", row=uid, data={"info:marks": str(mark_num)})
            yield Row(uid=r.uid, name=r.name, mark_num=mark_num)

df.rdd.mapPartitions(get_mark_num).toDF()
The uid is unique in the dataframe, and the test
table in HBase is empty before the run. But on the very first run, one user is processed twice: in the test
table its mark_num ends up as 2, when it should be 1. I cannot find what is wrong with my code.
When I add print statements inside the mapPartitions function, it prints two results for the same user (and I'm sure there is only one row with uid 391 in the dataframe):
===========this uid: 391, name: Jr.
==========time:1673509732.8683832. uid:391, mark_num_d: {}
===========this uid: 391, name: Jr.
==========time:1673509735.725803. uid:391, mark_num_d: {'info:marks': '1'}
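For reference, the per-row logic can be reproduced without Spark or HBase. In this sketch a plain dict stands in for the HBase "test" table, and InRow stands in for the input rows (both the dict and InRow are illustrative stand-ins, not part of my actual code). It shows that if the same partition data is fed through the function a second time, the stored count becomes 2, which matches the symptom I see:

```python
from collections import namedtuple

Row = namedtuple("Row", ["uid", "name", "mark_num"])
InRow = namedtuple("InRow", ["uid", "name"])  # illustrative input row

table = {}  # stands in for the HBase "test" table: uid -> stored marks string

def get_mark_num(p):
    for r in p:
        uid = str(r.uid)
        mark_num_d = table.get(uid)      # analogous to hbase_util.get_row
        if mark_num_d is None:
            table[uid] = str(1)          # analogous to hbase_util.put
            yield Row(uid=r.uid, name=r.name, mark_num=1)
        else:
            mark_num = int(mark_num_d) + 1
            table[uid] = str(mark_num)
            yield Row(uid=r.uid, name=r.name, mark_num=mark_num)

partition = [InRow(uid=391, name="Jr.")]
print(list(get_mark_num(partition)))  # first pass: mark_num=1, table stores "1"
print(list(get_mark_num(partition)))  # second pass over the same data: mark_num=2
```

So the logic itself is deterministic on a single pass; the mark_num=2 only appears when the same row goes through the function twice.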
Spark version: 2.4.0-cdh6.3.2
HBase version: 2.1.0-cdh6.3.2