I have a PySpark dataframe with a unique uid field, and I wrote a mapPartitions function to count the number of times each user has been seen, keeping the running count in HBase. The mapPartitions code is:
def get_mark_num(p):
    hbase_util = HappyBaseUtil("localhost", 9090)
    for r in p:
        uid = str(r.uid)
        mark_num_d = hbase_util.get_row("test", row=uid, columns=["info:marks"])
        if len(mark_num_d) == 0:
            # first time we see this uid: write 1
            hbase_util.put("test", row=uid, data={"info:marks": str(1)})
            yield Row(uid=r.uid, name=r.name, mark_num=1)
        else:
            # uid already in HBase: increment the stored count
            mark_num = int(mark_num_d.get("info:marks")) + 1
            hbase_util.put("test", row=uid, data={"info:marks": str(mark_num)})
            yield Row(uid=r.uid, name=r.name, mark_num=mark_num)

df.rdd.mapPartitions(get_mark_num).toDF()
The uid is unique in the dataframe, and the test
table in HBase is empty before the run. But on the very first run, one user is processed twice: in the test
table its mark_num ends up as 2, when it should be 1. I cannot find what is wrong with my code.
When I add print statements inside the mapPartitions function, it prints two results for the same user (and I'm sure there is only one row with uid 391 in the dataframe):
===========this uid: 391, name: Jr.
==========time:1673509732.8683832. uid:391, mark_num_d: {}
===========this uid: 391, name: Jr.
==========time:1673509735.725803. uid:391, mark_num_d: {'info:marks': '1'}
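For reference, the per-row logic can be reproduced without Spark or HBase. In this sketch a plain dict stands in for the HBase "test" table, and InRow stands in for the input rows (both the dict and InRow are illustrative stand-ins, not part of my actual code). It shows that if the same partition data is fed through the function a second time, the stored count becomes 2, which matches the symptom I see:

```python
from collections import namedtuple

Row = namedtuple("Row", ["uid", "name", "mark_num"])
InRow = namedtuple("InRow", ["uid", "name"])  # illustrative input row

table = {}  # stands in for the HBase "test" table: uid -> stored marks string

def get_mark_num(p):
    for r in p:
        uid = str(r.uid)
        mark_num_d = table.get(uid)      # analogous to hbase_util.get_row
        if mark_num_d is None:
            table[uid] = str(1)          # analogous to hbase_util.put
            yield Row(uid=r.uid, name=r.name, mark_num=1)
        else:
            mark_num = int(mark_num_d) + 1
            table[uid] = str(mark_num)
            yield Row(uid=r.uid, name=r.name, mark_num=mark_num)

partition = [InRow(uid=391, name="Jr.")]
print(list(get_mark_num(partition)))  # first pass: mark_num=1, table stores "1"
print(list(get_mark_num(partition)))  # second pass over the same data: mark_num=2
```

So the logic itself is deterministic on a single pass; the mark_num=2 only appears when the same row goes through the function twice.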
Spark version: 2.4.0-cdh6.3.2
HBase version: 2.1.0-cdh6.3.2