
I am new to PredictionIO 0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6, Spark 2.6.0), running on hardware with 244 GB RAM and 32 cores. I have uploaded roughly 1 million events (each containing 30k features). While uploading, I could see HBase's disk usage increasing, and after all the events were uploaded the HBase data on disk was 567 GB. To verify the upload, I ran the following commands:

 - pio-shell --with-spark --conf spark.network.timeout=10000000 --driver-memory 30G --executor-memory 21G --num-executors 7 --executor-cores 3 --conf spark.driver.maxResultSize=4g --conf spark.executor.heartbeatInterval=10000000
 - import org.apache.predictionio.data.store.PEventStore
 - val eventsRDD = PEventStore.find(appName="test")(sc)
 - val c = eventsRDD.count() 

It shows the event count as 18944.

After that, from the script through which I uploaded the events, I randomly queried some events by their event IDs, and each one I queried was returned.
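
For reference, a single event can also be fetched by id from the Event Server REST API; a minimal sketch (the host, port, access key and event id below are placeholders, not actual values from my setup):

# fetch one event by id from the Event Server (assumed to listen on localhost:7070)
curl -X GET "http://localhost:7070/events/<eventId>.json?accessKey=<accessKey>"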

I don't know how to make sure that all the events I uploaded are actually present in the app. Any help is appreciated.
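
A direct row count on the backing HBase table might serve as a cross-check against the Spark count; a rough sketch, assuming the default PredictionIO table naming pio_event:events_<appId> (the table name may differ per installation, and the count is slow on large tables):

# in the HBase shell, count rows of the events table for app id 1 (name is an assumption)
count 'pio_event:events_1', INTERVAL => 100000, CACHE => 1000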


1 Answer


Finally I figured out what happens in

org.apache.predictionio.data.storage.hbase.HBPEvents

val scan = HBEventsUtil.createScan(
    startTime = startTime,
    untilTime = untilTime,
    entityType = entityType,
    entityId = entityId,
    eventNames = eventNames,
    targetEntityType = targetEntityType,
    targetEntityId = targetEntityId,
    reversed = None)
scan.setCaching(500) // TODO
scan.setCacheBlocks(false) // TODO

scan.setCaching(500) may cause a request timeout. You can try a lower caching value here. You need to change the source code and recompile.
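
If you go that route, a rough sketch of the rebuild, assuming a source checkout of PredictionIO (the path and the caching value of 100 are just examples; make-distribution.sh is the usual build script, but check it against the version you use):

# after editing HBPEvents.scala to use e.g. scan.setCaching(100),
# rebuild the PredictionIO binary distribution from the source checkout
cd /path/to/apache-predictionio
./make-distribution.sh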
