
I am new to PredictionIO 0.12.0 (Elasticsearch 5.2.1, HBase 1.2.6, Spark 2.6.0), running on hardware with 244 GB RAM and 32 cores. I have uploaded roughly 1 million events (each containing 30k features). While uploading, I could see HBase's disk usage increasing, and after all the events were uploaded the HBase data on disk was 567 GB. To verify the upload, I ran the following commands:

 - pio-shell --with-spark --conf spark.network.timeout=10000000 --driver-memory 30G --executor-memory 21G --num-executors 7 --executor-cores 3 --conf spark.driver.maxResultSize=4g --conf spark.executor.heartbeatInterval=10000000
 - import org.apache.predictionio.data.store.PEventStore
 - val eventsRDD = PEventStore.find(appName="test")(sc)
 - val c = eventsRDD.count() 

It shows the event count as 18944.

After that, from the script through which I uploaded the events, I randomly queried some events by their event IDs, and each one I queried was returned.
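
For reference, a single event can also be fetched by id from the Event Server REST API; a minimal sketch (the host, port, access key and event id below are placeholders, not actual values from my setup):

# fetch one event by id from the Event Server (assumed to listen on localhost:7070)
curl -X GET "http://localhost:7070/events/<eventId>.json?accessKey=<accessKey>"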

I don't know how to make sure that all the events I uploaded are actually present in the app. Any help is appreciated.
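
A direct row count on the backing HBase table might serve as a cross-check against the Spark count; a rough sketch, assuming the default PredictionIO table naming pio_event:events_<appId> (the table name may differ per installation, and the count is slow on large tables):

# in the HBase shell, count rows of the events table for app id 1 (name is an assumption)
count 'pio_event:events_1', INTERVAL => 100000, CACHE => 1000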


1 Answer


Finally I figured out what happens in

org.apache.predictionio.data.storage.hbase.HBPEvents

val scan = HBEventsUtil.createScan(
    startTime = startTime,
    untilTime = untilTime,
    entityType = entityType,
    entityId = entityId,
    eventNames = eventNames,
    targetEntityType = targetEntityType,
    targetEntityId = targetEntityId,
    reversed = None)
scan.setCaching(500) // TODO
scan.setCacheBlocks(false) // TODO

scan.setCaching(500) may cause a request timeout. You can try a lower caching value here. You need to change the source code and recompile.
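
If you go that route, a rough sketch of the rebuild, assuming a source checkout of PredictionIO (the path and the caching value of 100 are just examples; make-distribution.sh is the usual build script, but check it against the version you use):

# after editing HBPEvents.scala to use e.g. scan.setCaching(100),
# rebuild the PredictionIO binary distribution from the source checkout
cd /path/to/apache-predictionio
./make-distribution.sh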
