
I profiled my MR job and found that fetching the next batch of records for the table scan takes ~30% of the time spent in the mapper. As far as I understand, the scanner fetches N rows from the server, as configured by scan.setCaching, and then iterates over them locally.
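For reference, this is roughly how that batch size gets configured in a table-mapper job (Scan#setCaching and TableMapReduceUtil are standard HBase APIs; the table name, mapper class, and job variable are placeholders):

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

    Scan scan = new Scan();
    scan.setCaching(100);        // rows fetched per next-batch RPC, then iterated locally
    scan.setCacheBlocks(false);  // usually recommended for full scans in MR jobs
    TableMapReduceUtil.initTableMapperJob(
        "my_table",              // placeholder table name
        scan,
        MyMapper.class,          // placeholder mapper
        ImmutableBytesWritable.class,
        Result.class,
        job);                    // the Job being configured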


Is there anything I can do to minimize the cache load time? Is this a sign that the scan was set up incorrectly? Current setup:

scan caching = 100
record size = ~5 KB
cf block size = ~130 KB, compression = gz
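Back-of-envelope on those numbers (assuming ~5 KB is the uncompressed row size): 100 rows × ~5 KB is only ~500 KB per next-batch RPC, so a long scan pays a server round trip every 100 rows. If mapper and RegionServer heap allow, a larger caching value amortizes that latency:

    scan.setCaching(1000);  // ~5 MB per fetch at ~5 KB/row, at the cost of memory on both sides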

I'm considering a custom table record reader that performs prefetching in the background.
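A minimal sketch of that idea, assuming a plain Hadoop RecordReader<ImmutableBytesWritable, Result> as the delegate (e.g. the one TableInputFormat creates). The class name, queue capacity, and error handling are illustrative, not an existing HBase API:

    import java.io.IOException;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.atomic.AtomicReference;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Sketch only: drains the wrapped reader on a background thread into a
    // bounded queue, so the mapper overlaps the scanner's next-batch RPCs
    // with its own processing.
    public class PrefetchingRecordReader
            extends RecordReader<ImmutableBytesWritable, Result> {

        // One fetched record; EOF marks the end of the underlying scan.
        private static final class Item {
            final ImmutableBytesWritable key;
            final Result value;
            Item(ImmutableBytesWritable key, Result value) {
                this.key = key;
                this.value = value;
            }
        }
        private static final Item EOF = new Item(null, null);

        private final RecordReader<ImmutableBytesWritable, Result> delegate;
        private final BlockingQueue<Item> queue = new ArrayBlockingQueue<>(200);
        private final AtomicReference<Throwable> error = new AtomicReference<>();
        private Thread fetcher;
        private Item current;

        public PrefetchingRecordReader(
                RecordReader<ImmutableBytesWritable, Result> delegate) {
            this.delegate = delegate;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
            fetcher = new Thread(() -> {
                try {
                    while (delegate.nextKeyValue()) {
                        // Copy the key: TableRecordReader reuses its key object.
                        // The Result is a fresh object per row, so it is queued as-is.
                        queue.put(new Item(
                                new ImmutableBytesWritable(
                                        delegate.getCurrentKey().copyBytes()),
                                delegate.getCurrentValue()));
                    }
                    queue.put(EOF);
                } catch (Throwable t) {
                    error.set(t);
                    queue.clear();   // make room so the EOF marker always fits
                    queue.offer(EOF);
                }
            }, "scan-prefetcher");
            fetcher.setDaemon(true);
            fetcher.start();
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            current = queue.take();
            Throwable t = error.get();
            if (t != null) {
                throw new IOException("Prefetch thread failed", t);
            }
            return current != EOF;
        }

        @Override
        public ImmutableBytesWritable getCurrentKey() { return current.key; }

        @Override
        public Result getCurrentValue() { return current.value; }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();  // approximate: the delegate runs ahead of the mapper
        }

        @Override
        public void close() throws IOException {
            if (fetcher != null) {
                fetcher.interrupt();
            }
            delegate.close();
        }
    }

To wire it in, one could subclass TableInputFormat, override createRecordReader, and return the wrapped reader. Whether this helps depends on the mapper being CPU-bound enough that its work actually overlaps with the fetches.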

– AdamSkywalker
  • Scan caching = 100 sounds quite reasonable, so please verify your scan... can you paste a sample of your scan statement and your table structure here? If you have column value filters, it will take some time to match the values; if the scan is row-based, it will be faster. With the same scan caching size I was able to retrieve around 5 MB of data per fetch. I suspect the way you scan is the culprit. – Ram Ghadiyaram Mar 08 '17 at 09:53
  • Maintain record size * caching ≈ 1 MB. Is 5 KB the uncompressed size of the records? – KrazyGautam Apr 19 '17 at 16:28
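To make the first comment's point concrete (a hedged illustration; the row keys and value are placeholders): a scan bounded by row keys is a range seek, while a ValueFilter must compare every cell server-side:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.ValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Row-based: the server seeks straight to the key range.
    Scan byRows = new Scan(Bytes.toBytes("row-00001"), Bytes.toBytes("row-99999"));

    // Value-based: every cell in the scanned range is compared server-side,
    // which is the slow case the comment warns about.
    Scan byValue = new Scan();
    byValue.setFilter(new ValueFilter(CompareOp.EQUAL,
            new BinaryComparator(Bytes.toBytes("needle"))));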

0 Answers