I profiled my MR job and found that fetching the next records for the table scan takes ~30% of the time spent in the mapper. As far as I understand, the scanner fetches N rows from the server per request, as configured by scan.setCaching, and then iterates over them locally.
Is there anything I can do to minimize the cache load time? Is this a sign that the scan was set up incorrectly? Current setup:
scan caching = 100
record size = ~5kb
cf block size = ~130kb, compression=gz
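
For a rough sense of scale, the numbers above can be turned into a back-of-the-envelope estimate of how much data each scanner fetch moves. This is only a sketch: the class and method names are made up for illustration, it treats 1kb as 1024 bytes, and it assumes the 130kb block size refers to uncompressed data.

```java
// Hypothetical helper, not part of HBase: estimates the size of one scanner fetch.
final class ScanBatchEstimate {

    // Bytes transferred per fetch = rows per fetch (scan caching) * average row size.
    static long bytesPerBatch(int cachingRows, long rowBytes) {
        return cachingRows * rowBytes;
    }

    // Number of blocks needed to cover one batch, rounded up.
    static long blocksPerBatch(long batchBytes, long blockBytes) {
        return (batchBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long rowBytes = 5L * 1024;      // record size = ~5kb
        long blockBytes = 130L * 1024;  // cf block size = ~130kb (assumed uncompressed)
        long batch = bytesPerBatch(100, rowBytes); // scan caching = 100
        System.out.println("bytes per batch: " + batch);
        System.out.println("blocks per batch: " + blocksPerBatch(batch, blockBytes));
    }
}
```

Under these assumptions each fetch carries roughly 500kb and touches about four blocks, each of which has to be gunzipped on the server side before the rows can be returned.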
I have thought about writing a custom table record reader that performs prefetching in the background.
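
To make the prefetching idea concrete, here is a minimal sketch of the core mechanism, written against a plain java.util.Iterator rather than the HBase API. PrefetchingIterator is a hypothetical name; in a real record reader the source would be whatever iterates the scanner results, and the capacity would be chosen relative to the scan caching value. It assumes the source never returns null, since the bounded queue rejects nulls.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper: drains a slow source iterator on a background thread
// so the consumer can process one item while the next ones are being fetched.
final class PrefetchingIterator<T> implements Iterator<T> {
    private static final Object END = new Object(); // end-of-stream marker

    private final BlockingQueue<Object> buffer;
    private volatile Throwable failure; // set by the producer if the source throws
    private Object next;                // null means "nothing taken from the buffer yet"

    PrefetchingIterator(Iterator<T> source, int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    buffer.put(source.next()); // blocks while the buffer is full
                }
            } catch (Throwable t) {
                failure = t; // reported to the consumer once it reaches END
            } finally {
                try {
                    buffer.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "scan-prefetcher");
        producer.setDaemon(true); // do not keep the JVM alive for an abandoned scan
        producer.start();
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            try {
                next = buffer.take(); // waits only if the producer has fallen behind
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for prefetched item", e);
            }
        }
        if (next == END) {
            if (failure != null) {
                throw new IllegalStateException("source iterator failed", failure);
            }
            return false;
        }
        return true;
    }

    @Override
    @SuppressWarnings("unchecked")
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T value = (T) next;
        next = null;
        return value;
    }
}
```

Wrapping a source as `new PrefetchingIterator<>(rows.iterator(), 100)` keeps up to 100 items buffered ahead of the consumer. This only helps if the mapper's own work and the fetch can genuinely overlap; it does not reduce the total time spent fetching.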