I'm using a collection of BT tables to store data that's being used for both batch and realtime operations, and want to optimize performance, especially around latency of random access reads. And while I do know the underlying BT codebase fairly well, I don't know how all of that translates into best practices with Cloud Bigtable, which isn't quite the same as the underlying code. So I've got some questions for the experts:
(1) I've seen it mentioned in answers to other questions that Cloud BT stores all column families in a single locality group. Since I often have to read data from multiple column families in a single row, this is great for my needs... but I'm noticing a significant slowdown when reading N CFs rather than one CF in a single operation (sketch of the two patterns below, after these questions). In this case each cell is small (~1kB) and the total number of cells being read isn't big, so I'm not expecting this to be dominated by network latency, bandwidth bottlenecks, or the like; and the cells aren't being hammered by writes, so I'm not expecting an uncompacted log that's grown out of control. But:
- Are there any general performance tips for this type of read pattern?
- What are the major and minor compaction intervals used in cloud BT? Are these tunable?
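For concreteness, the two cases I'm comparing look roughly like the sketch below (Python client; the project/instance/table names and the specific filters are just illustrative, not my real setup):

```
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Placeholder names, not my real setup.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Case A: read a single column family from one row.
row_a = table.read_row(
    b"some-row-key",
    filter_=row_filters.FamilyNameRegexFilter("cf1"),
)

# Case B: read N column families from the same row in one call
# (union of per-family filters). This is the case that's noticeably
# slower for me, even though every cell involved is ~1kB.
row_b = table.read_row(
    b"some-row-key",
    filter_=row_filters.RowFilterUnion(filters=[
        row_filters.FamilyNameRegexFilter("cf1"),
        row_filters.FamilyNameRegexFilter("cf2"),
        row_filters.FamilyNameRegexFilter("cf3"),
    ]),
)

# Cells come back keyed by family, then by column qualifier.
if row_b is not None:
    for family, columns in row_b.cells.items():
        for qualifier, cells in columns.items():
            latest_value = cells[0].value  # newest cell first
```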
(2) The read API does accept sparse sets of rows in a single read request. How much optimization is happening under the hood with these? Is there some Cloud BT server within the instance that I'm hitting which parallelizes the underlying operations across tabletservers, or does the Cloud BT API go straight to the tabletservers? (Which is to say: is using this API actually more efficient than issuing the point reads myself in a for loop?)
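Concretely (and assuming a reasonably recent google-cloud-bigtable, where read_rows takes a RowSet and the result is iterable), I'm asking whether option A below is meaningfully better than option B:

```
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

keys = [b"key-001", b"key-042", b"key-977"]  # sparse, non-contiguous row keys

# Option A: one read_rows call carrying a sparse RowSet.
row_set = RowSet()
for key in keys:
    row_set.add_row_key(key)
rows_a = {row.row_key: row for row in table.read_rows(row_set=row_set)}

# Option B: a plain for loop of point reads.
rows_b = {key: table.read_row(key) for key in keys}
```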
(3) Relatedly, I'm using the Python client library. Is there anything I should know about its parallelization of operations, or its parallelizability -- e.g., any gotchas when using it from multiple threads?
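The pattern I'd like to use is roughly the sketch below -- one client/table handle shared by a thread pool of point reads -- and part of the question is whether sharing those objects across threads is actually supported, or whether each worker needs its own client:

```
from concurrent.futures import ThreadPoolExecutor

from google.cloud import bigtable

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")  # shared across threads

keys = [b"key-%03d" % i for i in range(100)]

def fetch(key):
    # Point read against the shared table handle -- is this thread-safe?
    return key, table.read_row(key)

with ThreadPoolExecutor(max_workers=16) as pool:
    rows = dict(pool.map(fetch, keys))
```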
(4) Anything else I should know about how to make random reads scream?
(Footnote for future readers of this question who don't know the innards of BT: you can think of the entire table as divided vertically into locality groups, the locality groups into column families, and the column families into columns; horizontally, the table is divided into tablets, which contain rows. Each locality group basically operates like an independent bigtable under the hood, but in Cloud BT all your families are in a single LG, so this level of abstraction doesn't mean much. The horizontal split into tablets is done dynamically at regular intervals to avoid hotspotting, so a single tablet may be as small as one row or as large as millions of rows.

Within each (locality group) * (tablet) rectangle of your table, the data is stored in the style of a journaling file system: there's a log file of recent writes (just "row, column, value" tuples, basically). Every minor compaction interval, a new log file is started and the previous one is converted into an SSTable, a file that stores a sorted string-to-string map for efficient reads. Every major compaction interval, all the SSTables are combined into a single SSTable. So a single write to BT is just an append to the log, while a read has to check every SSTable currently present, plus the log file; thus if you're writing a lot to a tablet, reads on it get slower. There's a toy sketch of this read path after this footnote.
SSTables actually come in multiple formats optimized for different access patterns -- random access from spinning disk, batch access, and so on -- so depending on those details, a read from one of them can take 1-3 IOPS against the underlying storage system, which is generally a distributed disk.)
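Just to make that read-path story concrete, here's a toy model of a single (locality group) * (tablet) rectangle -- a cartoon of the structure described above, not real Bigtable code:

```
class ToyTablet:
    """Cartoon of one (locality group) * (tablet) rectangle."""

    def __init__(self):
        self.log = {}        # recent writes: (row, column) -> value
        self.sstables = []   # immutable sorted maps, oldest first

    def write(self, row, column, value):
        # A write is just an entry in the current log.
        self.log[(row, column)] = value

    def minor_compaction(self):
        # Freeze the current log into a new SSTable, start a fresh log.
        self.sstables.append(dict(sorted(self.log.items())))
        self.log = {}

    def major_compaction(self):
        # Merge everything into one SSTable; newer values win.
        merged = {}
        for sst in self.sstables:
            merged.update(sst)
        self.sstables = [merged]

    def read(self, row, column):
        # A read has to consult the log plus every SSTable, newest first --
        # which is why a pile of uncompacted writes makes reads slower.
        if (row, column) in self.log:
            return self.log[(row, column)]
        for sst in reversed(self.sstables):
            if (row, column) in sst:
                return sst[(row, column)]
        return None
```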