Questions tagged [bigtable]

Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Bigtable

A Distributed Storage System for Structured Data

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).

Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

Some features

  • fast and extremely large-scale DBMS
  • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
  • designed to scale into the petabyte range
  • it works across hundreds or thousands of machines
  • it is easy to add more machines to the system, and the system automatically starts taking advantage of those resources without any reconfiguration
  • each table has multiple dimensions (one of which is a field for time, allowing versioning)
  • tables are optimized for GFS (Google File System) by being split into multiple tablets - segments of the table split at row boundaries chosen so that each tablet is ~200 megabytes in size

Architecture

BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a time stamp. There can be multiple versions of a cell with different time stamps. The time stamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."
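The versioned-cell model described above can be sketched as a small in-memory map keyed by (row, column, timestamp). This is purely illustrative; the class and method names are made up and do not correspond to any real Bigtable API:

```python
from collections import defaultdict

# Illustrative model only: cells[(row, column)] = {timestamp: value}.
# Columns are named "family:qualifier"; cells are versioned by timestamp.
class TinyBigtable:
    def __init__(self):
        self.cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        self.cells[(row, column)][timestamp] = value

    def read(self, row, column, n_versions=1):
        """Return the n most recent versions of a cell, newest first."""
        versions = self.cells.get((row, column), {})
        return [versions[ts] for ts in sorted(versions, reverse=True)[:n_versions]]

    def delete_older_than(self, row, column, cutoff_ts):
        """Drop all versions of a cell older than cutoff_ts."""
        versions = self.cells.get((row, column), {})
        for ts in [t for t in versions if t < cutoff_ts]:
            del versions[ts]

t = TinyBigtable()
t.put("com.example/index.html", "contents:", 1, "<html>v1</html>")
t.put("com.example/index.html", "contents:", 2, "<html>v2</html>")
print(t.read("com.example/index.html", "contents:", n_versions=2))
# ['<html>v2</html>', '<html>v1</html>']
```

The "select 'n' versions" and "delete cells older than a date" operations from the paragraph above map directly onto `read` and `delete_older_than`.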

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. A tablet is around 200 MB, and each machine stores about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing: if one tablet is receiving many queries, its server can shed other tablets, or the busy tablet can be moved to another machine that is not so busy. Also, if a machine goes down, its tablets can be spread across many other servers so that the performance impact on any given machine is minimal.
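The row-boundary splitting can be illustrated with a toy partitioning function (a hypothetical helper; byte sizes are shrunk from the real ~200 MB budget for readability):

```python
# Hypothetical sketch: split a sorted run of (row_key, size_bytes) pairs
# into tablets at row boundaries, capping each tablet at max_bytes
# (~200 MB in the real system; tiny numbers here for illustration).
def split_into_tablets(rows, max_bytes):
    tablets, current, current_size = [], [], 0
    for row_key, size in rows:
        if current and current_size + size > max_bytes:
            tablets.append(current)        # close the full tablet
            current, current_size = [], 0
        current.append(row_key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets

rows = [("a", 60), ("b", 80), ("c", 70), ("d", 50), ("e", 90)]
print(split_into_tablets(rows, max_bytes=150))
# [['a', 'b'], ['c', 'd'], ['e']]
```

Because splits always fall between whole rows, a single row never straddles two tablets, which is what makes row-level operations cheap to keep atomic.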

Tables are stored as immutable SSTables plus a tail of logs (one log per machine). When a machine runs out of memory, it compacts some tablets, compressing the data using Google-proprietary techniques (BMDiff and Zippy). Minor compactions involve only a few tablets, while major compactions involve the whole tablet system and recover hard-disk space.
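The write path sketched above (a mutable in-memory buffer flushed to immutable, sorted SSTables by a minor compaction) can be illustrated roughly as follows; the names are invented, and logging and compression are omitted:

```python
import bisect

class TabletStore:
    """Simplified LSM-style write path: mutations go to an in-memory
    memtable; a minor compaction freezes it into an immutable, sorted
    SSTable; reads consult the memtable first, then SSTables newest-first."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []              # each entry: sorted tuple of (key, value)
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()     # stand-in for "out of memory"

    def minor_compaction(self):
        frozen = tuple(sorted(self.memtable.items()))
        self.sstables.append(frozen)    # immutable from now on
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):        # newest first
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None
```

Newer SSTables shadow older ones, which is why a major compaction that merges everything down to one file reclaims the space held by overwritten values.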

The locations of Bigtable tablets are stored in special metadata tablets. The lookup of any particular tablet is handled by a three-tiered system. Clients get a pointer to the META0 tablet, of which there is only one. The META0 tablet keeps track of many META1 tablets, which in turn contain the locations of the user tablets being looked up. Both META0 and META1 make heavy use of pre-fetching and caching to minimize bottlenecks in the system.
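The tiered lookup can be sketched as a chain of range maps: the client asks META0 which META1 tablet covers a row key, then asks that META1 tablet which user tablet holds it. All tablet names below are made up, and the real system additionally caches results on the client:

```python
# Hypothetical sketch of the tiered tablet lookup. Each metadata level
# maps the start key of a row range to the next level down.

META0 = {"": "meta1-a", "m": "meta1-b"}            # single root tablet
META1 = {
    "meta1-a": {"": "tablet-1", "f": "tablet-2"},  # covers rows < "m"
    "meta1-b": {"m": "tablet-3", "t": "tablet-4"}, # covers rows >= "m"
}

def find_range(range_map, row_key):
    """Pick the entry whose start key is the largest one <= row_key."""
    start = max(k for k in range_map if k <= row_key)
    return range_map[start]

def locate_tablet(row_key):
    meta1_name = find_range(META0, row_key)            # tier 1: META0
    return find_range(META1[meta1_name], row_key)      # tier 2: META1 -> user tablet

print(locate_tablet("google.com"))   # tablet-2
```

Because the ranges are sorted by row key, each tier is a single range lookup, and caching any META1 entry lets the client skip META0 entirely on later reads.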

Implementation

BigTable is built on Google File System (GFS), which is used as a backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.

Another service that BigTable makes heavy use of is Chubby, a highly-available, reliable distributed lock service. Chubby allows clients to take a lock, possibly associating it with some metadata, which the client can renew by sending keep-alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.
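A toy lease-based lock service in the spirit of Chubby can be sketched as below. All class and method names are hypothetical; real Chubby is a replicated, Paxos-based service, not an in-process dictionary:

```python
import time

class ToyLockService:
    """Toy lease-based lock service. A lock at a filesystem-like path is
    held until its lease expires; keep-alives extend the lease; a lock
    whose lease has lapsed is free for any client to acquire."""
    def __init__(self, lease_seconds=10.0):
        self.lease_seconds = lease_seconds
        self.locks = {}   # path -> (owner, expiry_time, metadata)

    def acquire(self, path, owner, metadata=None, now=None):
        now = time.monotonic() if now is None else now
        holder = self.locks.get(path)
        if holder and holder[1] > now and holder[0] != owner:
            return False                                  # held by someone else
        self.locks[path] = (owner, now + self.lease_seconds, metadata)
        return True

    def keep_alive(self, path, owner, now=None):
        now = time.monotonic() if now is None else now
        holder = self.locks.get(path)
        if holder and holder[0] == owner and holder[1] > now:
            self.locks[path] = (owner, now + self.lease_seconds, holder[2])
            return True
        return False      # lease already expired; owner must re-acquire
```

A master might hold a path such as `/bigtable/master-election` (an invented name) and renew it continuously; if its keep-alives stop, the lease lapses and another candidate can take the lock.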

There are three primary server types of interest in the Bigtable system:

  1. Master servers: assign tablets to tablet servers, keep track of where tablets are located, and redistribute tablets as needed.
  2. Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100 MB - 200 MB). If a tablet server fails, then 100 other tablet servers each pick up one new tablet and the system recovers.
  3. Lock servers: instances of the Chubby distributed lock service. Many actions within BigTable require acquiring locks, including opening tablets for writing, ensuring that there is no more than one active master at a time, and access control checking.
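The recovery behavior in point 2, where a failed server's tablets fan out across the survivors so each survivor takes roughly one, might be sketched like this (a hypothetical helper, not the master's actual algorithm):

```python
# Hypothetical sketch: when a tablet server fails, the master reassigns
# its tablets round-robin across the remaining servers. With at least as
# many survivors as orphaned tablets, each survivor picks up at most one,
# so no single machine absorbs the failed server's whole load.
def redistribute(assignments, failed_server):
    orphaned = assignments.pop(failed_server)
    survivors = sorted(assignments)          # deterministic order for the sketch
    for i, tablet in enumerate(orphaned):
        assignments[survivors[i % len(survivors)]].append(tablet)
    return assignments

servers = {"s1": ["t1", "t2"], "s2": ["t3"], "s3": ["t4"]}
print(redistribute(servers, "s1"))
# {'s2': ['t3', 't1'], 's3': ['t4', 't2']}
```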

API

Typical operations in BigTable are creation and deletion of tables and column families, writing data, and deleting columns from a row. BigTable provides these functions to application developers in an API. Transactions are supported at the row level, but not across several row keys.
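Row-level transactions can be illustrated with a per-row lock: a read-modify-write on a single row is atomic, but there is deliberately no way to lock several rows in one transaction. This is a minimal sketch with invented names, not the real client API:

```python
import threading

class RowTransactionalTable:
    """Sketch of row-level atomicity: each row has its own lock, so a
    read-modify-write on one row is atomic, but nothing spans rows."""
    def __init__(self):
        self.rows = {}            # row_key -> {column: value}
        self.row_locks = {}
        self.table_lock = threading.Lock()

    def _lock_for(self, row_key):
        with self.table_lock:     # guard lazy creation of per-row locks
            return self.row_locks.setdefault(row_key, threading.Lock())

    def read_modify_write(self, row_key, column, fn):
        """Atomically apply fn to one cell of one row and return the result."""
        with self._lock_for(row_key):
            row = self.rows.setdefault(row_key, {})
            row[column] = fn(row.get(column))
            return row[column]

t = RowTransactionalTable()
t.read_modify_write("user1", "stats:visits", lambda v: (v or 0) + 1)
print(t.read_modify_write("user1", "stats:visits", lambda v: (v or 0) + 1))  # 2
```

Increments and appends are exactly this shape of operation, which is why they fit within Bigtable's single-row transaction model.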

528 questions
4 votes, 2 answers

BigTable: When should I enable Single-Row Transaction?

Cloud Bigtable docs on Single-row Transactions says: Cloud Bigtable also supports some write operations that would require a transaction in other databases: Read-modify-write operations, including increments and appends. A read-modify-write…
Gabriel
4 votes, 1 answer

Difference between google appengine and actual big table

I know that app engine is implemented on big table, can anyone describe the difference between actual implementation of big table and google's implementation of big table .i.e (App engine)
Abdul Kader
4 votes, 1 answer

Connect to Bigtable emulator from localhost with Node.js client

Trying to connect to Cloud Bigtable emulator from localhost. Saw couple of posts on how to connect to localhost Bigtable emulator with Java. There is no documentation that specifies how to do so with Node.js. @google-cloud/bigtable client needs…
4 votes, 1 answer

Why does BigTable have column families?

Why is BigTable structured as a two-level hierarchy of "family:qualifier"? Specifically, why is this enforced rather than just having columns and, say, recommending that users name their qualifiers "vertical:column"? I am interested in whether or…
user3038457
4 votes, 2 answers

Transactional counter with 5+ writes per second in Google App Engine datastore

I'm developing a tournament version of a game where I expect 1000+ simultaneous players. When the tournament begins, players will be eliminated quite fast (possibly more than 5 per second), but the process will slow down as the tournament…
jaz
4 votes, 2 answers

DB benchmarks: Cassandra vs. BigTable vs. Hadoop(s)

I am looking to evaluate the possibility of using Cassandra, BigTable, or a Hadoop-solution. Are there any places that have an up-to-date comparison on how these three compare and perform on a set of benchmark tests? I found a few from perhaps five…
David542
4 votes, 1 answer

full-text search on bigtable

any insight as to making/optimizing full-text searches on bigtable using java? best practices and such? how do u guys do it?
Devrim
4 votes, 1 answer

App Engine BadValueError On Bulk Data Upload - TextProperty being construed as StringProperty

bulkoader.yaml: transformers: - kind: ExampleModel connector: csv property_map: - property: __key__ external_name: key export_transform: transform.key_id_or_name_as_string - property: data…
4 votes, 2 answers

Google Cloud Bigtable authentication with Go

I'm trying to insert a simple record as in GoDoc. But this returns, rpc error: code = 7 desc = "User can't access project: tidy-groove" When I searched for grpc codes, it says.. PermissionDenied Code = 7 // Unauthenticated indicates the request…
PrasadJay
4 votes, 1 answer

Composite partition key (Cassandra) vs. interleaved indexes (Accumulo, BigTable) for time-spatial series

I'm working on a project in which we import 50k - 100k datapoints every day, located both temporally (YYYYMMDDHHmm) and spatially (lon, lat), which we then dynamically render onto maps according to the query parameters set by our users. We do use…
Jacoscaz
4 votes, 1 answer

Cannot connect from Titan to Google Bigtable via Hbase client

I am trying to connect to Titan 1.0.0 with Hadoop 2 (HBase 1.0.2 client) (available in https://github.com/thinkaurelius/titan/wiki/Downloads) with Google Cloud Bigtable service, using its HBase client. I could successfully connect to Bigtable from…
4 votes, 1 answer

Google Appengine: Is This a Good set of Entity Groups?

I am trying to wrap my head around Entity Groups in Google AppEngine. I understand them in general, but since it sounds like you can not change the relationships once the object is created AND I have a big data migration to do, I want to try to get…
4 votes, 2 answers

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term…
Joe
4 votes, 1 answer

HBase and Bigtable support single-row transactions

What does it mean that HBase and Google's Bigtable both support single-row transactions but not multi-row? Currently I am using HBase on top of my local file system; how can I see this practically?
Rohit
4 votes, 3 answers

ndb.query.count() failed with 60s query deadline on large entities

For 100k+ entities in google datastore, ndb.query().count() is going to cancelled by deadline , even with index. I've tried with produce_cursors options but only iter() or fetch_page() will returns cursor but count() doesn't. How can I count large…
Ray Yun