Questions tagged [bigtable]

Bigtable

A Distributed Storage System for Structured Data

Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).

Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

Some features

  • fast and extremely large-scale DBMS
  • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
  • designed to scale into the petabyte range
  • it works across hundreds or thousands of machines
  • machines can be added to the system easily, and their resources are used automatically, without any reconfiguration
  • each table has multiple dimensions (one of which is a field for time, allowing versioning)
  • tables are optimized for GFS (Google File System) by being split into multiple tablets: segments of the table, split along a row key chosen so that each tablet is about 200 megabytes in size.

Architecture

BigTable is not a relational database. It does not support joins, nor does it support rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a timestamp; there can be multiple versions of a cell with different timestamps. The timestamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."
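
To make that data model concrete, here is a minimal, illustrative in-memory sketch of the multidimensional sparse map; the ToyBigtable class and its method names are invented for this example and are not part of any real Bigtable API.

    import time
    from collections import defaultdict

    class ToyBigtable:
        """Toy model of Bigtable's data model: a sparse map of
        (row, column, timestamp) -> value, with versioned cells."""

        def __init__(self):
            # row key -> column -> list of (timestamp, value), newest first
            self._rows = defaultdict(lambda: defaultdict(list))

        def put(self, row, column, value, ts=None):
            cells = self._rows[row][column]
            cells.append((ts if ts is not None else time.time(), value))
            cells.sort(reverse=True)  # keep the newest version first

        def get(self, row, column, n_versions=1):
            """Return up to n newest versions ("select 'n' versions")."""
            return self._rows[row][column][:n_versions]

        def delete_older_than(self, row, column, cutoff_ts):
            """Drop versions older than a cutoff ("delete old cells")."""
            cells = self._rows[row][column]
            self._rows[row][column] = [(t, v) for (t, v) in cells if t >= cutoff_ts]

    t = ToyBigtable()
    t.put("com.example/index.html", "contents:", "<html>v1</html>", ts=1)
    t.put("com.example/index.html", "contents:", "<html>v2</html>", ts=2)
    print(t.get("com.example/index.html", "contents:", n_versions=2))
    # [(2, '<html>v2</html>'), (1, '<html>v1</html>')]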

In order to manage these huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. A tablet is around 200 MB, and each machine holds about 100 tablets. This setup allows the tablets of a single table to be spread among many servers, and it enables fine-grained load balancing: if one tablet is receiving many queries, its server can shed other tablets or move the busy tablet to a machine that is less loaded. Also, if a machine goes down, its tablets can be redistributed across many other servers, so the performance impact on any given machine is minimal.
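
Because tablets partition the table's sorted row space, finding the tablet that serves a row key is just a floor lookup over the tablets' start keys. A hypothetical sketch (the server names and split keys are invented):

    import bisect

    # Tablets cover contiguous, sorted row ranges. A row key is served by
    # the tablet whose start key is the greatest one <= the row key.
    tablet_start_keys = ["", "f", "m", "t"]   # 4 tablets over the key space
    tablet_servers    = ["srv-a", "srv-b", "srv-c", "srv-d"]

    def tablet_for(row_key: str) -> str:
        i = bisect.bisect_right(tablet_start_keys, row_key) - 1
        return tablet_servers[i]

    print(tablet_for("apple"))   # srv-a  (range "" .. "f")
    print(tablet_for("mango"))   # srv-c  (range "m" .. "t")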

Tables are stored as immutable SSTables plus a tail of logs (one log per machine). When a machine runs low on memory, it compresses some tablet data using Google's proprietary compression techniques (BMDiff and Zippy). Minor compactions write recently committed updates out to new SSTables, while major compactions rewrite all of a tablet's SSTables into a single one, discarding deleted data and reclaiming hard-disk space.
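
A rough sketch of that write path, with invented names and without logs or deletion tombstones: writes land in a mutable in-memory buffer, a minor compaction freezes it into a new immutable SSTable, and a major compaction rewrites everything into a single SSTable.

    memtable = {}    # key -> value, mutable in-memory buffer
    sstables = []    # list of immutable sorted tables, oldest first

    def write(key, value):
        memtable[key] = value

    def minor_compaction():
        """Freeze the memtable into a new immutable SSTable."""
        global memtable
        if memtable:
            sstables.append(dict(sorted(memtable.items())))
            memtable = {}

    def major_compaction():
        """Rewrite all SSTables into a single one, reclaiming space."""
        merged = {}
        for sst in sstables:          # later tables win on key collisions
            merged.update(sst)
        sstables[:] = [dict(sorted(merged.items()))]

    write("row1", "v1"); minor_compaction()
    write("row1", "v2"); write("row2", "v1"); minor_compaction()
    major_compaction()
    print(sstables)    # [{'row1': 'v2', 'row2': 'v1'}]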

The locations of Bigtable tablets are themselves stored in Bigtable cells. The lookup of any particular tablet is handled by a three-tiered system. Clients get a pointer to the META0 tablet, of which there is only one. The META0 tablet keeps track of many META1 tablets, which in turn contain the locations of the user tablets being looked up. Both META0 and META1 tablets make heavy use of prefetching and caching to minimize bottlenecks in the system.
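
Here is a self-contained toy of that three-tiered lookup, with plain dictionaries standing in for the real META0/META1 tablets; the cache mirrors how clients avoid repeating the hierarchy walk on every request.

    META0 = {"a": "meta1-a", "n": "meta1-n"}            # start key -> META1 tablet
    META1 = {
        "meta1-a": {"a": "tablet-1", "h": "tablet-2"},  # start key -> user tablet
        "meta1-n": {"n": "tablet-3", "u": "tablet-4"},
    }
    _cache = {}

    def _floor_lookup(table: dict, key: str) -> str:
        """Entry whose start key is the greatest one <= key."""
        starts = sorted(k for k in table if k <= key)
        return table[starts[-1]]

    def locate(row_key: str) -> str:
        if row_key not in _cache:                  # cache miss: walk the hierarchy
            meta1 = _floor_lookup(META0, row_key)  # tier 1: the single META0 tablet
            _cache[row_key] = _floor_lookup(META1[meta1], row_key)  # tier 2: a META1 tablet
        return _cache[row_key]                     # tier 3: the user tablet itself

    print(locate("banana"))   # tablet-1
    print(locate("orange"))   # tablet-3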

Implementation

BigTable is built on Google File System (GFS), which is used as a backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.

Another service that BigTable makes heavy use of is Chubby, a highly available, reliable distributed lock service. Chubby allows clients to take out a lock, possibly associating it with some metadata, which they can renew by sending keep-alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.
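
The keep-alive mechanism behaves like a lease: a lock is only held while renewals keep arriving. A hypothetical sketch (the Lease class is invented for illustration; real Chubby sessions run over RPC):

    import time

    class Lease:
        """Toy Chubby-style lease: held only while keep-alives arrive."""

        def __init__(self, name: str, ttl: float):
            self.name, self.ttl = name, ttl
            self.expires_at = time.time() + ttl

        def keep_alive(self):
            """Renew the lease, as a Chubby client does in the background."""
            self.expires_at = time.time() + self.ttl

        @property
        def held(self) -> bool:
            return time.time() < self.expires_at

    lease = Lease("/ls/cell/bigtable/master", ttl=0.2)  # filesystem-like lock name
    lease.keep_alive()
    print(lease.held)    # True while renewals keep arriving
    time.sleep(0.3)
    print(lease.held)    # False: keep-alives stopped, so the lease expired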

There are three primary server types of interest in the Bigtable system:

  1. Master servers: assign tablets to tablet servers, keep track of where tablets are located, and redistribute tablets as needed.
  2. Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100 MB to 200 MB). If a tablet server fails, then 100 other tablet servers may each pick up just one of its tablets, and the system recovers quickly (see the sketch after this list).
  3. Lock servers: instances of the Chubby distributed lock service. Many operations within BigTable require locks, including opening tablets for writing, ensuring that there is at most one active master at a time, and access control checking.
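
As mentioned in item 2 above, recovery is fast because a failed server's tablets are scattered across the survivors, so each one picks up only a tiny amount of extra load. A toy illustration with invented numbers:

    # 101 servers, each holding 100 tablets.
    servers = {f"srv-{i}": [f"tablet-{i}-{j}" for j in range(100)]
               for i in range(101)}

    def recover(failed: str):
        """Scatter the failed server's tablets across all survivors."""
        orphans = servers.pop(failed)
        survivors = list(servers)
        for i, tablet in enumerate(orphans):   # ~1 extra tablet per survivor
            servers[survivors[i % len(survivors)]].append(tablet)

    recover("srv-0")
    print(max(len(t) for t in servers.values()))   # 101: barely above the old 100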

API

Typical operations in BigTable are the creation and deletion of tables and column families, writing data, and deleting columns from a row. BigTable exposes these functions to application developers through an API. Transactions are supported at the row level, but not across multiple row keys.
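
As a concrete illustration, the following sketch uses the Cloud Bigtable Python client (the google-cloud-bigtable package); the project, instance, and table IDs are placeholders, and running it requires a real Bigtable instance. Note that all mutations to a single row commit atomically, matching the row-level transaction guarantee.

    from google.cloud import bigtable
    from google.cloud.bigtable import column_family, row_filters

    client = bigtable.Client(project="my-project", admin=True)
    instance = client.instance("my-instance")
    table = instance.table("my-table")

    # Create the table with one column family (admin operation).
    table.create(column_families={"cf1": column_family.MaxVersionsGCRule(3)})

    # Write: mutations to a single row commit atomically.
    row = table.direct_row(b"com.example/index.html")
    row.set_cell("cf1", b"contents", b"<html>...</html>")
    row.delete_cell("cf1", b"obsolete-column")
    row.commit()

    # Read back only the newest version of each cell in the row.
    result = table.read_row(b"com.example/index.html",
                            filter_=row_filters.CellsColumnLimitFilter(1))
    print(result.cells["cf1"][b"contents"][0].value)

    # Drop the table (admin operation).
    table.delete()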

528 questions

4 votes, 2 answers: Choosing a database type
When would you use a bigtable/simpledb database vs a relational database? (asked by Aaron Fischer)

4 votes, 1 answer: Column-family database sharding and replication [NoSQL Distilled]
In section 4.5, Combining Sharding and Replication, of the NoSQL Distilled book, the following assertion is made: "Using peer-to-peer replication and sharding is a common strategy for column-family databases." The statement leaves out other types… (asked by Duarte Nunes)

4 votes, 1 answer: Redis versus Cassandra (Bigtable data model)
Suppose I need to do the following operations intensively: put(key, value) where value is a map of . I haven't known NoSQL for long; what I know is that Cassandra insert (which conforms to the API defined in the Bigtable paper)… (asked by realjin)

3 votes, 3 answers: Does app engine automatically cache frequent queries?
I seem to remember reading somewhere that google app engine automatically caches the results of very frequent queries into memory so that they are retrieved faster. Is this correct? If so, is there still a charge for datastore reads on these… (asked by Chris Dutrow)

3 votes, 2 answers: Indexing business hours using google app engine
Given a time range in a day (business open hours), how could you create an index that would return an entity given the current time? I made the mistake of assuming a list property containing just the open and close hours could work on app engine. The… (asked by scottzer0)

3 votes, 1 answer: NoSQL keyword search in huge table
I'm curious how a NoSQL solution can support keyword search in a very, very big table distributed across multiple servers. By keyword search I mean a DB like the one Google has, with a huge number of documents, and with the ability to answer such… (asked by diemacht)

3 votes, 2 answers: To shard or not to shard? GAE/java/jdo
I'm currently porting some work from MySQL to Google App Engine/Java. I'm using JDO, as well as the lower-level Java API where required. I read through the optimization guide about sharding counters:… (asked by Dave)

3 votes, 1 answer: Multilingual website with not relational database
Can anyone advise how to organize a non-relational database for a multilingual site? There are some questions about this here, but they deal with MySQL, etc. "Multilanguage" means not only static content (we can do this with a framework) but dynamic content… (asked by Donotello)

3 votes, 1 answer: Janus Graph backend cassandra vs Bigtable
I am planning to use JanusGraph for building graphs for the different use cases our team handles, and I see that JanusGraph has the option to use BigTable or Cassandra as its storage backend. I am looking for any recommendation on which backend is more… (asked by Vishal)

3 votes, 2 answers: How to represent one-to-one relationship in App Engine
Say you have a concept of "user" records that you'd like to store in the data store. class User(db.Model): first_name = db.StringProperty() last_name = db.StringProperty() created = db.DateTimeProperty(auto_now_add=True) twitter_oauth_token… (asked by ʞɔıu)

3 votes, 1 answer: aggregation operation in cloud bigtable
I was going through the BT documentation and learned that data is stored in a column for a column family and accessed via row key. I want to understand whether aggregation (such as count or sum) can be done by BT, as Cassandra and ScyllaDB share a similar data… (asked by deep)

3 votes, 1 answer: Why HBase rows are said to be stored as lexicographically sorted?
Based on the HBase documentation, again following the reference from the Google BigTable paper, the rows are said to be stored with lexicographic sorting of the row key. It is evident that the rows are sorted lexicographically when we have a string… (asked by Betta)

3 votes, 2 answers: How to add limit options while fetching data from bigTable? Can someone give me the proper syntax to do so in NodeJS
Currently I am doing it like this: var [rowData] = await table.row(key).get({limit: 2}); but I am still getting 4 results instead of 2.

3 votes, 1 answer: Cloud Bigtable minimum recommended table size
According to the Cloud Bigtable performance docs, I should have a certain amount of data to ensure the highest throughput. Under "Causes of slower performance" it says: The workload isn't appropriate for Cloud Bigtable. If you test with a small… (asked by alasarr)

3 votes, 2 answers: Performance of cloud bigtable row filtering
What is happening on the bigtable server when you issue a prefix scan with row filtering? Say you perform a prefix scan using filtering and, as time goes on, more rows end up getting filtered out. I'm wondering if performance becomes degraded due to… (asked by Ian Herbert)