Questions tagged [bigtable]

Bigtable is a distributed storage system (built by Google) for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Bigtable

A Distributed Storage System for Structured Data

Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).

Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.

Some features

  • fast and extremely large-scale DBMS
  • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
  • designed to scale into the petabyte range
  • it works across hundreds or thousands of machines
  • it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration
  • each table has multiple dimensions (one of which is a field for time, allowing versioning)
  • tables are optimized for GFS (Google File System) by being split into multiple tablets: segments of the table, split at a row boundary chosen so that each tablet is about 200 megabytes in size

Architecture

Bigtable is not a relational database. It supports neither joins nor rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a timestamp. There can be multiple versions of a cell with different timestamps. The timestamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."
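The data model described above can be sketched as a toy in-memory structure. This is only an illustration, not Bigtable's implementation: the class `SparseTable`, its method names, and the sample row and column names are all made up for this example.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse map: (row, column, timestamp) -> value."""

    def __init__(self):
        # Only cells that were actually written take up space (sparse).
        self._cells = defaultdict(dict)  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column)][timestamp] = value

    def get_versions(self, row, column, n=1):
        """Return up to n (timestamp, value) pairs, newest first."""
        versions = self._cells.get((row, column), {})
        return sorted(versions.items(), reverse=True)[:n]

    def delete_older_than(self, cutoff):
        """Drop every cell version whose timestamp is below cutoff."""
        for key in list(self._cells):
            kept = {ts: v for ts, v in self._cells[key].items() if ts >= cutoff}
            if kept:
                self._cells[key] = kept
            else:
                del self._cells[key]

    def scan_rows(self):
        """Row keys in sorted order, as in a sorted map."""
        return sorted({row for row, _ in self._cells})


table = SparseTable()
for ts, html in enumerate(["v1", "v2", "v3"], start=1):
    table.put("com.example/index", "contents:html", ts, html)

# "select 'n' versions of this Web page"
print(table.get_versions("com.example/index", "contents:html", n=2))
# "delete cells that are older than a specific date/time"
table.delete_older_than(3)
print(table.get_versions("com.example/index", "contents:html", n=5))
```

A real system keeps rows physically sorted instead of sorting on every scan; the sketch only shows the shape of the map and the two timestamp-based operations.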

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves the pieces as tablets. A tablet is around 200 MB, and each machine stores about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing: if one tablet is receiving many queries, its server can shed other tablets or move the busy tablet to another machine that is less busy. Also, if a machine goes down, its tablets can be spread across many other servers so that the performance impact on any given machine is minimal.
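Splitting at row boundaries and spreading the pieces over servers can be sketched as follows. The function names, the round-robin placement, and the row sizes are assumptions made for this example; the real placement policy is not described here.

```python
def split_into_tablets(rows, max_tablet_bytes):
    """Group sorted (row_key, size_in_bytes) pairs into contiguous tablets.

    A split only ever happens between two rows, never inside one.
    """
    tablets, current, current_size = [], [], 0
    for row_key, size in sorted(rows):
        if current and current_size + size > max_tablet_bytes:
            tablets.append(current)
            current, current_size = [], 0
        current.append(row_key)
        current_size += size
    if current:
        tablets.append(current)
    return tablets


def assign_round_robin(tablets, servers):
    """Spread tablets over servers so one table lives on many machines."""
    assignment = {server: [] for server in servers}
    for i, tablet in enumerate(tablets):
        assignment[servers[i % len(servers)]].append(tablet)
    return assignment


# Five rows of 80 bytes each, with a (tiny) 200-byte tablet limit.
rows = [("a", 80), ("b", 80), ("c", 80), ("d", 80), ("e", 80)]
tablets = split_into_tablets(rows, max_tablet_bytes=200)
print(tablets)
print(assign_round_robin(tablets, ["server-1", "server-2"]))
```

Because each tablet is a contiguous range of sorted row keys, moving a tablet to another machine moves a whole key range at once.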

Tablets are stored as immutable SSTables plus a tail of logs (one log per machine). Recent writes are held in memory; when that in-memory buffer grows too large, the tablet server writes it out as a new SSTable (a minor compaction). SSTable data can be compressed using Google-proprietary compression techniques (BMDiff and Zippy). A major compaction rewrites all of a tablet's SSTables into a single SSTable and reclaims the disk space held by deleted data.
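The interplay of immutable sorted files and compactions can be sketched with a toy store. Everything here is an assumption for illustration: the class `TabletStore`, the entry-count threshold (a real system would track bytes), and the absence of the log and of compression.

```python
class TabletStore:
    """Toy store: a mutable in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit  # hypothetical threshold, in entries
        self.memtable = {}   # recent writes, mutable, in memory
        self.sstables = []   # immutable sorted runs, oldest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the buffer and write it out as one new immutable sorted run."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def major_compaction(self):
        """Merge every run into one, keeping only the newest value per key."""
        self.minor_compaction()
        merged = {}
        for run in self.sstables:  # oldest first, so newer runs overwrite
            merged.update(run)
        self.sstables = [tuple(sorted(merged.items()))]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None


store = TabletStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
print(len(store.sstables))  # two runs after two minor compactions
store.major_compaction()
print(store.sstables)       # one run, stale value of "a" is gone
```

The stale value of `"a"` only stops taking space after the major compaction, which is why that step is the one that recovers disk space.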

The locations of Bigtable tablets are stored in cells. The lookup of any particular tablet is handled by a three-tiered system. The clients get a pointer to the META0 table, of which there is only one. The META0 table keeps track of many META1 tablets that contain the locations of the tablets being looked up. Both META0 and META1 make heavy use of pre-fetching and caching to minimize bottlenecks in the system.
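A rough sketch of the tiered lookup, assuming each index level is a sorted list of (last row key covered, target) pairs. The server names, key ranges, and the use of `functools.lru_cache` as a stand-in for client-side caching are all hypothetical.

```python
import bisect
from functools import lru_cache


def find_entry(index, row_key):
    """index is a sorted list of (last_row_key_covered, target) pairs."""
    end_keys = [end for end, _ in index]
    pos = bisect.bisect_left(end_keys, row_key)
    if pos == len(index):
        raise KeyError(row_key)
    return index[pos][1]


# Tier 2: META1 tablets map row ranges to the servers holding user tablets.
META1_A = [("g", "tabletserver-1"), ("m", "tabletserver-2")]
META1_B = [("t", "tabletserver-3"), ("~", "tabletserver-1")]
# Tier 1: the single META0 table maps row ranges to META1 tablets.
META0 = [("m", META1_A), ("~", META1_B)]


@lru_cache(maxsize=1024)  # stand-in for the client-side location cache
def locate(row_key):
    meta1 = find_entry(META0, row_key)  # hop 1: META0 -> META1 tablet
    return find_entry(meta1, row_key)   # hop 2: META1 -> tablet server


print(locate("apple"))
print(locate("hello"))
print(locate("zebra"))
```

With the cache warm, a repeated lookup for the same key skips both index hops, which is the point of caching at these levels.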

Implementation

Bigtable is built on the Google File System (GFS), which is used as a backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.

Another service that Bigtable makes heavy use of is Chubby, a highly available, reliable distributed lock service. Chubby allows a client to take a lock, possibly associating it with some metadata, and to keep holding it by sending keep-alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.
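The lock-plus-keep-alive idea can be sketched as a lease: a lock stays held only while its owner keeps renewing it. This is not Chubby's API. The class `LeaseLockService`, the path `/bigtable/master`, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are assumptions for illustration.

```python
class LeaseLockService:
    """Toy lock service: a lock is held only while its lease is renewed."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._locks = {}  # path -> (owner, metadata, expiry_time)

    def _expire(self, path, now):
        held = self._locks.get(path)
        if held and held[2] <= now:
            del self._locks[path]

    def acquire(self, path, owner, now, metadata=None):
        """Take the lock at path unless someone holds an unexpired lease."""
        self._expire(path, now)
        if path in self._locks:
            return False
        self._locks[path] = (owner, metadata, now + self.lease_seconds)
        return True

    def keep_alive(self, path, owner, now):
        """Extend the lease; fails if the lease already lapsed or was lost."""
        self._expire(path, now)
        held = self._locks.get(path)
        if held is None or held[0] != owner:
            return False
        self._locks[path] = (owner, held[1], now + self.lease_seconds)
        return True


locks = LeaseLockService(lease_seconds=10)
print(locks.acquire("/bigtable/master", "master-1", now=0))     # lock taken
print(locks.acquire("/bigtable/master", "master-2", now=5))     # still held
print(locks.keep_alive("/bigtable/master", "master-1", now=8))  # lease extended
print(locks.acquire("/bigtable/master", "master-2", now=20))    # lease lapsed
```

If the holder stops sending keep-alives, the lease runs out and another client can take over, which is how a single active holder is enforced without the old holder having to release anything.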

There are three primary server types of interest in the Bigtable system:

  1. Master servers: assign tablets to tablet servers, keep track of where tablets are located, and redistribute tasks as needed.
  2. Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100 MB - 200 MB). If a tablet server fails, then 100 tablet servers each pick up one new tablet and the system recovers.
  3. Lock servers: instances of the Chubby distributed lock service. Lots of actions within BigTable require acquisition of locks including opening tablets for writing, ensuring that there is no more than one active Master at a time, and access control checking.
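The recovery step in the list above, where a failed tablet server's tablets are handed out across the remaining servers, can be sketched like this. The function name and the least-loaded-first policy are assumptions for the example, not the master's documented algorithm.

```python
def reassign_after_failure(assignment, failed_server):
    """Return a new assignment with the failed server's tablets spread out."""
    survivors = {s: list(t) for s, t in assignment.items() if s != failed_server}
    if not survivors:
        raise RuntimeError("no tablet servers left")
    for tablet in assignment.get(failed_server, []):
        # Give each orphaned tablet to the currently least-loaded survivor
        # (ties broken by server name so the result is deterministic).
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(tablet)
    return survivors


before = {"ts1": ["t1", "t2"], "ts2": ["t3"], "ts3": ["t4"]}
print(reassign_after_failure(before, "ts1"))
```

Because the orphaned tablets are handed out one at a time, no single survivor absorbs the whole load of the failed machine.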

API

Typical operations on Bigtable are the creation and deletion of tables and column families, writing data, and deleting columns from a row. Bigtable provides these functions to application developers through an API. Transactions are supported at the row level, but not across several row keys.
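Row-level atomicity can be illustrated with a toy table that takes one lock per row, so a read-modify-write on a single row is never interleaved with another writer of that row, while nothing ties two different rows together. The class `RowAtomicTable` and its methods are hypothetical and are not the Bigtable client API.

```python
import threading
from collections import defaultdict


class RowAtomicTable:
    """Toy table whose mutations are atomic per row, not across rows."""

    def __init__(self):
        self._rows = defaultdict(dict)
        self._row_locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, row_key):
        with self._guard:
            return self._row_locks[row_key]

    def mutate_row(self, row_key, mutation):
        """Apply mutation(row_dict) atomically with respect to this row only."""
        with self._lock_for(row_key):
            working = dict(self._rows[row_key])
            mutation(working)  # if this raises, the stored row is unchanged
            self._rows[row_key] = working

    def read_row(self, row_key):
        with self._lock_for(row_key):
            return dict(self._rows.get(row_key, {}))


def increment(row):
    row["count"] = row.get("count", 0) + 1


table = RowAtomicTable()
workers = [
    threading.Thread(
        target=lambda: [table.mutate_row("user#1", increment) for _ in range(100)]
    )
    for _ in range(8)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(table.read_row("user#1"))
```

An update that has to change two rows together gets no such guarantee in this model, which is the practical meaning of "not across several row keys".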

528 questions
0 votes, 1 answer

What NO-SQL solution is optimised for frequent view updates & recreation?

I am having a situation where every newly entered data set to a table prompts the re-creation of a number of views. I am currently trying CouchDB, but would appreciate feedback about other database solutions. Description: The table includes a number…
0 votes, 1 answer

No data relationships in big table databases?

If the relationships between the data are as important as the data itself (such as distance or path calculations), then don't use a column family/big table database. (Quoted from article Big data woes: Which database should I use? by Andrew…
Dominykas Mostauskis
0 votes, 2 answers

how GAE DataStore support transaction?

As I known, DataStore is implemented based on bigtable, and transaction only support in single entity group or maximum 5 cross entity groups, but IMHO bigtable only support single row transaction, Entities in the same entity group will be inserted…
pythonee
0 votes, 1 answer

How to get thespecific entity from google app engine datastore according to the specific button click

My Jsp file is like this. It has 2 buttons. When i click on 1st button then 1st entity from the datastore should be selected and same with the 2nd button. <%@ page language="java" contentType="text/html; charset=ISO-8859-1" …
Sandeep
0 votes, 1 answer

accumulo - batchscanner: one result per range

So my general question is "Is it possible to have an Accumulo BatchScanner only pull back the first result per Range I give it?" Now some details about my use case as there may be a better way to approach this anyway. I have data that represent…
jeff
0 votes, 1 answer

Detecting if something exists and then automating process

I was wondering if it would be possible to write a script in PHP which would proceed through an extremely large data set (100 million+) to try locate specific strings within the data set? If it is feasibly possible would it be an efficient form of…
Ciaran
0 votes, 2 answers

How should I find the rows with a duplicate field in a big table?

I have a table with 1.5M+ rows for recording downloads from a website which has email address of the one who has downloaded something. I want to find those who have downloaded more than 100 times. This is what I have tested but the query-time is…
SAVAFA
0 votes, 1 answer

Getting data into Hadoop

I come from a lot of SQL servers so it can be a bit difficult to picture exactly what happens to data when it goes into hadoop. My understanding is that if you have a book in a text format that could be around 200k or so... you simply copy the data…
0 votes, 1 answer

iphone table view delete entry and update app engine db

I have a tableview with data, that i post to the app engine database. Whenever i delete an entry in the table, i want to delelte the item in the app engine database as well. How do i know which entry to delete? I was thinking of this: for every item…
Ton
0 votes, 2 answers

Move or copy an entity to another kind

Is there a way to move an entity to another kind in appengine. Say you have a kind defines, and you want to keep a record of deleted entities of that kind. But you want to separate the storage of live object and archived objects. Kinds are basically…
Johan Carlsson
0 votes, 3 answers

how to convert a relational database to one Bigtable

I want to create one big table contains all the data from all table in database then export this table into csv file then import this file into Hbase ? My issue is first step which is how to create bigtable from all database tables? i will be…
Samy Louize Hanna
0 votes, 1 answer

Is modeling infinite-scale relationships in NoSQL / BigTable (GAE) possible?

My team is writing an application with GAE (Java) that has led me to question the scalability of entity relationship modeling (specifically many-to-many) in object oriented databases like BigTable. The preferred solution for modeling unowned…
0 votes, 1 answer

About the relationship in two entities

I'm thinking to create a property which store the key or the ID of the other entity as a reference to the entity. I want to know two things. 1. Which data should the property store, the key or the ID? 2. What should the type of the property be?…
Nigiri
0 votes, 1 answer

Initializing Bigtable with test data

My app has a Google App engine back end which uses BigTable for it's persistence. I have some functional tests I want to run which are dependent on existing Test data being preloaded in the database. What is the best way to preload this data as I…
MayoMan
0 votes, 1 answer

How to store indexable list/collection in an appengine entity?

After creating an entity: DatastoreService datastore = DatastoreServiceFactory.getDatastoreService(); Entity employee = new Entity("Employee"); How to set an index-able list property? like say: employee.setProperty("tag", "manager", "corrupted",…