19

We've got a file-based program we want to convert to use a document database, specifically MongoDB. Problem is, MongoDB is limited to 2GB on 32-bit machines (according to http://www.mongodb.org/display/DOCS/FAQ#FAQ-Whatarethe32bitlimitations%3F), and a lot of our users will have over 2GB of data. Is there a way to have MongoDB use more than one file somehow?

I thought perhaps I could implement sharding on a single machine, meaning I'd run more than one mongod on the same machine and they'd somehow communicate. Could that work?

configurator
  • 40,828
  • 14
  • 81
  • 115
  • This is the biggest limitation/issue stopping me from using MongoDB in a new project! What a pity!!! – Edwin Yip Dec 07 '10 at 16:33
  • @Edwin: Sharding does solve the problem quite elegantly, if you know how big your database will be in advance. – configurator Dec 07 '10 at 21:52
  • 2
    Seriously, a bounty on a question regarding 32-bit machines? Four years later, 32-bit machines are mostly found in museums. Most *cellphones* are 64-bit these days – mnemosyn Jun 04 '15 at 13:16
  • 1
    As @mnemosyn eloquently states, running server processes on 32-bit architectures should be more or less irrelevant, even though officially supported. Based on the accepted answer here, sharding is the relevant approach, and contrary to the comments at the time, it is of course supported. However, while you can run multiple processes on a single machine, there is only so much memory that can be used, which makes such a choice not very efficient or effective. –  Jun 06 '15 at 10:09

3 Answers

7

The only way to have more than 2GB on a single node is to run multiple mongod processes. So sharding is one option (like you said), or you could do some manual partitioning across processes.
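
For illustration, a minimal sketch of such manual partitioning, assuming two mongod processes are already running locally on ports 27018 and 27019; the ports, database and collection names, and the hash-based routing rule are all made up for the example, and it uses pymongo's MongoClient:

import hashlib
import pymongo

# Two already-running mongod processes (ports are illustrative).
clients = [
    pymongo.MongoClient('localhost', 27018),
    pymongo.MongoClient('localhost', 27019),
]

def collection_for(key):
    # Use a stable hash so a given key always maps to the same process,
    # even across interpreter restarts.
    idx = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % len(clients)
    return clients[idx].mydb.docs

def save_doc(doc):
    collection_for(doc['_id']).insert_one(doc)

def load_doc(key):
    return collection_for(key).find_one({'_id': key})

Each process then only has to address the subset of data routed to it, which is the point of running more than one mongod on a 32-bit host.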

mdirolf
  • 7,521
  • 2
  • 23
  • 15
  • Would sharding by running multiple processes on a single machine even work though? – configurator Jun 07 '10 at 13:36
  • @mdirolf How could an increased number of mongod processes (on one physical server) change the picture, if a 32-bit OS can still address only a limited amount of memory? Sharding may help, but only if the shards are located on different hosts (otherwise the total storage size on the server can't exceed the 2GB limit anyway). – Vasil Remeniuk Jun 07 '10 at 13:52
  • 3
    I think the problem with using memory-mapped files is that a process with 32-bit pointers can't point to data beyond that range - not that the OS can't open the files. – configurator Jun 07 '10 at 14:03
  • Yep, that's what I meant. It's about the RAM that can be addressed by the OS (on 32-bit systems it's limited to 4GB, AFAIK). – Vasil Remeniuk Jun 07 '10 at 14:26
  • Using memory-mapped files doesn't mean the entire file needs to be in RAM - it just means you only get the 32-bit address space inside the file. – configurator Jun 07 '10 at 15:43
  • 1
    @Vasil yup, exactly what configurator said - each process has its own address space so if you have multiple processes each will be able to address ~2.5GB. – mdirolf Jun 08 '10 at 04:23
  • Thank you both, guys. I've glanced over some articles about memory-mapped files and virtual memory, and now it's much clearer. I agree that theoretically it should be possible to start several instances of MongoDB sharing one machine's virtual memory. What confuses me is that the MongoDB wiki doesn't say a word about sharding as a way to work around 32-bit limitations, which to me means it's, at the least, not a setup recommended for production use. The other question is, how many instances of MongoDB can work effectively while sharing the resources of one machine? 2GB is a very low threshold... – Vasil Remeniuk Jun 08 '10 at 06:58
  • 4
    Yeah we don't recommend it because it's probably overly complex (and sharding isn't in a production release yet). The recommendation is really just to find a 64 bit machine to deploy on (I know this isn't an option for some people, though). – mdirolf Jun 08 '10 at 14:16
0

You could configure sharding, because the 2GB limit only applies to individual mongod processes. Please refer to the sharded-clusters documentation; I also found a Python script that sets up a sharded environment on a single machine:

#!/usr/bin/python2

import os
import sys
import shutil
import pymongo
import atexit

from socket import error, socket, AF_INET, SOCK_STREAM
from select import select
from subprocess import Popen, PIPE, STDOUT
from threading import Thread
from time import sleep

try:
    # new pymongo
    from bson.son import SON
except ImportError:
    # old pymongo
    from pymongo.son import SON

# BEGIN CONFIGURATION

# some settings can also be set on command line. start with --help to see options

BASE_DATA_PATH='/data/db/sharding/' #warning: gets wiped every time you run this
MONGO_PATH=os.getenv( "MONGO_HOME" , os.path.expanduser('~/10gen/mongo/') )
N_SHARDS=3
N_CONFIG=1 # must be either 1 or 3
N_MONGOS=1
CHUNK_SIZE=64 # in MB (make small to test splitting)
MONGOS_PORT=27017 if N_MONGOS == 1 else 10000 # start at 10001 when multi
USE_SSL=False # set to True if running with SSL enabled

CONFIG_ARGS=[]
MONGOS_ARGS=[]
MONGOD_ARGS=[]

# Note this reports a lot of false positives.
USE_VALGRIND=False
VALGRIND_ARGS=["valgrind", "--log-file=/tmp/mongos-%p.valgrind", "--leak-check=yes", 
               ("--suppressions="+MONGO_PATH+"valgrind.suppressions"), "--"]

# see http://pueblo.sourceforge.net/doc/manual/ansi_color_codes.html
CONFIG_COLOR=31 #red
MONGOS_COLOR=32 #green
MONGOD_COLOR=36 #cyan
BOLD=True

# defaults -- can change on command line
COLLECTION_KEYS = {'foo' : '_id', 'bar': 'key', 'foo2' : 'a,b' }

def AFTER_SETUP():
    # feel free to change any of this
    # admin and conn are both defined globaly
    admin.command('enablesharding', 'test')

    for (collection, keystr) in COLLECTION_KEYS.iteritems():
        key=SON((k,1) for k in keystr.split(','))
        admin.command('shardcollection', 'test.'+collection, key=key)

    admin.command('shardcollection', 'test.fs.files', key={'_id':1})
    admin.command('shardcollection', 'test.fs.chunks', key={'files_id':1})


# END CONFIGURATION

for x in sys.argv[1:]:
    opt = x.split("=", 1)
    if opt[0] != '--help' and len(opt) != 2:
        raise Exception("bad arg: " + x )

    if opt[0].startswith('--'):
        opt[0] = opt[0][2:].lower()
        if opt[0] == 'help':
            print sys.argv[0], '[--help] [--chunksize=200] [--port=27017] [--path=/where/is/mongod] [collection=key]'
            sys.exit()
        elif opt[0] == 'chunksize':
            CHUNK_SIZE = int(opt[1])
        elif opt[0] == 'port':
            MONGOS_PORT = int(opt[1])
        elif opt[0] == 'path':
            MONGO_PATH = opt[1]
        elif opt[0] == 'usevalgrind': #intentionally not in --help
            USE_VALGRIND = int(opt[1])
        else:
            raise( Exception("unknown option: " + opt[0] ) )
    else:
        COLLECTION_KEYS[opt[0]] = opt[1]

if MONGO_PATH[-1] != '/':
    MONGO_PATH = MONGO_PATH+'/'

print( "MONGO_PATH: " + MONGO_PATH )

if not USE_VALGRIND:
    VALGRIND_ARGS = []

# fixed "colors"
RESET = 0
INVERSE = 7

if os.path.exists(BASE_DATA_PATH):
    print( "removing tree: %s" % BASE_DATA_PATH )
    shutil.rmtree(BASE_DATA_PATH)

mongod = MONGO_PATH + 'mongod'
mongos = MONGO_PATH + 'mongos'

devnull = open('/dev/null', 'w+')

fds = {}
procs = []

def killAllSubs():
    for proc in procs:
        try:
            proc.terminate()
        except OSError:
            pass #already dead
atexit.register(killAllSubs)

def mkcolor(colorcode): 
    base = '\x1b[%sm'
    if BOLD:
        return (base*2) % (1, colorcode)
    else:
        return base % colorcode

def ascolor(color, text):
    return mkcolor(color) + text + mkcolor(RESET)

def waitfor(proc, port):
    trys = 0
    while proc.poll() is None and trys < 40: # ~10 seconds
        trys += 1
        s = socket(AF_INET, SOCK_STREAM)
        try:
            try:
                s.connect(('localhost', port))
                return
            except (IOError, error):
                sleep(0.25)
        finally:
            s.close()

    #extra prints to make line stand out
    print
    print proc.prefix, ascolor(INVERSE, 'failed to start')
    print

    sleep(1)
    killAllSubs()
    sys.exit(1)


def printer():
    while not fds: sleep(0.01) # wait until there is at least one fd to watch

    while fds:
        (files, _ , errors) = select(fds.keys(), [], fds.keys(), 1)
        for file in set(files + errors):
            # try to print related lines together
            while select([file], [], [], 0)[0]:
                line = file.readline().rstrip()
                if line:
                    print fds[file].prefix, line
                else:
                    if fds[file].poll() is not None:
                        print fds[file].prefix, ascolor(INVERSE, 'EXITED'), fds[file].returncode
                        del fds[file]
                        break
                break

printer_thread = Thread(target=printer)
printer_thread.start()


configs = []
for i in range(1, N_CONFIG+1):
    path = BASE_DATA_PATH +'config_' + str(i)
    os.makedirs(path)
    config = Popen([mongod, '--port', str(20000 + i), '--configsvr', '--dbpath', path] + CONFIG_ARGS, 
                   stdin=devnull, stdout=PIPE, stderr=STDOUT)
    config.prefix = ascolor(CONFIG_COLOR, 'C' + str(i)) + ':'
    fds[config.stdout] = config
    procs.append(config)
    waitfor(config, 20000 + i)
    configs.append('localhost:' + str(20000 + i))


for i in range(1, N_SHARDS+1):
    path = BASE_DATA_PATH +'shard_' + str(i)
    os.makedirs(path)
    shard = Popen([mongod, '--port', str(30000 + i), '--shardsvr', '--dbpath', path] + MONGOD_ARGS,
                  stdin=devnull, stdout=PIPE, stderr=STDOUT)
    shard.prefix = ascolor(MONGOD_COLOR, 'M' + str(i)) + ':'
    fds[shard.stdout] = shard
    procs.append(shard)
    waitfor(shard, 30000 + i)


#this must be done before starting mongos
for config_str in configs:
    host, port = config_str.split(':')
    config = pymongo.Connection(host, int(port), ssl=USE_SSL).config
    config.settings.save({'_id':'chunksize', 'value':CHUNK_SIZE}, safe=True)
del config #don't leave around connection directly to config server

if N_MONGOS == 1:
    MONGOS_PORT -= 1 # added back in loop

for i in range(1, N_MONGOS+1):
    router = Popen(VALGRIND_ARGS + [mongos, '--port', str(MONGOS_PORT+i), '--configdb' , ','.join(configs)] + MONGOS_ARGS,
                   stdin=devnull, stdout=PIPE, stderr=STDOUT)
    router.prefix = ascolor(MONGOS_COLOR, 'S' + str(i)) + ':'
    fds[router.stdout] = router
    procs.append(router)

    waitfor(router, MONGOS_PORT + i)

conn = pymongo.Connection('localhost', MONGOS_PORT + 1, ssl=USE_SSL)
admin = conn.admin

for i in range(1, N_SHARDS+1):
    admin.command('addshard', 'localhost:3000'+str(i), allowLocal=True)

AFTER_SETUP()

# just to be safe
sleep(2)

print '*** READY ***'
print 
print 

try:
    printer_thread.join()
except KeyboardInterrupt:
    pass
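
Once the script prints READY, you can sanity-check the cluster from a separate Python session; a minimal sketch, assuming the mongos is listening on port 27017 (the default MONGOS_PORT above) and using the newer MongoClient class rather than the script's Connection:

import pymongo

# Connect to the mongos started by the script (default MONGOS_PORT = 27017)
# and list the shards that were registered with 'addshard'.
conn = pymongo.MongoClient('localhost', 27017)
print(conn.admin.command('listShards'))
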
Heisenberg
  • 865
  • 10
  • 26
-2

The best approach is to work within the virtual memory and storage limits that apply to MongoDB's data files.

MongoDB's storage limits on different operating systems are tabulated below, as per the MongoDB 3.0 MMAPv1 storage engine limits.

The MMAPv1 storage engine limits each database to no more than 16000 data files. This means that a single MMAPv1 database has a maximum size of 32TB. Setting the storage.mmapv1.smallFiles option reduces this limit to 8TB.
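
For reference, a minimal sketch of where that option lives in a mongod YAML configuration file (MongoDB 3.0+; the dbPath is a placeholder):

# Illustrative mongod config enabling small files for the MMAPv1 engine.
storage:
  dbPath: /data/db
  engine: mmapv1
  mmapv1:
    smallFiles: true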

Using the MMAPv1 storage engine, a single mongod instance cannot manage a data set that exceeds the maximum virtual memory address space provided by the underlying operating system.

Virtual Memory Limitations

Operating System                          Journaled       Not Journaled
Linux                                     64 terabytes    128 terabytes
Windows Server 2012 R2 and Windows 8.1    64 terabytes    128 terabytes
Windows (otherwise)                       4 terabytes     8 terabytes

Reference: MongoDB Database Limit.

Note: The WiredTiger storage engine is not subject to this limitation.

Hope this helps.

SUNDARRAJAN K
  • 2,237
  • 2
  • 22
  • 38