
Background:

I'm attempting to follow a tutorial in which I'm importing a CSV file of approximately 324MB into MongoLab's sandbox plan (capped at 500MB), via pymongo in Python 3.4.

The file holds ~770,000 records; after inserting ~164,000 of them, I hit my quota and received:

raise OperationFailure(error.get("errmsg"), error.get("code"), error)

OperationFailure: quota exceeded 

Question:

Would it be accurate to say that NoSQL's JSON-like document structure takes more space to hold the same data than a CSV file does? Or am I doing something screwy here?

Further information:

Here are the database metrics:

[screenshot: Database Metrics page, showing a data size of about 317MB and a filesize total of 496MB]

Here's the Python 3.4 code I used:

import sys
import pymongo
import csv


MONGODB_URI = '***credentials removed***'


def main(args):
    client = pymongo.MongoClient(MONGODB_URI)

    # The database name is taken from the connection URI
    db = client.get_default_database()
    projects = db['projects']

    # DictReader yields one dict per CSV row; insert() accepts an
    # iterable of documents and bulk-inserts them
    with open('opendata_projects.csv') as f:
        records = csv.DictReader(f)
        projects.insert(records)

    client.close()


if __name__ == '__main__':
    main(sys.argv[1:])
Chuck

2 Answers


Yes, JSON takes up much more space than CSV. Here's an example:

name,age,job
Joe,35,manager
Fred,47,CEO
Bob,23,intern
Edgar,29,worker

Translated to JSON, it would be:

[
    {
        "name": "Joe",
        "age": 35,
        "job": "manager"
    },
    {
        "name": "Fred",
        "age": 47,
        "job": "CEO"
    },
    {
        "name": "Bob",
        "age": 23,
        "job": "intern"
    },
    {
        "name": "Edgar",
        "age": 29,
        "job": "worker"
    }
]

Even with all whitespace removed, the JSON is 158 characters, while the CSV is only 69 characters.
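
If you want to check the arithmetic yourself, here's a quick sketch (the data is just the sample above; key order may vary in older Pythons, but the length is the same either way):

import json

csv_text = ("name,age,job\n"
            "Joe,35,manager\n"
            "Fred,47,CEO\n"
            "Bob,23,intern\n"
            "Edgar,29,worker")

records = [
    {"name": "Joe", "age": 35, "job": "manager"},
    {"name": "Fred", "age": 47, "job": "CEO"},
    {"name": "Bob", "age": 23, "job": "intern"},
    {"name": "Edgar", "age": 29, "job": "worker"},
]

# separators=(',', ':') strips all whitespace from the JSON output
json_text = json.dumps(records, separators=(',', ':'))

print(len(csv_text))   # 69
print(len(json_text))  # 158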

MattDMo

Not accounting for things like compression, a set of JSON documents will take up more space than the equivalent CSV, because the field names are repeated in every record, whereas in the CSV the field names appear only in the header row.
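
To get a feel for what that repetition costs at your scale, here's a rough back-of-the-envelope sketch (the field names below are hypothetical; substitute the actual column headers from your CSV):

field_names = ['school_id', 'teacher_id', 'subject', 'total_price']

# Every stored document repeats each field name once, so the key
# names alone cost roughly this much across the whole collection
per_record = sum(len(name) for name in field_names)
num_records = 770000

print(per_record * num_records / (1024.0 * 1024.0))  # ~27MB for just these four names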

The way files are allocated is another factor:

In the filesize section of the Database Metrics screenshot you attached, notice that the first file allocated is 16MB, the next is 32MB, and so on. So when your data grew past 240MB total, you had five files: 16MB, 32MB, 64MB, 128MB, and 256MB. This explains why your filesize total is 496MB even though your data size is only about 317MB. The next file allocated would be 512MB, which would put you way past the 500MB limit.
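
You can see how quickly the doubling eats the quota with a few lines (a sketch of the allocation pattern described above; the 2GB per-file cap is how MongoDB's MMAPv1 storage engine behaves, which I'm assuming the sandbox uses):

size = 16
files = []
data_mb = 317                    # approximate data size from the metrics

# Allocate doubling files until there's room for the data
while sum(files) < data_mb:
    files.append(size)
    size = min(size * 2, 2048)   # MMAPv1 caps each data file at 2GB

print(files)       # [16, 32, 64, 128, 256]
print(sum(files))  # 496 -> the filesize total, just under the 500MB quota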

1.618