
Basics

I have written a program that generates test data for MongoDB in Node.

The problem

For that, the program reads a schema file and generates a specified amount of test data from it. The problem is that this data can become quite big (think of creating 1M users, each with all the properties they need, and 20M chat messages with userFrom and userTo), and the program has to keep all of it in RAM to modify/transform/map it before saving it to a file.

How it works

The program works like this (a rough sketch follows the list):

  1. Read schema file
  2. Create test data from the schema and store it in a structure (look down below for the structure)
  3. Run through this structure and link every object's referenceTo to a random object with a matching referenceKey.
  4. Transform the object structure into a string[] of MongoDB insert statements
  5. Store that string[] in a file.
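A rough sketch of that flow; the helper functions are placeholders for the real implementation, and IGeneratedCollection is the structure shown below:

import { readFileSync, writeFileSync } from "fs";

// Placeholders for the real steps; only the overall flow is shown here.
declare function generateFromSchema(schema: unknown): IGeneratedCollection[];  // step 2
declare function resolveReferences(collections: IGeneratedCollection[]): void; // step 3
declare function toMongoInserts(collections: IGeneratedCollection[]): string[]; // step 4

function run(schemaPath: string, outPath: string): void {
    const schema = JSON.parse(readFileSync(schemaPath, "utf8")); // step 1: read the schema file
    const collections = generateFromSchema(schema);              // step 2: generate the test data
    resolveReferences(collections);                              // step 3: link referenceTo to referenceKey
    const inserts = toMongoInserts(collections);                 // step 4: build the insert statements
    writeFileSync(outPath, inserts.join("\n"));                  // step 5: write them to a file
}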

This is the structure of the generated test data:

export interface IGeneratedCollection {
    dbName: string,                 // Name of the database
    collectionName: string,         // Name of the collection
    documents: IGeneratedDocument[] // One collection has many documents
}

export interface IGeneratedDocument {
    documentFields: IGeneratedField[] // One document has many fields (which are recursive, because of nested documents)
}

export interface IGeneratedField {
    fieldName: string, // Name of the property
    fieldValue: any,   // Value of the property (Can also be IGeneratedField, IGeneratedField[], ...)
    fieldNeedsQuotations?: boolean, // If the value needs to be saved wrapped in " ... "
    fieldIsObject?: boolean,        // If the value is an object (stored as IGeneratedField[]) (handled differently when transforming to MongoDB inserts)
    fieldIsJsonObject?: boolean,    // If the value is a plain JSON object
    fieldIsArray?: boolean,         // If the value is an array of objects (stored as an array of IGeneratedField[])
    referenceKey?: number,          // Field flagged to be a key
    referenceTo?: number            // Value gets set to a random object with matching referenceKey
}
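
For illustration, a single (made-up) user document and one message document in this structure might look roughly like this:

const exampleUser: IGeneratedDocument = {
    documentFields: [
        { fieldName: "_id", fieldValue: 1, referenceKey: 1 },                   // flagged as a key
        { fieldName: "name", fieldValue: "Alice", fieldNeedsQuotations: true },
        {
            fieldName: "address",
            fieldIsObject: true,                                                 // nested document
            fieldValue: [
                { fieldName: "city", fieldValue: "Vienna", fieldNeedsQuotations: true }
            ]
        }
    ]
};

const exampleMessage: IGeneratedDocument = {
    documentFields: [
        { fieldName: "message", fieldValue: "Hi!", fieldNeedsQuotations: true },
        { fieldName: "userFrom", fieldValue: null, referenceTo: 1 },             // resolved in step 3
        { fieldName: "userTo", fieldValue: null, referenceTo: 1 }
    ]
};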

Actual data

So in the example with 1M Users and 20M messages it would look like this:

  • 1x IGeneratedCollection (collectionName = "users")
    • 1Mx IGeneratedDocument
      • 10x IGeneratedField (For example each user has 10 fields)
  • 1x IGeneratedCollection (collectionName = "messages")
    • 20Mx IGeneratedDocument
      • 3x IGeneratedField (message, userFrom, userTo)

Which would result in 70M instances of IGeneratedField (1M x 10 + 20M x 3 = 70M).

Conclusion

This is obviously a lot for the RAM to handle, as all of it needs to be stored at the same time.

Temporary Solution

It now works like this (see the sketch after the list):

  1. Generate 500 documents (rows in SQL terms) at a time
  2. JSON.stringify those 500 documents and put them in a SQLite table with the schema (dbName STRING, collectionName STRING, value JSON)
  3. Remove those 500 documents from JS and let the Garbage Collector do its thing
  4. Repeat until all data is generated and in the SQLite table
  5. Take one of the rows (each containing 500 documents) at a time, apply JSON.parse and search for keys in them
  6. Repeat until all data is queried and all keys retrieved
  7. Take one of the rows at a time, apply JSON.parse and search for key references in them
  8. Apply JSON.stringify and update the row if necessary (if key references found and resolved)
  9. Repeat until all data is queried and all keys are resolved
  10. Take one of the rows at a time, apply JSON.parse and transform the documents into valid SQL/MongoDB inserts
  11. Add each insert (string) to a SQLite table with the schema (singleInsert STRING)
  12. Remove the old and now unused row from the SQLite table
  13. Write all inserts to file (if run from the command line) or return a dataHandle to query the data in the SQLite table (if run from other node app)
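
A condensed sketch of steps 1–4, assuming the better-sqlite3 package; the table layout and generateDocuments() are stand-ins for the real code:

import Database from "better-sqlite3"; // assumed library choice for the sketch

declare function generateDocuments(count: number): IGeneratedDocument[]; // placeholder for the generation step

const db = new Database("testdata.tmp.sqlite");
db.exec("CREATE TABLE IF NOT EXISTS docs (dbName TEXT, collectionName TEXT, value TEXT)");
const insertBatch = db.prepare("INSERT INTO docs (dbName, collectionName, value) VALUES (?, ?, ?)");

const batchSize = 500;

function generateCollection(dbName: string, collectionName: string, total: number): void {
    for (let done = 0; done < total; done += batchSize) {
        const docs = generateDocuments(Math.min(batchSize, total - done)); // 1. generate 500 documents at a time
        insertBatch.run(dbName, collectionName, JSON.stringify(docs));     // 2. stringify the batch into one row
        // 3. docs goes out of scope here, so the garbage collector can reclaim it
    }                                                                      // 4. repeat until everything is generated
}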

This solution does handle the RAM problem, because SQLite keeps the data on disk and only holds a limited page cache in RAM.

BUT

As you can see, there are a lot of JSON.parse and JSON.stringify calls involved, which slows the whole process down drastically.

What I have thought:

Maybe I should modify IGeneratedField to use shortened property names (fieldName -> fn, fieldValue -> fv, fieldIsObject -> fio, fieldIsArray -> fia, ...).

This would reduce the storage needed in the SQLite table, BUT it would also make the code harder to read.
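
A small sketch of that idea: keep the readable names in the TypeScript code and only shorten the keys at the SQLite boundary (the short names below are just one possible mapping):

// Long -> short property names, applied only when (de)serializing.
const shortKeys: Record<string, string> = {
    fieldName: "fn",
    fieldValue: "fv",
    fieldNeedsQuotations: "fnq",
    fieldIsObject: "fio",
    fieldIsJsonObject: "fijo",
    fieldIsArray: "fia",
    referenceKey: "rk",
    referenceTo: "rt"
};

function shortenField(field: IGeneratedField): Record<string, unknown> {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(field)) {
        out[shortKeys[key] ?? key] = value; // nested values are left as-is for brevity
    }
    return out;
}

// Usage: JSON.stringify(doc.documentFields.map(shortenField)) before writing the row to SQLite.

The reverse mapping would be applied after JSON.parse, so the rest of the code could keep using the long names.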

Another idea: use a document-oriented database (but I have not really found a fitting one) to handle the JSON data better.

The Question

Is there any better solution to handle big objects like this in Node?

Is my temporary solution OK? What is bad about it? Can it be changed to perform better?

  • I'd use SQLite and an in-memory table. SQLite is built to handle the RAM and disk swap management. Similar to solution 3 but there's no need to host a database or require network connections. The SQLite database would exist on the destination PC – Jason Jun 10 '19 at 01:34
  • @Jason Thanks for your time. Do you think it would be a good idea to use SQLite? It's a relational database and the objects that are saved have nested elements and so on. Would this even be possible in SQLite? (I have little to no knowledge in SQLite). – MauriceNino Jun 10 '19 at 01:37

1 Answer


Conceptually, generate items in a stream.

You don't need all 1M users in the db. You could add 10k at a time.

For the messages, randomly sample 2n users from the db and have them send messages to each other. Repeat until satisfied.

Example:

// Assume Users and Messages are both MongoDB db.collection objects,
// and that generateUser() and generateMessage(u1, u2) exist.
const _ = require('lodash'); // provides _.range and _.chunk

const desiredUsers = 10000;
const desiredMessages = 5000000;
const blockSize = 1000;

(async () => {

    // Insert users in blocks of blockSize instead of building them all in memory.
    for (const i of _.range(desiredUsers / blockSize)) {
        const users = _.range(blockSize).map(generateUser);
        await Users.insertMany(users);
    }

    // For each block, sample 2 * blockSize existing users and pair them up.
    for (const i of _.range(desiredMessages / blockSize)) {
        const users = await Users.aggregate([ { $sample: { size: 2 * blockSize } } ]).toArray();
        const messages = _.chunk(users, 2).map((usr) => generateMessage(usr[0], usr[1]));
        await Messages.insertMany(messages);
    }

})();

Depending on how you tweak the stream, you get a different distribution. This one is uniform. You can get a more long-tailed distribution by interleaving the user and message generation, which you might want for something like message boards.
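
A possible sketch of that interleaving, reusing the names from the example above (Users, Messages, generateUser, generateMessage, _, desiredUsers, blockSize): each round inserts a block of users and then samples message pairs from everyone inserted so far, so earlier users accumulate more messages.

(async () => {
    for (const i of _.range(desiredUsers / blockSize)) {
        // Insert the next block of users.
        const users = _.range(blockSize).map(generateUser);
        await Users.insertMany(users);

        // Sample pairs from all users inserted so far; older users stay eligible
        // in every round, which skews message counts towards them (long tail).
        const sampled = await Users.aggregate([ { $sample: { size: 2 * blockSize } } ]).toArray();
        const messages = _.chunk(sampled, 2).map((pair) => generateMessage(pair[0], pair[1]));
        await Messages.insertMany(messages);
    }
})();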

Memory usage

It went to 200MB after I switched blockSize to 1000.


  • The users -> messages example was only a really small one. Think about a model where each user has a list of friends, a list of friend requests, a list of groups, and so on. Then it would get really complex, I think. Also, don't forget that the program has no direct access to the database, so it can't save the data and fetch it again. There would need to be a temp database in between, which would again add a lot of reads/writes. – MauriceNino Jun 10 '19 at 12:47
  • @MauriceNino "don't forget that the program has no direct access to the database" Sorry, I didn't see that requirement. Normally what people want to test is the performance of a working database, via whatever API they are using. It's quite rare that one actually tests insert commands on raw MongoDB. Anyway, if you really have that kind of issue, what I personally would do is generate a database externally, then mongodump/mongorestore it to a testing server. – blackening Jun 11 '19 at 02:42
  • @MauriceNino It's not really complex if you treat it as a random graph problem. Sample nodes, then add edges. Huge amounts of data can be generated out of a schema using https://www.npmjs.com/package/dummy-json. All you need to do is add "foreign-keys" (or embedded items) from sampling. – blackening Jun 11 '19 at 02:51
  • As I see it there are two use cases. 1) Create test data to validate that the front end is working / test with 'real data'. 2) As you described it -> penetration testing your db. To support both use cases I have to create the stuff I want to insert in the program and THEN write it out to whatever location I want (directly to the db or to a file). – MauriceNino Jun 11 '19 at 07:47
  • Thanks for the tool. If I had had that before starting to write this program, I could have saved the effort. Now I want to finish it though. – MauriceNino Jun 11 '19 at 07:48