
Issue

I have at least 10 text files (CSV), each around 5 GB in size. There is no issue when I import the first text file, but when I start importing the second text file it shows the Maximum Size Limit (16 MB) error.

My primary purpose for using the database is to look up customers by a customer_id index.
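
Roughly, the lookup I need looks like this (a minimal pymongo sketch; the connection string, database name and field value are just placeholders for my setup):

```python
from pymongo import MongoClient

# Placeholder connection details - adjust to the actual deployment.
client = MongoClient("mongodb://localhost:27017")
customers = client["customersdb"]["Customers"]

# Index on customer_id so lookups don't scan all ~8.8M documents.
customers.create_index("customer_id")

# The only query pattern I really need: fetch one customer by id.
doc = customers.find_one({"customer_id": "C0012345"})
print(doc)
```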

Given below are the details of one CSV file.

| Collection Name | Documents | Avg. Document Size | Total Document Size | Num. Indexes | Total Index Size | Properties |
| --- | --- | --- | --- | --- | --- | --- |
| Customers | 8,874,412 | 1.8 KB | 15.7 GB | 3 | 262.0 MB | |

To overcome this, the MongoDB community recommends GridFS, but the problem with GridFS is that the data is stored as raw bytes, so it is not possible to query for a specific index in the text file.

I don't know if it's possible to query for a specific index in a text file when using GridFS. If someone knows, any help is appreciated.
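
As far as I understand, GridFS only lets you query the file-level metadata, not the rows inside the file. Something like this rough pymongo/gridfs sketch (file and database names are made up):

```python
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["customersdb"]
fs = gridfs.GridFS(db)

# The whole CSV goes in as opaque chunks...
with open("customers_part1.csv", "rb") as fh:
    file_id = fs.put(fh, filename="customers_part1.csv")

# ...so you can only query file metadata (filename, length, upload date),
# not "give me the row where customer_id = X".
grid_out = fs.get(file_id)
print(grid_out.filename, grid_out.length)
```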

The other solution I thought about was creating multiple instances of MongoDB running on different ports. Is this method feasible?

  1. But most tutorials on multiple instances show how to create a replica set, thereby storing the same data in the PRIMARY and the SECONDARY.
  2. The SECONDARY instances don't allow writes and only allow reads.

Is it possible to create multiple instances of MongoDB without creating a replica set, with both read and write operations on them? If yes, how? Can this method overcome the 16 MB limit?

The second solution I thought about was sharding the collection. Can this method overcome the 16 MB limit? If yes, any help regarding this is appreciated.
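
From the tutorials I have seen, sharding the collection would look roughly like this (a sketch run against a mongos router; the database and collection names are mine and I have not tested it):

```python
from pymongo import MongoClient

# Connect to the mongos router of the sharded cluster (placeholder URI).
client = MongoClient("mongodb://localhost:27017")

# Enable sharding for the database, then shard the collection on a
# hashed customer_id key so documents spread evenly across shards.
client.admin.command("enableSharding", "customersdb")
client.admin.command(
    "shardCollection",
    "customersdb.Customers",
    key={"customer_id": "hashed"},
)
```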

Of the two solutions, which is more efficient for searching data (in terms of speed)? As I mentioned earlier, I just want to search for customers in this database.

The Error

CyberNoob
  • Either you can break data into two collections or use GridFS which divides the file into parts, or chunks, and stores each chunk as a separate document. – ROHIT KHURANA Jan 18 '22 at 05:27
  • https://jira.mongodb.org/browse/SERVER-431?focusedCommentId=22283&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22283 – ROHIT KHURANA Jan 18 '22 at 05:30
  • @ROHITKHURANA Is it possible to search for an index in the text file after it is split using GridFS? If yes, how? – CyberNoob Jan 18 '22 at 05:39
  • The 16 MiB limit applies to the document and it is fixed. Do you try to insert each entire text file as one document? For a CSV file you usually generate one document per line! What is the format of the CSV file and how do you import them? – Wernfried Domscheit Jan 18 '22 at 06:21
  • @WernfriedDomscheit "Do you try to insert each entire text file as one document?" I don't know what that means. I was importing the CSV files to the collection. The preview shows each row and column. **The CSV file I used is comma-separated with 43 columns. Some columns have big strings. Each line is imported as a record in MongoDB.** – CyberNoob Jan 18 '22 at 07:13
  • You mean a single line in your CSV file is longer than 16 MiByte? Again, how do you import the file? What does the file look like? – Wernfried Domscheit Jan 18 '22 at 07:28
  • @WernfriedDomscheit I use MongoDB compass for importing CSV file to collection. Given Below is the details of a Single CSV file that I imported to MongoDB. `**Collection Name|Documents|Avg.Document Size|Total Document Size|Num.Indexes| Total Index Size|Properties**` `**Customers|8,874,412|1.8 KB|15.7 GB|3|262.0 MB**` – CyberNoob Jan 18 '22 at 07:50
  • Where do you get this error? The average size of a document is 1.8 kiB, which is orders of magnitude below the limit of 16 MiB. Unless you let us know **how** you import the data, we cannot help you. – Wernfried Domscheit Jan 18 '22 at 08:07
  • So why not create a document for each line? Why do you need to store everything in one document? Or does one line exceed 16 MB, which seems pretty unlikely to me? – Lars Hendriks Jan 18 '22 at 08:19
  • @LarsHendriks Updated the Question with an Image. – CyberNoob Jan 18 '22 at 08:35
  • @WernfriedDomscheit Updated the Question with an Image. – CyberNoob Jan 18 '22 at 08:35
  • A screenshot is not really helpful! When you select "Stop on Error" you may get the line of the error. Looks like your file has a huge line that is longer than 16 MiB. Maybe read the CSV with a different tool and print its size/length. BTW, date/time values should **NEVER** be stored as strings; always store proper `Date` objects! – Wernfried Domscheit Jan 18 '22 at 08:52
  • You may try to import the file with [mongoimport](https://docs.mongodb.com/database-tools/mongoimport/) using the `--stopOnError` option. It should show the culprit in the error message. – Wernfried Domscheit Jan 18 '22 at 09:06
  • Yes, I agree with @WernfriedDomscheit. You can use mongoimport, or check whether any of the lines in the CSV have malformed data where you are getting errors. – ROHIT KHURANA Jan 19 '22 at 01:07
  • Do you have a Linux machine? With a single command like this you can check the size of the lines: `while read -r line; do [[ ${#line} -gt 16000000 ]] && echo $line ; done < BSNL-DEC2020-EKYC.csv` – Wernfried Domscheit Jan 19 '22 at 06:05
  • @WernfriedDomscheit I am on a Windows machine. This is the error I receive: `Failed: read error on entry #8437: line 13530, column 627: extraneous " in field` `2022-01-19T12:42:13.931+0530 8000 document(s) imported successfully. 0 document(s) failed to import.` **Is there a way to skip this error and import all documents?** – CyberNoob Jan 19 '22 at 07:14
  • What does line 13530 look like? – Wernfried Domscheit Jan 19 '22 at 07:26

1 Answer


The error message shows exactly where the problem is: `entry #8437: line 13530, column 627`.

Have a look at that line in the file and correct it there.

The error `extraneous " in field ...` is quite clear: in your CSV file you have an opening quote `"` that is not closed, i.e. the rest of the entire file is considered one single field.
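
If you cannot easily spot it by eye, a quick heuristic scan in Python could look like this (the file name is taken from the comments above; a line with an odd number of `"` characters, or one that is absurdly long, is the likely culprit):

```python
# Heuristic scan for the broken row: an unbalanced " or an oversized line.
# Note: legitimately quoted fields may span lines, so treat hits as hints.
LIMIT = 16 * 1024 * 1024  # MongoDB's 16 MiB per-document limit

with open("BSNL-DEC2020-EKYC.csv", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        quotes = line.count('"')
        if quotes % 2 != 0 or len(line) > LIMIT:
            print(f"line {lineno}: {len(line)} chars, {quotes} double quotes")
```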

Wernfried Domscheit
  • Is there a way in Python's `csv.DictReader` to remove these **"**? – CyberNoob Jan 19 '22 at 08:01
  • I would chase the person/system who generated the CSV file, which is in fact not a valid CSV file! I am not familiar with Python; have a look at https://stackoverflow.com/questions/65354842/handling-unwanted-standalone-double-quotes-in-a-csv-python – Wernfried Domscheit Jan 19 '22 at 08:12
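
Following up on the `csv.DictReader` question above, a rough clean-up pass could look like the sketch below (file names are made up, and it assumes none of the 43 columns legitimately contain commas or line breaks inside quoted fields; verify that before relying on it):

```python
import csv

# Read with QUOTE_NONE so stray " characters are treated as plain text,
# strip them from every field, and write a clean copy for mongoimport.
with open("customers_raw.csv", newline="", encoding="utf-8") as src, \
     open("customers_clean.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, quoting=csv.QUOTE_NONE)
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        writer.writerow([field.replace('"', "") for field in row])
```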