
I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of another file (call them parent files), as shown below. The paths of the parent files are unique.

%: cat input.txt -- grandparent file
/root/a/b/c/1.txt  -- parent file1
/root/a/b/c/2.txt  -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
up to 1 million files.

Each of these parent files in turn contains absolute paths of other files (call them child files) together with their line numbers, as shown below. The same child file may be present in multiple parent files, with the same or different line numbers.

%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
... 
up to thousands of files

%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
... 
up to thousands of files

Now my requirement is that, given a child file and a line number, I need to return all the parent files that contain that child file and line number, within a minute. The insertion needs to be completed within a day.

I created a relational database with following schema:

ParentChildMapping - Contains the required relation
    ID AUTOINCREMENT PRIMARY KEY
    ParentFileName TEXT
    ChildFileName TEXT
    LNumber INT

For a given file name and line number:
    SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName='s1.c' AND LNumber=1;

I divided the grandparent file into 1000 separate sets, each containing 1000 records. Then I have a Python program which parses each set, reads the content of each parent file, and inserts into the database. I can create a thousand processes running in parallel and insert all the records concurrently, but I am not sure what the impact on the relational database will be when millions of records are inserted in parallel. I am also not sure if a relational database is the right approach to choose here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite, but it did not support concurrent inserts and failed with a database lock error. Now I want to try MySQL or any other alternative solution that suits the situation.

Sample code that runs as a thousand processes in parallel to insert into MySQL:

    import MySQLdb

    # Connection details were elided in the original; host/username/password/database are placeholders.
    connection = MySQLdb.connect(host=host, user=username, passwd=password, db=database)
    cursor = connection.cursor()
    with open(some_set) as fd:
        for each_parent_file in fd:
            each_parent_file = each_parent_file.strip()   # drop the trailing newline
            with open(each_parent_file) as parent_fd:
                for each_line in parent_fd:
                    child_file_name, *line_numbers = each_line.strip().split(",")
                    insert_items = [(each_parent_file, child_file_name, int(line_num))
                                    for line_num in line_numbers]
                    # Use parameter placeholders; interpolating the list produces invalid SQL.
                    cursor.executemany(
                        "INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber)"
                        " VALUES (%s, %s, %s)",
                        insert_items)
        connection.commit()   # commit() belongs to the connection, not the cursor
    cursor.close()
    connection.close()
SH47
    Does this answer your question? [BULK INSERT in MYSQL](https://stackoverflow.com/questions/14330314/bulk-insert-in-mysql) – Dai Dec 10 '22 at 18:15
  • @Dai A simple back of the envelope of what the data looks like and what a database does shows that it will not be sufficient. – btilly Dec 11 '22 at 00:33
  • BULK INSERT does not answer the question @Dai – SH47 Dec 11 '22 at 05:46

1 Answer


Let's start with a naïve idea of what a database would need to do to organize your data.

You have a million parent files.

Each one contains thousands of child files. Let's say 10,000.

Each one contains a list of line numbers. You didn't say how many. Let's say 100.

This is 10^6 * 10^4 * 10^2 = 10^12 records. Suppose that each is 50 bytes. That is 50 terabytes of data. We need it organized somehow, so we sort it, which requires on the order of log_2(10^12), or around 40 passes over the data. This naïve approach therefore moves about 2 * 10^15 bytes. To finish in a day of 86,400 seconds, we would need to process 23 GB of data per second.
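A quick sanity check of that arithmetic in Python (the record size and the counts per level follow the assumptions stated above):

    import math

    parents = 10**6               # parent files
    children_per_parent = 10**4   # child entries per parent (assumed)
    lines_per_child = 10**2       # line numbers per child entry (assumed)
    record_bytes = 50             # assumed size of one (parent, child, line) record

    records = parents * children_per_parent * lines_per_child   # 10**12
    raw_bytes = records * record_bytes                           # ~50 TB

    passes = math.ceil(math.log2(records))                       # ~40 sort passes
    total_io = raw_bytes * passes                                 # ~2 * 10**15 bytes moved

    seconds_per_day = 86_400
    print(f"{raw_bytes / 1e12:.0f} TB raw data")
    print(f"{total_io / seconds_per_day / 1e9:.0f} GB/s needed to finish in a day")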

Your hard drive probably doesn't have 50 terabytes of space. Even if it did, it probably doesn't stream data faster than about 500 MB/second, which is 50 times too slow.

Can we improve this? Well, of course. Probably half the passes can happen strictly in memory. You can replace records with 12 byte tuples. There are various ways to compress this data. But the usual "bulk insert data, create index" approach is NOT going to give you the desired performance from a standard relational database.

Congratulations. When people talk about #bigdata, they usually have small data. But you actually have enough that it matters.

So...what can you do?

First, what can you do with out-of-the-box tools?

If one computer doesn't have the horsepower, we need something distributed: a distributed key/value store like Cassandra, plus something like Hadoop or Spark to process the data.

If we have those, all we need to do is process the files and load them into Cassandra as records of line numbers keyed by parent+child file. We then run a map-reduce to find, for each child file and line number, which parent files contain it, and store that back into Cassandra. We then get answers by querying Cassandra.
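A rough sketch of that pipeline in PySpark (the HDFS path, keyspace, and table names are placeholders I made up; writing to Cassandra assumes the spark-cassandra-connector package is available, and the parent files must live on storage that every executor can read):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("invert-parent-child").getOrCreate()
    sc = spark.sparkContext

    def explode_parent(parent_path):
        # Emit ((child_file, line_number), parent_path) for every line number in one parent file.
        with open(parent_path) as fh:
            for line in fh:
                child, *numbers = line.strip().split(",")
                for n in numbers:
                    yield (child, int(n)), parent_path

    parent_paths = sc.textFile("hdfs:///data/input.txt").map(str.strip)
    inverted = (parent_paths
                .flatMap(explode_parent)
                .groupByKey()
                .map(lambda kv: (kv[0][0], kv[0][1], list(kv[1]))))

    df = inverted.toDF(["child_file", "line_number", "parent_files"])

    # Store the inverted index; a query then becomes a single read:
    # SELECT parent_files FROM child_line_to_parents WHERE child_file=? AND line_number=?
    (df.write.format("org.apache.spark.sql.cassandra")
       .options(keyspace="code_index", table="child_line_to_parents")
       .mode("append")
       .save())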

BUT keep in mind the back of the envelope about the amount of data and processing required. This approach allows us, with some overhead, to do all of that work and store all of that data in a distributed way within a fixed amount of time. However, you will also need that many machines to do it on. You can easily rent them from AWS, but you'll wind up paying for them as well.

OK, suppose you're willing to build a custom solution. Can you do something more efficient, and maybe run it on one machine? After all, your original data set fits on one machine, right?

Yes, but it will also take some development.

First, let's make the data more efficient. An obvious step is to create lookup tables from file names to indexes. You already have the parent files in a list; this just requires inserting a million records into something like RocksDB for the forward lookup, and the same for the reverse. You can also generate a list of all child filenames (with repetition), then use a Unix sort -u to get the canonical ones. Do the same with those and you get a similar child file lookup.
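A minimal sketch of those lookup tables in plain Python, using in-memory dicts as a stand-in for a persistent store such as RocksDB (the file names come from the question; nothing here is tuned for a million entries):

    # Forward (name -> id) and reverse (id -> name) lookups for parent files.
    parent_to_id, id_to_parent = {}, {}
    with open("input.txt") as fh:                 # the grandparent file
        for i, line in enumerate(fh):
            path = line.strip()
            parent_to_id[path] = i
            id_to_parent[i] = path

    # Child file names: collect them (with repetition), then deduplicate.
    # The sort -u suggested above does the same job on disk.
    child_names = set()
    for path in parent_to_id:
        with open(path) as parent_fh:
            for record in parent_fh:
                child_names.add(record.split(",", 1)[0])

    child_to_id = {name: i for i, name in enumerate(sorted(child_names))}
    id_to_child = {i: name for name, i in child_to_id.items()}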

Next, the reason why we were generating so much data before is that we were taking a line like:

s1.c,1,2,3,4,5

and were turning it into:

s1.c,1,/root/a/b/c/1.txt
s1.c,2,/root/a/b/c/1.txt
s1.c,3,/root/a/b/c/1.txt
s1.c,4,/root/a/b/c/1.txt
s1.c,5,/root/a/b/c/1.txt

But if we turn s1.c into a number like 42, and /root/a/b/c/1.txt into 1, then we can encode it as something like:

42,1,1,5

Meaning that child file 42 appears in parent file 1 from line 1 through line 5. If we use, say, 4 bytes for each field then this is a 16 byte block. And we generate just a few per line, let's say an average of 2. (A lot of lines will produce one block, others may produce several.) So our whole data set is 20 billion 16 byte rows, or 320 GB of data. Sorting this takes 34 passes, most of which don't need to be written to disk, which can easily be done within a day on a single computer. (What you do is sort 1.6 GB blocks in memory, write them back to disk, then get the final result in 8 merge passes.)
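A sketch of that encoding step, using Python's struct module and the hypothetical child_to_id map from the lookup-table sketch above; runs of consecutive line numbers collapse into one 16-byte block each:

    import struct

    BLOCK = struct.Struct("<IIII")   # child_id, parent_id, first_line, last_line (4 bytes each)

    def line_to_blocks(line, parent_id, child_to_id):
        """Turn 's1.c,1,2,3,4,5' into packed blocks of consecutive line-number runs."""
        child, *numbers = line.strip().split(",")
        if not numbers:
            return []
        numbers = [int(n) for n in numbers]
        child_id = child_to_id[child]
        blocks = []
        start = prev = numbers[0]
        for n in numbers[1:]:
            if n == prev + 1:            # still inside a consecutive run
                prev = n
                continue
            blocks.append(BLOCK.pack(child_id, parent_id, start, prev))
            start = prev = n
        blocks.append(BLOCK.pack(child_id, parent_id, start, prev))
        return blocks

    # "s1.c,1,2,3,4,5" with child id 42 in parent id 1 -> one block covering lines 1..5
    # line_to_blocks("s1.c,1,2,3,4,5", 1, {"s1.c": 42})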

And once you have that sorted file, you can NOW just write out the offsets of where each child file's records begin.

If each child file is in thousands of parent files, then answering a query is a matter of a lookup from filename to child file ID, then a lookup from child file ID to the range of the sorted file that lists that child. Go through those thousands of records and form the list of parent files whose range contains the requested line number. Then look up their names and return the result. This lookup should run in seconds, and (since everything is read-only) can run in parallel with other lookups.
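A sketch of that read path, assuming the sorted block file from the encoding sketch plus a small child_offsets map (child_id -> byte offset and record count) built once while writing it; all names here are hypothetical:

    import struct

    BLOCK = struct.Struct("<IIII")   # child_id, parent_id, first_line, last_line

    def parents_for(child_name, line_number, child_to_id, id_to_parent,
                    child_offsets, blocks_path="blocks.sorted"):
        """Return every parent file whose range for this child covers line_number."""
        child_id = child_to_id[child_name]
        offset, count = child_offsets[child_id]
        hits = []
        with open(blocks_path, "rb") as fh:
            fh.seek(offset)
            for _ in range(count):
                _, parent_id, first, last = BLOCK.unpack(fh.read(BLOCK.size))
                if first <= line_number <= last:
                    hits.append(id_to_parent[parent_id])
        return hits

    # Everything here is read-only, so many such lookups can run concurrently.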

BUT this is a substantial amount of software to write. It is how I would go. But if the system only needs to be used a few times, or if you have additional needs, the naïve distributed solution may well be cost effective.

btilly
  • Thank You Very Much for the inputs. I was able to reduce the size by 70% by converting file names to numbers. Also storing the line numbers as groups sounds interesting. I would try that too. Distributed solution looks promising and would explore that as well. Thank You. @btilly – SH47 Dec 11 '22 at 05:23