I have a text file (call it the grandparent file) which contains 1 million lines. Each of these lines contains the absolute path of some other file (call them parent files), as shown below. The paths of the parent files are unique.
%: cat input.txt -- grandparent file
/root/a/b/c/1.txt -- parent file1
/root/a/b/c/2.txt -- parent file2 ......
...
/root/a/b/d/3.txt
......
.....
up to 1 million files.
Again, each of the above parent files contains absolute paths of different files (call them child files) and their line numbers, as shown below. The same child file may be present in multiple parent files, with the same or different line numbers.
%: cat /root/a/b/c/1.txt -- parent file
s1.c,1,2,3,4,5 -- child file and its line numbers
s2.c,1,2,3,4,5....
...
up to thousands of files
%: cat /root/a/b/c/2.txt
s1.c,3,4,5
s2.c,1,2,3,4,5....
...
up to thousands of files
Now my requirement is that, given a child file and a line number, I need to return all the parent files that contain that child file and line number, within a minute. The insertion needs to be completed within a day.
I created a relational database with following schema:
ParentChildMapping - Contains the required relation
ID AUTOINCREMENT PRIMARY KEY
ParentFileName TEXT
ChildFileName TEXT
LNumber INT
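In MySQL syntax that would be roughly the following (the secondary index on the lookup columns and the 255-character prefix length are my own assumptions, added so the query below does not scan the whole table):

CREATE TABLE ParentChildMapping (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    ParentFileName TEXT,
    ChildFileName TEXT,
    LNumber INT,
    INDEX idx_child_line (ChildFileName(255), LNumber)
);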
For a given child file name and line number:
SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName='s1.c' AND LNumber=1;
I divided the grandparent file into 1000 separate sets, each containing 1000 records. I then have a Python program which parses each set, reads the content of each parent file, and inserts the rows into the database. I can run a thousand such processes in parallel and insert all the records concurrently, but I am not sure what the impact on the relational database will be when millions of records are inserted in parallel. I am also not sure whether a relational database is the right approach to choose here. Could you please let me know if there is any tool or technology that better suits this problem? I started with SQLite, but it does not support concurrent inserts and failed with a "database is locked" error. Now I want to try MySQL or any other alternative solution that suits the situation.
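For reference, the splitting step itself is straightforward; roughly something like this (the set file names set_0000.txt and so on are just placeholders):

from itertools import islice

with open("input.txt") as grand_fd:
    for set_index in range(1000):
        chunk = list(islice(grand_fd, 1000))  # next 1000 parent file paths
        if not chunk:
            break
        with open("set_%04d.txt" % set_index, "w") as set_fd:
            set_fd.writelines(chunk)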
Sample code that runs as a thousand processes in parallel to insert into MySQL:
import MySQLdb

connection = MySQLdb.connect(host, username, ...)
cursor = connection.cursor()
with open(some_set) as fd:
    for each_parent_file in fd:
        each_parent_file = each_parent_file.strip()  # drop the trailing newline
        with open(each_parent_file) as parent_fd:
            for each_line in parent_fd:
                child_file_name, *line_numbers = each_line.strip().split(",")
                insert_items = [(each_parent_file, child_file_name, line_num) for line_num in line_numbers]
                # parameterized executemany instead of string formatting
                cursor.executemany(
                    "INSERT INTO ParentChildMapping (ParentFileName, ChildFileName, LNumber) VALUES (%s, %s, %s)",
                    insert_items)
connection.commit()  # commit() is a connection method, not a cursor method
cursor.close()
connection.close()
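For completeness, the lookup side would be something along these lines (a sketch; the connection parameters and the example child file/line number are placeholders):

import MySQLdb

connection = MySQLdb.connect(host="localhost", user="user", passwd="password", db="mappings")
cursor = connection.cursor()
# parameterized query; same statement as the SELECT shown above
cursor.execute(
    "SELECT ParentFileName FROM ParentChildMapping WHERE ChildFileName = %s AND LNumber = %s",
    ("s1.c", 1))
parent_files = [row[0] for row in cursor.fetchall()]
cursor.close()
connection.close()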