I have Python code that reads a very large file, fetches data from another file, and writes the matched and unmatched values into a new file.

Say, for example:

file 1: 
ab
bc
cd
gh

file 2:
ab t1 catch1
ab t1 catch2
bc t1 catch1
bc t2 catch3
bc t1 catch4
ef t7 catch1

output:
ab catch1 
   catch2
bc catch1
   catch3
   catch4
cd
gh

My Code:
    with open("list_with-detail.ids") as f:
      for line in f:
        if id in line:
          do printing

I am dealing with a very large file, ~10 GB, and fetching the relevant data for each id takes minutes. The list of ids to be fetched is also large, ~20 MB.
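To make the structure explicit, the overall loop is roughly this (the id-list file name below is just a placeholder):

    # Rough sketch of my current approach; "ids.txt" stands in for my real id-list file.
    # For every id, the whole ~10 GB detail file is re-scanned from the start.
    with open("ids.txt") as ids_file:
        for id_line in ids_file:
            id = id_line.strip()
            if not id:
                continue
            with open("list_with-detail.ids") as f:
                for line in f:
                    if id in line:
                        print(line.strip())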

I want to know a better/faster way to deal with this issue.

onkar
  • Possible duplicate of [most efficient way to find partial string matches in large file of strings (python)](https://stackoverflow.com/questions/4839597/most-efficient-way-to-find-partial-string-matches-in-large-file-of-strings-pyth) – tripleee Jan 18 '18 at 07:45
  • This is much too vague to be properly useful, and already has a large number of duplicates. What makes sense depends on how many times you need to perform this and how exactly the data is structured. If one of the sets is fixed over multiple runs then indexing the data (use a database) is probably the way to go. – tripleee Jan 18 '18 at 07:47
  • If your question is about maintaining Python 2.x code then the [tag:python-2.7] tag is adequate, but then you should not also tag as [tag:python-3.x], and vice versa. – tripleee Jan 18 '18 at 07:48
  • Actually, what I provided is a simplified version of my files, just because I want to know the most efficient way. – onkar Jan 25 '18 at 06:33
  • In the real case I have to fetch an id from the first file -> first column, then search for these ids in the second file. The second file contains the fetched ids in its last column, and I have to get the new id from that file, which is located in the first column when split on spaces. For each id from the first file there are multiple new ids. These new ids then have to be matched in a third file; for each new id there are a number of details in the second column. So I have to fetch all these details, make them unique, and output them against the initial ids. – onkar Jan 25 '18 at 06:44
  • i.e. in the output the first column will indicate the id and the second column its description, taken from the third file. My first file has ~110,000 ids, the second file has ~80,000 lines of relevant information, and the third file is ~10 GB in size. – onkar Jan 25 '18 at 06:44
  • This is the reason I asked this question. So if you have any suggestions, in either Perl, Python or shell script, kindly let me know. – onkar Jan 25 '18 at 06:46
  • The first two files can be simplified by extracting only the required content, i.e. the id from the first file, and the id plus its corresponding second id from the second file. But doing this in my code did not make any significant difference in time. The section that takes the most time is fetching data from the third file, and that is already in its simplified form, i.e. the first column contains the id and the second column the corresponding details. – onkar Jan 25 '18 at 06:51
  • Still sounds like you should be using a database, but you still have not explained how often the data changes. If it's a static set then definitely a database. If it changes all the time, maybe look for another way to avoid having this in files. Still not clear. – tripleee Jan 25 '18 at 07:11
  • The data changes with every analysis. The first two files come from some other analysis, and the third file stays the same; I have to match the ids from the first analysis to the ids from the second analysis and then collect information from the third file, which is always the same for all analyses. – onkar Jan 29 '18 at 09:21

1 Answer


Perhaps not the most efficient, but here's a straightforward pure-Python example. It uses a Python dict to first index the contents of the data file. The index can then be used to quickly locate and read the records in a random-access manner, driven by the first file.

Note that a more robust solution might be to load the data into a proper database, e.g. sqlite3.
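For illustration, a minimal sqlite3 sketch along those lines might look like this (database name, paths and column names are just placeholders):

import sqlite3

# Sketch only: load the data file into an indexed SQLite table once,
# then look up each key from the first file. Paths below are placeholders.
con = sqlite3.connect('file2_index.db')
con.execute('CREATE TABLE IF NOT EXISTS data (k TEXT, line TEXT)')
con.execute('CREATE INDEX IF NOT EXISTS data_k ON data (k)')

with open('/file2/path') as f2:
    con.executemany('INSERT INTO data VALUES (?, ?)',
                    ((l.split()[0], l.rstrip('\n')) for l in f2 if l.strip()))
con.commit()

with open('/file1/path') as f1:
    for l in f1:
        k = l.split()[0]
        print(k)
        for (row,) in con.execute('SELECT line FROM data WHERE k = ?', (k,)):
            print('\t', row)

The dict-based, file-position index version: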

from collections import defaultdict

# Use a default dict to store a list of file positions found for each key
idx = defaultdict(list)

# Index the contents of the second file
file2 = open('/file2/path')
while True:
    # Get the current file position before reading the next line
    loc = file2.tell()
    l = file2.readline()
    if not l:
        break
    k = l.split()[0]
    # Store a list of file positions for each key
    idx[k].append(loc)

# The idx object could now be serialized to disk for later access.
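# For example (optional), it could be saved with pickle and reloaded in later runs:
#   import pickle
#   with open('file2.idx', 'wb') as fh:
#       pickle.dump(dict(idx), fh)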

# Look up and read the matching second-file records for each key in the first file
file1 = open('/file1/path')
for l in file1:
    k = l.split()[0]
    locs = idx.get(k, [])
    print(k)
    for loc in locs:
        # Jump to the indexed file position and read the line
        file2.seek(loc)
        row = file2.readline()
        print('\t', row.strip())

Output:

ab
     ab t1 catch1
     ab t1 catch2
bc
     bc t1 catch1
     bc t2 catch3
     bc t1 catch4
cd
gh
tharen
  • Thanks for your code. I am working on modifying it according to my requirements. Hopefully this takes less time. – onkar Jan 25 '18 at 06:25
  • I had written code in Python and shell; the shell script completed in 7 days, and the Python script had not finished even 1/10 of the file in that time. – onkar Jan 25 '18 at 06:29
  • I also tried loading all the files into memory and then processing them, but that way it takes even longer; I don't know why. – onkar Jan 25 '18 at 06:31
  • It's hard to say without more info. I'd start by verifying that the indexing loop works. Use a small test file to do this. – tharen Jan 25 '18 at 06:35
  • Yes, I tested this and it worked. But as I explained in the comment above, I will have to see how much time it saves when I use it with the actual data. – onkar Jan 25 '18 at 06:57