
Below is the sample data:

1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103
2 ,SYED MUSTZAR ALI MUHAMMAD ILYAS SHAH,MUHAMMAD SAFEER SO SAGHEER KHAN,20170127
3 ,AHSUN SABIR SABIR ALI,MISBAH NAVEED DO NAVEED ANJUM,20170116
4 ,RASHAD IQBAL PARVAIZ IQBAL,PERVAIZ IQBAL SO GUL HUSSAIN KHAN,20170104
5 ,RASHID ALI MUGHERI ABDUL RASOOL MUGHERI,MUMTAZ ALI BOHIO,20170105
6 ,FAKHAR IMAM AHMAD ALI,MOHAMMAD AKHLAQ ASHIQ HUSSAIN,20170105
7 ,AQEEL SARWAR MUHAMMAD SARWAR BUTT,BUSHRA WAHID,20170106
8 ,SHAFAQAT ALI REHMAT ALI,SAJIDA BIBI WO MUHAMMAD ASHRAF,20170106
9 ,MUHAMMAD ISMAIL SHAFQAT HUSSAIN,USAMA IQBAL,20170103
10 ,SULEMAN ALI KACHI GHULAM ALI,MUHAMMAD SHARIF ALLAH WARAYO,20170109

The 1st column is the serial #, the 2nd is the sender, the 3rd is the receiver and the 4th is the date. The data goes on like this for about a million rows.

Now, I want to find the cases where the same sender sends a parcel to the same receiver on the same date.

I wrote the following basic code for this, but it's very slow.

import csv
from fuzzywuzzy import fuzz



serial = []
agency = []
rem_name = []
rem_name2 = []
date = []

with open('janCSV.csv') as f:
    reader = csv.reader(f)

    for row in reader:
        serial.append(row[0])
        rem_name.append(row[2])
        rem_name2.append(row[2])
        date.append(row[4])


with open('output.csv', 'w') as out:

    for rem1 in rem_name:

        date1 = date[rem_name.index(rem1)]
        serial1 = serial[rem_name.index(rem1)]

        for rem2 in rem_name2:

            date2 = date[rem_name2.index(rem2)]

            if date1 == date2:
                ratio = fuzz.ratio(rem1, rem2)

                if ratio >= 90 and ratio < 100:
                    print serial1, rem1, rem2, date1, date2, ratio
                    out.write(str(serial1) + ',' + str(date1) + ',' + str(date2) + ',' + str(rem1) + ',' + str(rem2) + ','
                              + str(ratio) + '\n')
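
One way the nested loops above might be cut down is to bucket the rows by date first, so names are only ever compared with other names from the same date. The sketch below is written for Python 3 and is not the asker's code. The function names `similarity` and `near_duplicates` are hypothetical, the sample rows are made up in the four-column layout shown above, and `difflib.SequenceMatcher` from the standard library stands in for `fuzz.ratio` as an assumption, so the scores may not be identical to fuzzywuzzy's. It also walks the rows directly instead of calling `list.index`, which always returns the first occurrence of a name.

```python
import csv
import io
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for fuzz.ratio (an assumption): similarity score from 0 to 100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def near_duplicates(rows, name_col, date_col, low=90, high=100):
    """Yield (serial_a, serial_b, name_a, name_b, date, score) for rows that
    share a date and whose names score in the range [low, high)."""
    by_date = defaultdict(list)
    for row in rows:
        by_date[row[date_col]].append(row)

    for date, bucket in by_date.items():
        # Each unordered pair is scored once, and only within a single date.
        for a, b in combinations(bucket, 2):
            score = similarity(a[name_col], b[name_col])
            if low <= score < high:
                yield a[0], b[0], a[name_col], b[name_col], date, score


# Hypothetical sample in the four-column layout shown in the question.
sample = io.StringIO(
    "1 ,ASIF JAVED IQBAL JAVED,JAVED IQBAL SO INAYATHULLAH,20170103\n"
    "2 ,ASIF JAVED IQBAL JAVAD,USAMA IQBAL,20170103\n"
    "3 ,ASIF JAVED IQBAL JAVAD,USAMA IQBAL,20170104\n"
)
rows = list(csv.reader(sample))
matches = list(near_duplicates(rows, name_col=1, date_col=3))
for match in matches:
    print(match)
```

The work inside each date bucket is still quadratic, so this only helps to the extent that the rows are spread over many dates. Unlike the original loops, each pair is reported once rather than in both orders.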
  • What is your question? Do you want to speed this up? Would sorting the input text file be advantageous in terms of finding to/from duplicates? – halfer Mar 24 '17 at 22:43
  • Yes, I want to speed up the process. I think the nested for loops are making it slow. Or is there any other way I can rewrite the code better? Sorting the input file doesn't help – Salman Akhtar Mar 25 '17 at 07:49
  • I can't see how sorting would not help. You would be able to scan the sorted version much more quickly, since all runs with the same sender and receiver would be grouped together, and can be shown with one pass through the file (plus the cost of the initial sort). – halfer Mar 25 '17 at 09:08
  • I tried running it on the sorted file, but it takes the same amount of time because the logic remains the same. What the code does is take one name and check its fuzz.ratio against all the other names where date1 == date2. – Salman Akhtar Mar 25 '17 at 12:05
  • Where does the data originate from? Is it possible to store it in a pre-sorted fashion, so that if rows are inserted they are placed in the correct sort order to start with? – halfer Mar 25 '17 at 17:20
  • The data is in CSV. So, yes, it can be sorted in Excel before giving it to Python – Salman Akhtar Mar 26 '17 at 07:34
  • Yes, it is a CSV file, I can see that. How is that written to, and could the writer pre-sort it? I am not sure Excel is the best tool for sorting though, since these sound like large files. – halfer Mar 26 '17 at 08:35
  • Yes, the files are large, so sometimes Excel crashes while sorting. No, the writer does not pre-sort it. Can I use Pandas or some other module to sort the file first? – Salman Akhtar Mar 26 '17 at 11:56
  • I think I might need to clarify what I mean by "pre-sort". I do not mean sorting it in one operation prior to searching for duplicates. I mean it should be sorted once, and then whenever new rows are added, they are inserted into the correct sort position immediately. I can't tell you if that would be suitable for your needs (since it will slow down your write speeds) but it would eliminate the need to do a fresh whole-file sort, so it is worth looking at. – halfer Mar 26 '17 at 13:49
  • Yes, don't use Excel for sorting - either use Python or unix command line tools. – halfer Mar 26 '17 at 13:49
  • Oh. No, I don't have the data pre-sorted. :-( – Salman Akhtar Mar 26 '17 at 17:44
  • No you don't, that is very clear. Would it be possible to sort as you go? What process is writing this file? – halfer Mar 26 '17 at 17:46
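
The suggestion in the comments, that sorting groups all rows with the same sender, receiver and date together so they can be found in one pass, could be sketched roughly as below. This is only an illustration under stated assumptions: it uses the four-column layout from the sample data, the function name `exact_repeats` and the sample rows are hypothetical, and it matches names exactly, so it would not catch the near-miss spellings that the `fuzz.ratio` check in the question is looking for.

```python
import csv
import io
from itertools import groupby


def exact_repeats(rows, sender_col=1, receiver_col=2, date_col=3):
    """Return {(sender, receiver, date): [serials]} for keys seen more than once."""
    def key(row):
        return (row[sender_col].strip(),
                row[receiver_col].strip(),
                row[date_col].strip())

    repeats = {}
    # After sorting, rows with identical keys are adjacent,
    # so a single pass with groupby finds every run.
    for k, run in groupby(sorted(rows, key=key), key=key):
        serials = [row[0].strip() for row in run]
        if len(serials) > 1:
            repeats[k] = serials
    return repeats


# Hypothetical sample in the four-column layout shown in the question.
sample = io.StringIO(
    "1 ,RASHID ALI,MUMTAZ ALI BOHIO,20170105\n"
    "2 ,FAKHAR IMAM,MOHAMMAD AKHLAQ,20170105\n"
    "3 ,RASHID ALI,MUMTAZ ALI BOHIO,20170105\n"
    "4 ,RASHID ALI,MUMTAZ ALI BOHIO,20170106\n"
)
repeats = exact_repeats(list(csv.reader(sample)))
print(repeats)
```

The cost here is dominated by the sort, which is why a sorted input can be scanned much faster than comparing every row with every other row. Fuzzy matching would still need a separate comparison step within each group of candidate rows.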

0 Answers