Find the similar columns in two files

Question

I am using python and I want to write a code, which will compare only two first columns (0:1) and if the other file has the same values in the column 0 and 1, then the line should be merged and written into one new file.

for example:

Or example with persons:

the code which I have now is this, but it is not good

f1 = open('f1.txt','r')
f2 = open('f2.txt','r')
f3 = open('f12.txt','w')


f1_readlines = f1.readlines()
f1_linenum = len(f1_readlines)


f2_readlines = f2.readlines()
f2_linenum = len(f2_readlines)

i=0
while(i<f1_linenum):
    j=0
    while(j<f2_linenum):
        if(f1_readlines[0:1] == f2_readlines[0:1]):
            print(f1_readlines[i])
            f12.write(f1_readlines[i])
        j = j + 1
    i = i + 1

f1.close()
f2.close()
f12.close()

Are the files sorted in any sense? (Your example file `f2` does not appear to be sorted.) Does the ordering of the files (input or output) matter at all? It is much easier to compare lines when the files are sorted by the items you are comparing. — Rory Daulton, Mar 31 '18 at 11:43
yes the files are sorted with some sense, but line number are different. the output should be f1 + f2 (without first two columns).I have to compare the first line of f2 with all lines in f1, then the second, the third and so on... — November, Mar 31 '18 at 11:49
Are the files in a specic format i.e. CSV or CTV or fixed length columns? — Rehan Azher, Mar 31 '18 at 13:15
@RehanAzher: The second example shows that the columns are not fixed length. My answer works with either space-separated or tab-separated files, since Python's `split` method handles them the same. — Rory Daulton, Mar 31 '18 at 13:18
@RoryDaulton I haven't challenged any solution provided yes those will work. What would be the performance of this with files of 1M records each? — Rehan Azher, Mar 31 '18 at 13:21
@RehanAzher: The only way to be sure is to run both answers on such large files. I believe that my answer would be much faster, since dede's answer compares all lines to each other, meaning 1M times 1M = 1 trillion comparisons, while I have only 1M comparisons, each using a dictionary lookup which happens in near-constant time. — Rory Daulton, Mar 31 '18 at 13:33
@RoryDaulton, Sir yes your answer will be surely faster, I was just thinking for a different approach how about we use some library like Pandas to read files and then use something like groupby on multiple columns. — Rehan Azher, Mar 31 '18 at 13:36
@Rory: correct! My code is easy to understand....its not good for a script that runs every day on huge files. — dede, Mar 31 '18 at 13:38
@dede: You are correct, of course. I modified my answer to say that your code is simpler than mine. I apologize for not saying that earlier--I suppose that I thought it was obvious, but I still should have said it. — Rory Daulton, Mar 31 '18 at 13:51

dede · Answer 1 · 2018-03-31T12:35:08.280

1

What is your problem with the code?

Okay...using a while-loop is not the best. You can do it like this:

f1 = open('f1.txt','r')
f1_readlines = f1.readlines()
for ln in f1_readlines:
    print ln,

(EDIT) This code:

f1_readlines=[[10, 10, 10], [20, 20, 20], [30, 30, 30]]
f2_readlines=[[20, 20, 25], [30, 30, 11], [25, 25, 55], [44, 44, 58]]
f12=[]

for f1e in f1_readlines:
  for f2e in f2_readlines:
    if f1e[0]==f2e[0] and f1e[1]==f2e[1]:
      f1e.append(f2e[2])
      f12.append(f1e)
      break

for e in f12:
  print e

gives:

[20, 20, 20, 25]
[30, 30, 30, 11]

edited Mar 31 '18 at 12:35

answered Mar 31 '18 at 11:36

dede

706
9
19

My code in not comparing f2 line with all f1 lines, I think, it just compares line by line – November Mar 31 '18 at 12:03
Sorry, I am new in programming. What is f1e, f2e? – November Mar 31 '18 at 13:15
nothing special....simple loop-variables. Insert a print-statement between the for-loops to see, what value they get in each iteration. – dede Mar 31 '18 at 13:18
they are used in the loops only. They get a new value in each iteration. See: https://stackoverflow.com/q/4170656/3588613 – dede Mar 31 '18 at 15:00

Rory Daulton · Accepted Answer · 2018-03-31T15:05:49.600

Here is an algorithm to do what you want with two unsorted files. This algorithm has the advantage of needing only one of the files in memory and has a time complexity that is linear in the sum of the lengths of the input files. In other words, this algorithm uses small amounts of memory and of time--many other algorithms will take more space or time. (The answer by @dede seems to work well and is simpler than mine but it uses both more memory and more time, which will be very noticeable for large input files.)

First, read through the first file, line by line, and build a dictionary from it. Each key is a tuple of the first two items in a line, and the corresponding value is the rest of the line. In your second example the resulting dictionary would be

{('Brad', 'Pitt'): 'cherries', ('Angelina', 'Jolie'): 'bred', ('Jack', 'Nicholson'): 'apples', ('Nicole', 'Kidman'): 'cucumber'}

You then create an empty output file and read through the second input file line by line. For each line, you see if the first two items are in the dictionary. If so, print your desired line to the output file. If not, do nothing with that line.

Your main use of memory is then the dictionary. You went through each input file only once and line by line, so it is fast. The main possible weakness of this approach is that the output file will be in the order that the items were in the second input file, which is the order in your second example. If you desire the order of the first input file instead, just swap the usage of the two input files.

Here is my code from that algorithm. This version assumes that each input line has exactly three items separated by spaces or tabs. If the "third item" in a line could include a space or a tab, the code would need to be complicated a little. Using your example input files, the results from this code are just what you wanted.

def similar_columns(filenameinput1, filenameinput2, filename_output):
    """Find the similar columns in two files.
    This assumes each line has exactly three items.
    """
    # Build a dictionary of the items in the first input file
    items_dict = {}
    with open(filenameinput1, 'r') as inputfile1:
        for line in inputfile1:
            col0, col1, oldcolrest = line.split()
            items_dict[(col0, col1)] = oldcolrest

    # Compare the items in the second input file, saving matches
    with open(filenameinput2, 'r') as inputfile2,  \
         open(filename_output, 'w') as outputfile:
        for line in inputfile2:
            col0, col1, newcolrest = line.split()
            oldcolrest = items_dict.get((col0, col1), None)
            if oldcolrest is not None:
                outputfile.write('{} {} {} {}\n'.format(
                        col0, col1, oldcolrest, newcolrest))

similar_columns('f1.txt', 'f2.txt', 'f12.txt')
similar_columns('shop1.txt', 'shop2.txt', 'total.txt')

jklasdjkl · Answer 3 · 2018-03-31T15:19:19.450

I have tried to solve it in a naive way

f1 = open('f1.txt','r')
f2 = open('f2.txt','r')
f3 = open('fusion.txt','w')

# read f1 lines from file
f1_readlines = f1.readlines()
# get f1 length
f1_linenum = len(f1_readlines)

# read f2 lines from file
f2_readlines = f2.readlines()
# get f2 length
f2_linenum = len(f2_readlines)

for f1_line in f1_readlines:
    arr1 = f1_line.split(" ")
    c11 = arr1[0]
    c12 = arr1[1]

    for f2_line in f2_readlines:
        arr2 = f2_line.split(" ")
        c21 = arr2[0]
        c22 = arr2[1]
        if((c11 == c21) & (c12 == c22)):
            output = [c11,c12]

            for c in arr1[2:]:
                s = c.strip("\n")
                output.append(s)

            for c in arr2[2:]:
                s = c.strip("\n")
                output.append(s)

            for num in output[0:len(output)-1]:
                f3.write(num)
                f3.write(" ")
            f3.write(output[len(output)-1])
            f3.write("\n")

f1.close()
f2.close()
f3.close()

f1

f2

10 10 55
77 77 77
20 22 20
11 13 11
2 23 23
56 66 55 99
78 56 56

fusion

10 10 25 55
56 66 55 55 99
78 56 56 7 56
77 77 77 77

Hope this can solve it :)

This seems to be the same algorithm as dede's answer. How is this answer better? — Rory Daulton, Mar 31 '18 at 16:13

Find the similar columns in two files

3 Answers3