0

I have 3 files. The first line of each file gives the number of rows it contains, and each following line holds four space-separated coordinates. I want to get all the rows that are common to those files. For example:

file1.txt:

5    
820.3  262.48  637.815  232.503  
657.666  773.366  466.608  754.035  
341.845  245.408  163.417  212.897  
667.378  687.189  474.277  666.181  
518.451  899.594  343.431  881.08  

file2.txt

3  
1.52 6.878 9.5485  
341.845  245.408  163.417  212.897  
667.378  687.189  474.277  666.181  

file3.txt

4  
657.666  773.366  466.608  754.035  
341.845  245.408  163.417  212.897  
667.378  687.189  474.277  666.181  
518.451  899.594  343.431  881.08    

My output file res.txt should be:

res.txt

2  
341.845  245.408  163.417  212.897  
667.378  687.189  474.277  666.181    

Here we have 2 common rows, so 2 should be printed on the first line. How can I scale this to multiple files?

I have tried writing a Python script that handles two files, but I don't think it is very efficient. The code I tried is:

import numpy as np

l1 = []
l2 = []

with open('matchings1_2.txt', 'r') as f1:
    for line in f1:
        line = line.split()
        l1.append(line)

with open('matchings2_3.txt', 'r') as f2:
    for line in f2:
        line = line.split()
        l2.append(line)


l1 = np.array(l1[1:]).astype(float)
l2 = np.array(l2[1:]).astype(float)
l = []

for r in l1:
    if r in l2:
        l.append(list(r))

l.insert(0, [len(l)])

with open('Result.txt', 'w') as f:
    for item in l:
        s = ""
        for i in range(len(item)):
            if (i != len(item) - 1):
                s += str(item[i]) + " "
            else:
                s += str(item[i])
        f.write("%s\n" % s)
Rocket Nikita
Akash Tadwai
  • What you are looking for is OOP: object oriented programming. – Kraay89 Oct 29 '20 at 09:35
  • Actually, part of my work needs a function like this. Rather than always checking two files at a time, I was hoping there would be a cleaner way. I have all the required files in a directory and I want all the lines common to all the files. – Akash Tadwai Oct 29 '20 at 09:39
  • @AkashTadwai maybe my answer can help you a bit. Please let me know if you have any queries. – Aryman Deshwal Oct 29 '20 at 10:07

3 Answers

1

I have written a shorter version, and hopefully it's not too complex, although I think I may have overdone it.

rows = []  # one set of rows per file
file_names = ["f1", "f2", "f3"]  # names of the text files, without the extension
for file in file_names:
    with open(file + ".txt", "r") as f1:
        next(f1, None)  # skip the first line (the row count), whatever its length
        temp = []  # rows of this file
        for i in f1:
            s = i.rstrip()  # remove the trailing newline and trailing spaces
            if s:  # ignore blank lines
                temp.append(s)
    rows.append(set(temp))  # store as a set so that we can use intersection

final_rows = rows[0]  # start with the rows of the first file
for i in range(1, len(rows)):
    final_rows = final_rows.intersection(rows[i])  # repeated intersection

with open("res.txt", "w") as f1:
    f1.write(str(len(final_rows)) + "\n")  # number of common rows
    for i in final_rows:
        f1.write(i + "\n")  # the common rows

In case all your files are in the same directory and have the same format, you can make a few changes:

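import os
# os.listdir() lists the current directory, which works if this Python file and the
# text files are in the same place; use os.listdir("xyz/abc") if they are elsewhere
# (and then open os.path.join("xyz/abc", file) instead of file)
file_names = [name for name in os.listdir() if name.endswith(".txt") and name != "res.txt"]
for file in file_names:
    f1 = open(file, "r")  # use file instead of file + ".txt"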
  • Is there any way to ignore the spacing? I mean, you are treating 1 2 3 4 and 1    2  3  4 as different, but both lines are actually the same. – Akash Tadwai Oct 29 '20 at 11:37
  • I am sorry, I don't understand. Could you please elaborate? – Aryman Deshwal Oct 29 '20 at 19:00
  • I mean that we are storing the whole line in the list, and a line contains 4 coordinates as I said before, so all four coordinates (including the spaces between them) are stored as one element of the list. So 1 2 3 4 and $1\qquad 2 \quad 3 \quad 4$ are treated as different, but they should be the same, right? – Akash Tadwai Oct 30 '20 at 08:22
  • You can separate them using the split() function, e.g. x = "123 123 123 123 123".split() gives ["123", "123", "123", "123", "123"]. With no argument, split() splits on any run of whitespace, whereas split(" ") splits on every single space. – Aryman Deshwal Oct 31 '20 at 11:21
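To make the point from the comments concrete, here is a minimal sketch of the difference between `split(" ")` and `split()`, assuming the rows differ only in spacing. The helper name `normalize` is hypothetical and not part of the answer above.

```python
def normalize(row):
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    return " ".join(row.split())


a = "1 2 3 4"
b = "1    2  3   4  "

# split(" ") splits on every single space, so runs of spaces yield empty strings
print(b.split(" "))

# split() with no argument splits on runs of whitespace and ignores the ends
print(b.split())

# after normalizing, the two rows compare equal
print(normalize(a) == normalize(b))
```

Appending `normalize(s)` instead of `s` to the per-file list would then make differently spaced rows land on the same set element.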
1

The set intersection could be the way to achieve this, as already suggested in @Aryman's answer. To apply the operation on a sequence of undefined length, you can use functools.reduce.

from functools import reduce
from pathlib import Path


def lines(text_file):
    with open(text_file) as f:
        result = f.read().splitlines()
    return result


unique_lines = (set(lines(file)[1:])    # exclude the first line
                for file in Path('folder').glob('file*.txt'))

common_lines = reduce(lambda x, y: x & y, unique_lines)

print(list(common_lines))

where x & y is equivalent to x.intersection(y). You could also use operator.and_ instead of the lambda.
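As a small illustration of that equivalence, the sketch below uses made-up sets rather than the file contents. It also shows `set.intersection` called with several sets at once, which is another way to avoid the lambda.

```python
from functools import reduce
from operator import and_

# made-up example data, standing in for the per-file sets of lines
sets = [{"a", "b", "c"}, {"b", "c", "d"}, {"c", "b"}]

common_reduce = reduce(and_, sets)        # folds x & y over the list
common_direct = set.intersection(*sets)   # intersects all sets in one call

print(common_reduce == common_direct)
```

Both forms need at least one set: with an empty list, each of them raises a `TypeError`.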

Output:

['667.378  687.189  474.277  666.181', '341.845  245.408  163.417  212.897']
mportes
  • Is there a way to ignore the spaces within a line when comparing? I mean, 1 2 3 4 and 1    2  3  4 are the same. – Akash Tadwai Oct 29 '20 at 11:41
  • Yes, by calling `line.split()` you will get a list of values without the spaces. However, a list cannot be an element of a set, so you would have to store the values in a tuple like `tuple(line.split())`. – mportes Oct 30 '20 at 09:51
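The `tuple(line.split())` suggestion could be sketched as follows. This is only one possible arrangement: the function names `read_rows`, `common_rows` and `write_result` are hypothetical, and the tokens are compared as strings, so `1.0` and `1.00` would still count as different values.

```python
from pathlib import Path


def read_rows(path):
    """Return a file's rows as a set of token tuples, skipping the count line."""
    lines = Path(path).read_text().splitlines()
    return {tuple(line.split()) for line in lines[1:] if line.strip()}


def common_rows(paths):
    """Intersect the row sets of all given files (needs at least one path)."""
    row_sets = [read_rows(p) for p in paths]
    return set.intersection(*row_sets)


def write_result(rows, out_path):
    """Write the row count, then each row with single spaces between values."""
    ordered = sorted(rows)  # sorted only so that the output order is stable
    with open(out_path, "w") as f:
        f.write(f"{len(ordered)}\n")
        for row in ordered:
            f.write(" ".join(row) + "\n")
```

With the layout from the question, a call such as `write_result(common_rows(Path('folder').glob('file*.txt')), 'res.txt')` would produce the count followed by the common rows.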
0

I had a go at programming a solution to the problem, which I have pasted below. It's all commented, so I hope it's easily readable :)

import os  # a library for accessing the os

all_rows = []  # to load all lines into
res = []  # to load result into
number_files = 0
path_to_files = "."  # you can use "." if your files are in the same directory as the .py file

for file in os.listdir(path_to_files):  # lists all files in that directory
    if file.startswith("file") and file.endswith(".txt"):
        number_files += 1  # keep a count of number of files for later
        with open(os.path.join(path_to_files, file), "r") as f:  # the with block closes the file
            content = f.readlines()  # read all lines
        content = [x.strip() for x in content]  # remove \n and surrounding spaces from lines
        all_rows.extend(content)  # add all items of content to all_rows without creating a 2d list
# note: the count below assumes that a row appears at most once per file
for i in range(1, int(all_rows[0]) + 1):  # all rows in the first file that was listed
    if all_rows.count(all_rows[i]) == number_files:  # if row occurs in all files
        res.append(all_rows[i])  # append to res
res.insert(0, str(len(res)))  # insert number of rows into res
with open(os.path.join(path_to_files, "res.txt"), "w") as r:  # create res.txt in that directory
    for row in res:  # for every row which all files have in common
        r.write(row + "\n")  # add newline character
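The counting idea above can also be sketched with `collections.Counter`, counting each row once per file so that a row repeated inside a single file is not mistaken for a row shared by all files. The function name `rows_in_all` and its input shape (one list of row strings per file, count lines already removed) are assumptions made for this illustration.

```python
from collections import Counter


def rows_in_all(files_rows):
    """Return rows present in every file, in the order of the first file.

    files_rows is a list with one list of row strings per file
    (hypothetical input shape, with the count lines already removed).
    """
    counts = Counter()
    for rows in files_rows:
        counts.update(set(rows))  # each row counts once per file

    seen = set()
    result = []
    for row in files_rows[0]:
        if counts[row] == len(files_rows) and row not in seen:
            seen.add(row)
            result.append(row)
    return result
```

Compared with calling `list.count` for every row, this makes one pass over each file, and it keeps the rows in the order in which they appear in the first file.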