I have a folder of 24 different files that all have the same tab-separated format. This is an example:

zinc-n  with-iodide-n   8.0430  X
zinc-n  with-amount-of-supplement-n 12.7774 X
zinc-n  with-value-of-horizon-n 14.5585 X
zirconium-n as-valence-n    11.3255 X
zirconium-n for-form-of-norm-n  15.4607 X

I want to join the files in every possible combination of 2.

For instance, I want to join File 1 and File 2, File 1 and File 3, File 1 and File 4, and so on, until EACH file has been joined with EACH other file, covering all the UNIQUE combinations (276 output files for 24 inputs).

I know this can be done for instance in the Terminal with cat.

i.e.

cat File1 File2 > File1File2
cat File1 File3 > File1File3

... and so on.

But, to do this for each unique combination would be an extremely laborious process.

Is it possible to automate this process and join all of the unique combinations from the command line in Terminal, with grep for instance? Or is there perhaps a more optimized solution than cat?

owwoow14

1 Answer

You can try with Python. I use the combinations() function from the itertools module and join() the contents of each pair of files. Note that I cache each file's contents to avoid reading it many times, but this could exhaust your memory with large files, so choose the approach that suits you best:

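import sys
import itertools

# Cache of file contents, so each input file is read from disk only once.
seen = {}

def read_cached(path):
    # Read the file on first use and keep its contents in memory.
    if path not in seen:
        with open(path, 'r') as fh:
            seen[path] = fh.read()
    return seen[path]

for files in itertools.combinations(sys.argv[1:], 2):
    # The output name is the two input names concatenated, e.g. file1file2.
    outfile = ''.join(files)
    f1_data = read_cached(files[0])
    f2_data = read_cached(files[1])

    # The with block closes the output file once the pair is written.
    with open(outfile, 'w') as oh:
        print('\n'.join([f1_data, f2_data]), file=oh)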
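If memory is a concern, a variant without the cache could look like the minimal sketch below. It streams each input into the output with shutil.copyfileobj() instead of holding file contents in a dictionary, at the cost of re-reading every file once per pair. The helper names concat_pair and join_all_pairs are hypothetical ones chosen for illustration, and the sketch assumes every input file already ends with a newline, since it inserts no separator between the two files.

```python
import itertools
import shutil
import sys


def concat_pair(first, second, outfile):
    """Write the bytes of first, then second, into outfile (no caching)."""
    with open(outfile, 'wb') as out:
        for path in (first, second):
            with open(path, 'rb') as src:
                # Copies in chunks, so large files are never fully in memory.
                shutil.copyfileobj(src, out)


def join_all_pairs(paths):
    """Create one output file per unordered pair and return the new names."""
    created = []
    for first, second in itertools.combinations(paths, 2):
        outfile = first + second  # same naming scheme: file1 + file2 -> file1file2
        concat_pair(first, second, outfile)
        created.append(outfile)
    return created


if __name__ == '__main__':
    join_all_pairs(sys.argv[1:])
```

It would be run the same way as the script above, with the input file names as arguments.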

A test:

Assuming the following content of three files:

==> file1 <==
file1 one
f1 two

==> file2 <==
file2 one
file2 two

==> file3 <==
file3 one
f3 two
f3 three

Run the script like:

python3 script.py file[123]

And it will create three new files with content:

==> file1file2 <==
file1 one
f1 two
file2 one
file2 two


==> file1file3 <==
file1 one
f1 two
file3 one
f3 two
f3 three


==> file2file3 <==
file2 one
file2 two
file3 one
f3 two
f3 three
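
The number of output files follows from the pair count, which can be checked with a short sketch. The helper pair_names is a hypothetical name used only for illustration; math.comb() and math.perm() require Python 3.8 or later. Because combinations() yields unordered pairs, 24 input files give 276 outputs; 552 would be the count if order mattered, as with itertools.permutations().

```python
import itertools
import math


def pair_names(names):
    """Return the output names the script would produce for these inputs."""
    return [a + b for a, b in itertools.combinations(names, 2)]


print(pair_names(['file1', 'file2', 'file3']))
# ['file1file2', 'file1file3', 'file2file3']

print(math.comb(24, 2))  # unordered pairs of 24 files: 276
print(math.perm(24, 2))  # ordered pairs of 24 files: 552
```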
Birei
  • How can you maintain the format of the original files? For instance, after joining the two files, the line ending is not recognized (i.e. one of the result files is the following: zirconium-n for-form-of-norm-n 15.4607 Xzinc-n with-iodide-n 8.0430 X zinc-n with-amount-of-supplement-n 12.7774 X). I modified the print to print(''.join([f1_data, f2_data] + "\n"), file=oh) but it gives me an error. What would you suggest? – owwoow14 Oct 22 '13 at 11:48
  • @owwoow14: Python3 or Python2? Windows, Linux or what OS? – Birei Oct 22 '13 at 11:51
  • I used Python3 (as in your example). I tried this on both Mac OS X and Linux. – owwoow14 Oct 22 '13 at 11:59
  • Do you mean that it removes all newline characters? That's strange. I guess it's more a problem with your input data than with `python`. What if you create a test like mine with new files, does it work? – Birei Oct 22 '13 at 12:26
  • In fact, I did create "trial" files of 2 different lines (4 tab-separated columns) in three files (which I named File1, File2 and File3) that have examples of the exact content in the actual files that I want to work with. I checked the encoding of the files (I created them using TextMate on a Mac) and it says MacOSRoman, which should recognize the newline characters. – owwoow14 Oct 22 '13 at 12:28
  • @owwoow14: Sorry. It's difficult for me to help when I can't reproduce your problem. You could upload (to `pastebin` or similar) some test files that don't work and let me test them. – Birei Oct 22 '13 at 12:33
  • Modifying "print" to print('\n'.join([f1_data, f2_data]), file=oh) solves the problem! I tried to modify the code in your answer but my edit is in the queue. Thanks again. – owwoow14 Oct 22 '13 at 13:55
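
The symptom in the comments (the last line of one file running into the first line of the next) is what one would expect when an input file does not end with a newline, although the thread does not confirm that this was the cause. Under that assumption, a minimal sketch of a join that adds a newline only where one is missing could look like this; join_with_newline is a hypothetical helper name, not part of the answer's script.

```python
def join_with_newline(first_text, second_text):
    """Concatenate two texts, making sure each non-empty one ends with a newline."""
    parts = []
    for text in (first_text, second_text):
        if text and not text.endswith('\n'):
            text += '\n'  # add the terminator only when it is missing
        parts.append(text)
    return ''.join(parts)


# The first text lacks a trailing newline, the second has one.
joined = join_with_newline('zirconium-n\tas-valence-n', 'zinc-n\twith-iodide-n\n')
print(repr(joined))
# 'zirconium-n\tas-valence-n\nzinc-n\twith-iodide-n\n'
```

Unlike joining with '\n' unconditionally, this leaves no blank line between files that already end with a newline.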