0

I have two very long text files (thousands of e-mail addresses, one per line) and I'm looking for a way to compare them and output the addresses contained in the first file or in the second file, but not in both (something like (A ∪ B) \ (A ∩ B) in set theory). It would be pretty easy if I could use lists of strings as input, like this:

input1=['address1','address2',...,'addressn']

but since my text files are long, with one address per line, I would have to wrap each address in quotes by hand. So I tried to use a single string with all the addresses separated by spaces as input, and then convert it into a list of strings. This is what I've come up with:

import numpy as np
from StringIO import StringIO

def conv(data):
    array1=np.genfromtxt(StringIO(data),dtype="|S50")
    lista1=[]
    for el in array1:
        lista1.append(el)
    return lista1

input1='address1 address2 ... addressn'

And this is what I get when I call the function

>conv(input1)
>['address1', 'address2', 'addressn']

It works, but I have a problem: the input needs to be on a single line, so I cannot copy my addresses from the text file and paste them into a string, as I would get something like

input1="Davide
...:Michele
...:Giorgio
...:Paolo"

File "<ipython-input-4-6d70053fb94e>", line 1
  input1="Davide
             ^
SyntaxError: EOL while scanning string literal

How can I deal with this issue? Any suggestion to improve the code would be very much appreciated. I know almost nothing about the StringIO module (I came across it today for the first time), and I'm sure it's possible to write a much more efficient program than mine. This is the whole program, by the way:

def scan(data1,data2): #Strings
    array1=np.genfromtxt(StringIO(data1),dtype="|S50")
    array2=np.genfromtxt(StringIO(data2),dtype="|S50")
    lista1=[]
    lista2=[]
    for el in array1:
        lista1.append(el)
    for el in array2:
        lista2.append(el) #lista1 and lista2 are lists containing strings
    num1,num2=len(lista1),len(lista2)
    shared=[]
    for el in lista1:
        if el in lista2:
            shared.append(el) #shared is the intersection of lista1 and lista2
    if len(shared)==0:
        print 'No shared elements'
        return lista1+lista2
    else:
        for el in shared:
            n1=lista1.count(el)
            for i in range(n1):
                lista1.remove(el) #Removes from lista1 the elements shared with lista2
            n2=lista2.count(el)   #as many times as they appear
            for j in range(n2):
                lista2.remove(el) #Removes from lista2 the elements shared with lista1
    result=lista1+lista2          #as many times as they appear
    print 'Addresses list 1:',num1
    print 'Addresses list 2:',num2
    print 'Useful Addresses:',len(list(set(result)))
    return (list(set(result)))

and this is an example of how it works:

data1="Davide John Kate Mary Susan"
data2="John Alice Clara Kate John Alex"
scan(data1,data2)
>Addresses list 1: 5
>Addresses list 2: 6
>Useful Addresses: 6
>['Alex', 'Susan', 'Clara', 'Alice', 'Mary', 'Davide']

Thanks for help :)

DavideL

3 Answers

1

Use triple quotes around a string spanning multiple lines:

input1="""Davide
Michele
Giorgio
Paolo"""

They will then be separated by newlines ("\n"), so you could use input1.split('\n') to turn the string into a list.
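
As a small variation on the split above, here is a sketch that uses str.splitlines() and strips each line, so that Windows line endings, trailing spaces, and blank lines in the pasted text do not end up in the list. The helper name lines_to_addresses is hypothetical, not something from the question.

```python
def lines_to_addresses(text):
    """Split pasted multi-line text into a list of non-empty, stripped lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Pasted text with a blank line in the middle and a trailing newline.
pasted = """Davide
Michele

Giorgio
Paolo
"""

addresses = lines_to_addresses(pasted)
print(addresses)
```

Plain split('\n') would keep the empty strings produced by the blank line and the final newline, which then show up as a bogus "address" in any later set operation.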

Using set objects, your operation becomes pretty simple. To get the elements in s1 that are not in s2, we can simply do s1 - s2. Union is | and intersection is &, so all told we have:

s1 = set(input1.split('\n'))
s2 = set(input2.split('\n'))
addresses_in_only_one_file = (s1 | s2) - (s1 & s2)
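
Since the addresses already live in text files with one address per line, it may be simpler to skip the copy-and-paste step and read the files directly. The following is a minimal sketch under that assumption; the function names and the file names list1.txt and list2.txt are placeholders, and the demo writes two throwaway files only so that the example runs on its own.

```python
import os
import tempfile


def read_addresses(path):
    """Return the set of non-empty, stripped lines in the file at `path`."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def addresses_in_only_one_file(path_a, path_b):
    """Return the addresses that appear in exactly one of the two files."""
    first = read_addresses(path_a)
    second = read_addresses(path_b)
    return (first | second) - (first & second)


# Demo: two temporary files standing in for the real address lists.
demo_dir = tempfile.mkdtemp()
path_a = os.path.join(demo_dir, 'list1.txt')
path_b = os.path.join(demo_dir, 'list2.txt')
with open(path_a, 'w') as handle:
    handle.write('Davide\nJohn\nKate\nMary\nSusan\n')
with open(path_b, 'w') as handle:
    handle.write('John\nAlice\nClara\nKate\nJohn\nAlex\n')

only_one = addresses_in_only_one_file(path_a, path_b)
print(sorted(only_one))
```

Iterating over the file object reads one line at a time, so files with thousands of addresses are not a problem.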
irh
  • Thanks, triple quotes and .split('\n') are what I was looking for. It's funny you did in three lines what I did in almost 30. Thanks! – DavideL May 04 '14 at 10:36
1

Expanding upon @irh's answer, you can use sets to get the symmetric difference between the two lists (the elements in list1 or list2, but not in both):

list1 = ['address1', 'address2', 'address3']

list2 = ['address5', 'address4', 'address3']

result = list(set(list1) ^ set(list2))

>>> print result
['address1', 'address2', 'address4', 'address5']     #note result might be jumbled but that shouldn't matter
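
To make the relationship between the spellings explicit, here is a short sketch showing that the ^ operator, the symmetric_difference() method, and the union-minus-intersection expression all produce the same set. The variable names are illustrative only, and sorted() is used just to get a stable order for printing.

```python
list1 = ['address1', 'address2', 'address3']
list2 = ['address5', 'address4', 'address3']

s1, s2 = set(list1), set(list2)

# Three equivalent ways to get the elements that are in exactly one of the sets.
via_operator = s1 ^ s2
via_method = s1.symmetric_difference(list2)  # the method accepts any iterable
via_union_minus_intersection = (s1 | s2) - (s1 & s2)

# Sets are unordered, so sort if a reproducible order is wanted.
result = sorted(via_operator)
print(result)
```

Note that the ^ operator requires both operands to be sets, while the method form also accepts a plain list.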
sshashank124
  • Actually no problem about result being jumbled. The symmetric difference is useful, thanks. Now I'm having a look at all the Sets functions. – DavideL May 04 '14 at 10:37
  • @DavideL, Python has a __lot__ of powerful tools. Really helps to get to know them. Have fun. – sshashank124 May 04 '14 at 10:39
0
shared =[]
for el in lista1:
    if el in lista2:
        shared.append(el) #shared is the intersection of lista1 and lista2

In [10]: lista1=[1,2,3,4,5,6,7,8,9]

In [11]: lista2=[1,2,3,10,11,12,13]

In [12]: lista1=set(lista1)

In [13]: shared = lista1.intersection(lista2) # same as your loop above

In [14]: shared
Out[14]: {1, 2, 3}

If you want a list, just use list(lista1.intersection(lista2)).

for el in shared:
    n1=lista1.count(el)
    for i in range(n1):
        lista1.remove(el) #Removes from lista1 the elements shared with lista2
    n2=lista2.count(el)   #as many times as they appear
    for j in range(n2):
        lista2.remove(el)
result=lista1+lista2

In [15]: list(lista1.symmetric_difference(lista2)) # same as the loops above; lista1 is already a set from In [12]
Out[15]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
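
Putting the two replacements together, the scan function from the question could be reduced to something like the sketch below. This is one possible rewrite, not the asker's code: it assumes whitespace-separated input strings as in the question's example, uses str.split() instead of NumPy and StringIO, returns a sorted list, and drops the separate 'No shared elements' branch (which in the original returned the concatenated lists with duplicates intact).

```python
def scan(data1, data2):
    """Set-based sketch of scan(); inputs are whitespace-separated strings."""
    lista1 = data1.split()
    lista2 = data2.split()
    # Elements that appear in exactly one of the two lists.
    result = set(lista1).symmetric_difference(lista2)
    print('Addresses list 1: %d' % len(lista1))
    print('Addresses list 2: %d' % len(lista2))
    print('Useful Addresses: %d' % len(result))
    return sorted(result)


useful = scan("Davide John Kate Mary Susan",
              "John Alice Clara Kate John Alex")
print(useful)
```

Membership tests on a set take roughly constant time on average, whereas `el in lista2` and `lista1.remove(el)` each walk the whole list, so the set version should scale much better for thousands of addresses.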
Padraic Cunningham