Python - Converting long list of addresses into list of strings and intersection of lists

Question

I have two very long text files (thousands of e-mail addresses, one per line) and I'm looking for a way to compare the two files and output the addresses contained in the first file or in the second file but not in both of them (something like (A∪B)\(A∩B) in set theory). It would be pretty easy if I could use lists containing strings as input, like this

input1=['address1','address2',...,'addressn']

but since my text file is long and has one address per line, I would have to manually wrap each address in quotes. So I tried to use a single string with all the addresses separated by spaces as input, and then to convert it into a list of strings. This is what I've come up with:

import numpy as np
from StringIO import StringIO

def conv(data):
    array1=np.genfromtxt(StringIO(data),dtype="|S50")
    lista1=[]
    for el in array1:
        lista1.append(el)
    return lista1

input1='address1 address2 ... addressn'

And this is what I get when I call the function

>conv(input1)
>['address1', 'address2', 'addressn']

It works, but I have a problem: the input needs to be on a single line, so I cannot copy my addresses from the text file and paste them into a string, as I would get something like

input1="Davide
...:Michele
...:Giorgio
...:Paolo"

File "<ipython-input-4-6d70053fb94e>", line 1
  input1="Davide
             ^
SyntaxError: EOL while scanning string literal

How can I deal with this issue? Any suggestion to improve the code would be much appreciated. I know almost nothing about the StringIO module (I came across it today for the first time), and I'm sure it's possible to write a much more efficient program than mine. This is the whole program, by the way:

def scan(data1,data2): #Strings
    array1=np.genfromtxt(StringIO(data1),dtype="|S50")
    array2=np.genfromtxt(StringIO(data2),dtype="|S50")
    lista1=[]
    lista2=[]
    for el in array1:
        lista1.append(el)
    for el in array2:
        lista2.append(el) #lista1 and lista2 are lists containing strings
    num1,num2=len(lista1),len(lista2)
    shared=[]
    for el in lista1:
        if el in lista2:
            shared.append(el) #shared is the intersection of lista1 and lista2
    if len(shared)==0:
        print 'No shared elements'
        return lista1+lista2
    else:
        for el in shared:
            n1=lista1.count(el)
            for i in range(n1):
                lista1.remove(el) #Removes from lista1 the elements shared with lista2
            n2=lista2.count(el)   #as many times as they appear
            for j in range(n2):
                lista2.remove(el) #Removes from lista2 the elements shared with lista1
    result=lista1+lista2          #as many times as they appear
    print 'Addresses list 1:',num1
    print 'Addresses list 2:',num2
    print 'Useful Addresses:',len(list(set(result)))
    return (list(set(result)))

and this is an example of how it works:

data1="Davide John Kate Mary Susan"
data2="John Alice Clara Kate John Alex"
scan(data1,data2)
>Addresses list 1: 5
>Addresses list 2: 6
>Useful Addresses: 6
>['Alex', 'Susan', 'Clara', 'Alice', 'Mary', 'Davide']

Thanks for help :)

I imagine using sets https://docs.python.org/2/library/sets.html would be useful – Padraic Cunningham, May 03 '14 at 23:45
But I think there might be a problem using sets, defined as an unordered collection of unique elements: I may have duplicates in each starting list. – DavideL, May 04 '14 at 10:11
Nope, I just realized that sets automatically remove duplicates, so it works perfectly. That's an incredibly useful module I didn't know about, thanks so much. – DavideL, May 04 '14 at 10:32
I added an answer to show how to replace some of your loops with sets. – Padraic Cunningham, May 04 '14 at 10:37

irh · Accepted Answer · 2014-05-03T23:55:29.420

1

Use triple quotes around a string spanning multiple lines:

input1="""Davide
Michele
Giorgio
Paolo"""

They will then be separated by newlines ("\n"), so you could use input1.split('\n') to turn it into a list.
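One caveat with pasted text is that blank lines or a trailing newline leave empty strings in the result of split('\n'). The following is a minimal sketch of one way to handle that, assuming the blank entries are not wanted; the sample names are taken from the question and the extra blank line is added only for illustration.

```python
# Illustrative data only: a blank line and a trailing newline are included on purpose.
input1 = """Davide
Michele

Giorgio
Paolo
"""

# split('\n') keeps an empty string for the blank line and another after the final newline
raw = input1.split('\n')
print(raw)

# splitlines() handles the line breaks; the filter drops blank lines and strips stray whitespace
addresses = [line.strip() for line in input1.splitlines() if line.strip()]
print(addresses)
```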

Using set objects, your operation becomes pretty simple. To get the elements in s1 that are not in s2, we can simply do s1 - s2. Union is just | and intersection is just &, so all told we have:

s1 = set(input1.split('\n'))
s2 = set(input2.split('\n'))
adresses_in_only_one_file = (s1 | s2) - (s1 & s2)
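Since the addresses already live in text files with one address per line, it may be simpler to skip the copy and paste step and read the files directly. This is only a sketch under that assumption: the helper names read_addresses and addresses_in_only_one_file are hypothetical, and so are any file names passed to them.

```python
def read_addresses(path):
    """Return the set of non-empty, whitespace-stripped lines in the file at `path`."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def addresses_in_only_one_file(path1, path2):
    """Return the addresses that appear in exactly one of the two files."""
    s1 = read_addresses(path1)
    s2 = read_addresses(path2)
    # Same expression as above: everything in either set, minus what both share
    return (s1 | s2) - (s1 & s2)
```

A call would look like addresses_in_only_one_file('list1.txt', 'list2.txt'), where both file names are placeholders for the real files.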

edited May 03 '14 at 23:55

answered May 03 '14 at 23:46

irh


Thanks, triple quotes and .split('\n') are what I was looking for. It's funny you did in three lines what I did in almost 30. Thanks! – DavideL May 04 '14 at 10:36

score 1 · Answer 2 · answered May 03 '14 at 23:58

1

Expanding upon @irh's answer, you can then use sets to get the symmetric difference between the two lists (the elements that are in list1 or in list2 but not in both):

list1 = ['address1', 'address2', 'address3']

list2 = ['address5', 'address4', 'address3']

result = list(set(list1) ^ set(list2))

>>> print result
['address1', 'address2', 'address4', 'address5']     #note result might be jumbled but that shouldn't matter
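As a quick check, the ^ operator, the symmetric_difference method, and the union-minus-intersection form from the accepted answer should all give the same set, and duplicates in the input lists collapse automatically. The sketch below assumes a reproducible order is wanted and uses sorted() for that; the duplicated 'address1' is added only for illustration.

```python
list1 = ['address1', 'address2', 'address3', 'address1']  # 'address1' is duplicated on purpose
list2 = ['address5', 'address4', 'address3']

s1, s2 = set(list1), set(list2)

# The operator, the named method, and the union-minus-intersection form all agree
assert s1 ^ s2 == s1.symmetric_difference(s2) == (s1 | s2) - (s1 & s2)

# sorted() turns the unordered set into a list with a predictable order
result = sorted(s1 ^ s2)
print(result)  # ['address1', 'address2', 'address4', 'address5']
```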

answered May 03 '14 at 23:58

sshashank124


Actually no problem about result being jumbled. The symmetric difference is useful, thanks. Now I'm having a look at all the Sets functions. – DavideL May 04 '14 at 10:37
@DavideL, Python has a lot of powerful tools. Really helps to get to know them. Have fun. – sshashank124 May 04 '14 at 10:39

score 0 · Answer 3 · answered May 04 '14 at 10:37

shared =[]
for el in lista1:
    if el in lista2:
        shared.append(el) #shared is the intersection of lista1 and lista2

In [10]: lista1=[1,2,3,4,5,6,7,8,9]

In [11]: lista2=[1,2,3,10,11,12,13]

In [12]: lista1=set(lista1)

In [13]: shared = lista1.intersection(lista2) # same as your loop above

In [14]: shared
Out[14]: {1, 2, 3}

If you want a list just use list(lista1.intersection(lista2))

for el in shared:
    n1=lista1.count(el)
    for i in range(n1):
        lista1.remove(el) #Removes from lista1 the elements shared with lista2
    n2=lista2.count(el)   #as many times as they appear
    for j in range(n2):
        lista2.remove(el)
result=lista1+lista2

lista1=set(lista1)
In [15]: list(lista1.symmetric_difference(lista2))
Out[15]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13] # same as above.
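Putting these pieces together, the scan function from the question could be reduced to something like the sketch below. It is written for Python 3 (print as a function) rather than the Python 2 of the original, it returns a sorted list instead of an arbitrarily ordered one, and it leaves out the 'No shared elements' message; those are choices made for this example, not part of the original code.

```python
def scan(data1, data2):
    """Return the items found in exactly one of two whitespace-separated strings."""
    lista1 = data1.split()  # split() with no argument splits on spaces and newlines alike
    lista2 = data2.split()
    # symmetric_difference accepts any iterable, so only lista1 needs converting
    result = set(lista1).symmetric_difference(lista2)
    print('Addresses list 1:', len(lista1))
    print('Addresses list 2:', len(lista2))
    print('Useful Addresses:', len(result))
    return sorted(result)


data1 = "Davide John Kate Mary Susan"
data2 = "John Alice Clara Kate John Alex"
useful = scan(data1, data2)
```

With the question's sample data this reports 5 and 6 input addresses and 6 useful ones, matching the original output.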
