
I have a file containing 10,000 words. I wrote a program to find the anagram words in that file, but it takes too much time to produce the output. For a small file the program works well. How can I optimize the code?

count=0
i=0
j=0
with open('file.txt') as file:
  lines = [i.strip() for i in file]
  for i in range(len(lines)):
      for j in range(i):
          if sorted(lines[i]) == sorted(lines[j]):
              #print(lines[i])
              count=count+1
              j=j+1
              i=i+1
print('There are ',count,'anagram words')
Nabeinz kc
  • If you sort the whole file content once you would not have to check each pair of entries but just the entries starting with the same letter for example. Also, once sorted you can start the inner loop at the index the outer loop is at, because earlier matches will have been found already. – LuckyJosh Apr 28 '19 at 11:58
  • Are you trying to anagram the file against itself or against a specific word? That nested loop is definitely not needed to do the former... – Jon Clements Apr 28 '19 at 12:00
  • do you consider duplicate anagrams? like what if the same couple is present more than once? – SuperKogito Apr 28 '19 at 12:15
  • against itself... – Nabeinz kc Apr 29 '19 at 01:25
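The sort-first idea from the comments can be sketched as follows. The word list here is a hypothetical stand-in for the file's contents (assumed one word per line): keying each word by its sorted letters and sorting on that key makes anagrams adjacent, so a single linear pass with itertools.groupby collects them.

```python
from itertools import groupby

# Hypothetical stand-in for the words read from file.txt
words = ["listen", "silent", "enlist", "google", "banana"]

# Key each word by its sorted letters; anagrams share the same key
keyed = sorted(words, key=lambda w: sorted(w))

# After sorting, anagrams are adjacent, so one pass groups them
groups = [list(g) for _, g in groupby(keyed, key=lambda w: sorted(w))]
anagram_groups = [g for g in groups if len(g) > 1]
print(anagram_groups)  # [['listen', 'silent', 'enlist']]
```

This replaces the O(n^2) pairwise comparison with an O(n log n) sort plus a linear scan.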

2 Answers


I don't fully understand your code (for example, why do you increment i and j inside the loop?). But the main problem is that you have a nested loop, which makes the runtime of the algorithm O(n^2), i.e. if the file becomes 10 times as large, the execution time will become (approximately) 100 times as long.

So you need a way to avoid that. One possible way is to store the lines in a smarter way, so that you don't have to walk through all lines every time. Then the runtime becomes O(n). In this case you can use the fact that anagrams consist of the same characters (only in a different order). So you can use the "sorted" variant of each line as a key in a dictionary, and store all lines that can be made from the same letters in a list under that key. There are other possibilities of course, but in this case I think it works out quite nicely :-)

So, fully working example code:

#!/usr/bin/env python3

from collections import defaultdict

d = defaultdict(list)
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    for line in lines:
        sorted_line = ''.join(sorted(line))
        d[sorted_line].append(line)

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')

UPDATE

Without duplicates, and without using collections (as requested by OP in a comment, although I strongly recommend using it):

#!/usr/bin/env python3

d = {}
with open('file.txt') as file:
    lines = [line.strip() for line in file]
    lines = set(lines)  # remove duplicates
    for line in lines:
        sorted_line = ''.join(sorted(line))
        if sorted_line in d:
            d[sorted_line].append(line)
        else:
            d[sorted_line] = [line]

anagrams = [d[k] for k in d if len(d[k]) > 1]
# anagrams is a list of lists of lines that are anagrams

# I would say the number of anagrams is:
count = sum(map(len, anagrams))
# ... but in your example you're not counting the first words, only the "duplicates", so:
count -= len(anagrams)
print('There are', count, 'anagram words')
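As a side note, the if/else above can also be collapsed using dict.setdefault, which still stays within plain dicts (no collections needed). A minimal sketch, using a hypothetical stand-in for the file's lines:

```python
d = {}
for line in ["listen", "silent", "google"]:  # stand-in for the deduplicated lines
    sorted_line = ''.join(sorted(line))
    # setdefault returns the existing list for this key, or inserts [] first
    d.setdefault(sorted_line, []).append(line)
print(d)  # {'eilnst': ['listen', 'silent'], 'eggloo': ['google']}
```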
wovano
  • This is working perfectly, but I am looking for a solution without using collections. – Nabeinz kc Apr 29 '19 at 11:38
  • @Nabeinzkc, what is the reason you don't want to use collections? They are in the standard library, so there are no disadvantages I can think of. Of course it is possible to do it without, but it costs a few extra lines of code, increasing your inner loop from 2 to 5 lines, of which most lines of code are not necessary and only distract from the main functionality (i.e. making the code less readable). – wovano Apr 29 '19 at 12:10
  • I've updated the answer to remove duplicates (as you mentioned elsewhere in this thread) and without using collections. – wovano Apr 29 '19 at 12:16

Well, it is unclear whether you account for duplicates or not. However, if you don't, you can remove duplicates from your list of words, which in my opinion will save you a large amount of runtime. You can then check for anagrams and use sum() to get their total number. This should do it:

def get_unique_words(lines):
    unique = [] 
    for word in " ".join(lines).split(" "): 
        if word not in unique:
            unique.append(word)
    return unique 

def check_for_anagrams(test_word, words):
    return sum([1 for word in words if (sorted(test_word) == sorted(word) and word != test_word)])

with open('file.txt') as file:
    lines = [line.strip() for line in file]


unique = get_unique_words(lines)
count  = sum([check_for_anagrams(word, unique) for word in unique])

print('There are ', count,'unique anagram words aka', int(count/2), 'unique anagram couples')
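One possible refinement, assuming insertion order must be preserved: the `word not in unique` test scans the whole list for every word, which is itself quadratic overall. Backing it with a set keeps the membership check constant-time on average. A sketch of a drop-in replacement:

```python
def get_unique_words(lines):
    seen = set()
    unique = []
    for word in " ".join(lines).split(" "):
        # set lookup is O(1) on average, vs O(n) for `word not in unique`
        if word not in seen:
            seen.add(word)
            unique.append(word)
    return unique

print(get_unique_words(["a b a", "c b"]))  # ['a', 'b', 'c']
```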
SuperKogito
  • 1
    Removing duplicates (if that is the requirement) will indeed decrease the runtime, but I doubt it will be a huge difference. Besides, the algorithm is still _O(n^2)_, so it doesn't scale well (i.e. it will be slow for large files). The algorithm need to be changed in order to decrease the runtime much more, even for large files. – wovano Apr 28 '19 at 13:47
  • 1
    I tested this solution with an input file of 10.000 lines and it took approximately 3 minutes on my machine, while my solution runs in less than a second for 100.000 lines of input. There really _is_ an advantage of using the standard library effectively ;-) – wovano Apr 30 '19 at 15:46
  • Well, I did not run any tests, so it is for the OP to judge the efficiency of the proposed solutions. Though your code might have performed better for one sample, you should keep in mind that for redundant text mine might perform better. However, I agree that using the standard library is probably more efficient, as in most cases. I think the best solution is probably a combination of both. – SuperKogito Apr 30 '19 at 15:54