
I'm working with large files, in this case a file that has one word per line and over 300k lines. I'm trying to find a way to obtain the most common patterns present in the words of the file. For example, if I treat it as a small list:

a = [122, pass123, dav1, 1355122], it should recognize that "122" is commonly used.

It is important to do this efficiently; otherwise, the processing time becomes excessive given the number of words to check.

I have tried the following, which I saw in the post Python finding most common pattern in list of strings, but in my case it only returns the most common single characters in the file:

matches = Counter(reduce(lambda x, y: x + y, map(lambda x: x, list_of_words))).most_common(), where list_of_words is a list containing all the words in the file.
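
As far as I can tell, this only gives single characters because reduce concatenates all the words into one long string, and Counter over a string counts its characters. A small illustration with a made-up list:

from collections import Counter
from functools import reduce

list_of_words = ["122", "pass123", "dav1", "1355122"]  # made-up sample

# reduce(+) concatenates everything into one long string...
joined = reduce(lambda x, y: x + y, list_of_words)  # '122pass123dav11355122'
# ...so Counter counts individual characters, not substrings
print(Counter(joined).most_common())  # single characters such as ('1', 5), ('2', 5), ...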

Is there any way to obtain string matches of at least 3 characters instead of only single characters?

Thank you all for your help :)

David
  • `a = [122, pass123, dav1, 1355122]` is not valid Python syntax. – BrokenBenchmark Jan 07 '23 at 17:57
  • Yes, I forgot to include the “” in the post – David Jan 07 '23 at 18:12
  • @David, can you paste a more representative and bigger array of words (and expected common patterns for it) so I can test my solution for you? – RomanPerekhrest Jan 07 '23 at 18:25
  • @RomanPerekhrest Sure. I can give you the file but not the common patterns, since I was not able to get them... https://drive.google.com/file/d/1iFNF3w0xsU113tztIciu2dMPEDG-g-y4/view?usp=share_link – David Jan 08 '23 at 10:44

3 Answers


I tried this out:

from collections import Counter

def catalogue_patterns(word, min_len, max_len):
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len + 1):
        for end in range(
                start + min_len, min(start + max_len, n_chars) + 1):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns

which for catalogue_patterns('abcabcd', 3, 5) returns:

Counter({'abc': 2,
         'abca': 1,
         'abcab': 1,
         'bca': 1,
         'bcab': 1,
         'bcabc': 1,
         'cab': 1,
         'cabc': 1,
         'cabcd': 1,
         'abcd': 1,
         'bcd': 1})

Then

def catalogue_corpus(corpus, min_len, max_len):
    patterns = Counter()
    for word in corpus:
        patterns += catalogue_patterns(word, min_len, max_len)
    return patterns

patterns = catalogue_corpus(corpus, 3, 5)
print(patterns.most_common())

(where corpus would be your list of words). I ran it on a list of 100,000 artificially generated words and it took about 19 s. In a real corpus, where certain words are repeated frequently, you can memoize the function for additional speed. This is easy to do in Python with functools.lru_cache:

from functools import lru_cache

@lru_cache()
def catalogue_patterns_memoized(word, min_len, max_len):
    # same logic as catalogue_patterns; results for repeated words are cached
    n_chars = len(word)
    patterns = Counter()
    for start in range(n_chars - min_len + 1):
        for end in range(
                start + min_len, min(start + max_len, n_chars) + 1):
            seq = word[start:end]
            patterns[seq] += 1
    return patterns
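
You would then use it exactly like before (the wrapper name below is arbitrary); repeated words are served from the cache instead of being re-scanned, so this only pays off when the corpus actually contains many duplicates:

def catalogue_corpus_memoized(corpus, min_len, max_len):
    patterns = Counter()
    for word in corpus:
        # identical words hit the lru_cache instead of being re-counted
        patterns += catalogue_patterns_memoized(word, min_len, max_len)
    return patterns

patterns = catalogue_corpus_memoized(corpus, 3, 5)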

If speed is really an issue, though, you can get much better performance by doing this in C (or Cython) instead.

As a side note:

Counter(
    reduce(
        lambda x, y: x + y,
        map(lambda word: catalogue_patterns(word, 3, 5), corpus)))

took about 8x as long.

My artificial test corpus was generated using:

import numpy as np

def generate_random_words(n, mean_len=5):
    probs = np.array([10, 2, 6, 4, 20, 3, 4])
    probs = probs / probs.sum()
    return [
        ''.join(
            np.random.choice(
                list('abcdefg'),
                size=np.random.poisson(mean_len),
                p=probs))
        for _ in range(n)]

corpus = generate_random_words(100_000, 5)
print(corpus[:10])
  • Hi Damian, thank you very much. For some reason, I'm not getting the same timings as you. I'm using a wordlist of 84k words, but it's taking a lot of time (I ran it for 5 min and it didn't finish); the speed I'm getting is around 10k words/min. I can't use C or Cython because this functionality is part of a more complete tool in Python I'm implementing. – David Jan 08 '23 at 10:36
  • Could you try using the file I provided above? It needs to be converted to a list, however. This one has 300k words. – David Jan 08 '23 at 10:46

Try the nltk.probability.FreqDist module to find the number of times each token occurs:

import nltk

n = 3
with open('your.txt') as f:
    tokens = nltk.tokenize.word_tokenize(f.read())
    freq_dist = nltk.FreqDist(t.lower() for t in tokens)

    most_common = [(w, c) for w, c in freq_dist.most_common() if len(w) == n and c > 1]
    print(most_common)

The output for your current file:

[('new', 9), ('san', 5), ('can', 2), ('not', 2), ('gon', 2), ('las', 2), ('los', 2), ('piu', 2), ('usc', 2)]
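
If you want the recurring character patterns inside the tokens (the 3+ character substrings asked about) rather than whole tokens, a minimal sketch along the same lines, combining nltk.ngrams with FreqDist (lengths 3 to 5 chosen arbitrarily), could look like this:

import nltk

n_min, n_max = 3, 5
with open('your.txt') as f:
    tokens = nltk.tokenize.word_tokenize(f.read())

# count every character n-gram of length 3..5 inside each lowercased token
char_dist = nltk.FreqDist(
    ''.join(gram)
    for t in tokens
    for n in range(n_min, n_max + 1)
    for gram in nltk.ngrams(t.lower(), n))
print(char_dist.most_common(10))
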
RomanPerekhrest

After some playing around and searching here on Stack Overflow, I landed on a much faster method:

from collections import Counter
import re

import pandas as pd


INPATH = 'path/to/cain.txt'
OUTPATH = 'path/to/patterns.csv'  # save output
MIN_LEN = 3
MAX_LEN = 5


def re_method():
    print('Starting regex method....')
    with open(INPATH) as f:
        corpus = f.read().split('\n')
    alphanum = r'\w'
    patterns = Counter()
    for n in range(MIN_LEN, MAX_LEN + 1):
        # lookahead with a capturing group so overlapping matches are counted as well
        matches = re.findall(f'(?=({alphanum * n}))', str(corpus))
        patterns += Counter(matches)
    df = sort_patterns(patterns)
    print(df.head(30))
    df.to_csv(OUTPATH, index=False)


def sort_patterns(patterns):
    df = pd.DataFrame({'pattern': patterns.keys(), 'count': patterns.values()})
    df.sort_values('count', ascending=False, ignore_index=True, inplace=True)
    return df


if __name__ == '__main__':
    re_method()

Runs your full input file in just under 17s.

Special thanks to @Otto Allmendinger's answer to this post: How to find overlapping matches with a regexp?
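
For anyone wondering why the pattern is wrapped in a lookahead: a plain findall only returns non-overlapping matches, while the (?=(...)) form captures a match at every starting position. A small illustration:

import re

print(re.findall(r'\w{3}', 'abcd'))        # ['abc']  -- non-overlapping only
print(re.findall(r'(?=(\w{3}))', 'abcd'))  # ['abc', 'bcd']  -- overlapping matches captured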