Validation for repeated sub-string in a dataframe

Question

Suppose I have a dataframe like this:

df = pd.DataFrame({'A': ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"], 'B': [1, 2, 3, 4, 5]})
df.head()

Output:

A         B
asdfg     1
abcdef    2
ababab    3
ghhzgghz  4 
qwerty    5

How do I go about checking whether each value in column A contains a repeated substring? The expected output is:

A         B    C
asdfg     1    False
abcdef    2    False
ababab    3    True (matches for ab)
ghhzgghz  4    True (matches for gh)
qwerty    5    False

The general logic I know of is `return s in (s + s)[1:-1]`, but I want something streamlined that detects any repeated substring within each of these rows.
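A minimal sketch of why that check falls short here, assuming the hypothetical function name `is_whole_string_repetition`: it only detects strings made up entirely of one repeated block, so it accepts `ababab` but misses `ghhzgghz`, where `gh` repeats without tiling the whole string.

```python
def is_whole_string_repetition(s: str) -> bool:
    """Return True if s is some shorter block repeated two or more times."""
    # If s = block * k with k >= 2, s reappears inside s + s at an offset
    # other than 0 and len(s). Trimming one character from each end
    # removes exactly those two trivial matches.
    return s in (s + s)[1:-1]


for value in ["asdfg", "abcdef", "ababab", "ghhzgghz", "qwerty"]:
    print(value, is_whole_string_repetition(value))
```

Only `ababab` comes back as True, which is why a more general substring test is needed for the desired column C.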

jezrael · Answer 1 · 2020-03-20T06:16:52.817

The idea is to create all possible substrings, count them with Counter, and check whether at least one count is greater than 1:

from collections import Counter

# modified from https://stackoverflow.com/a/22470047/2901002
def test_substrings(input_string):
    length = len(input_string)
    # every contiguous substring input_string[i:j+1], single characters included
    s = [input_string[i:j+1] for i in range(length) for j in range(i, length)]
    # True if any substring occurs more than once
    return any(x > 1 for x in Counter(s).values())

Another solution that makes it easy to change the minimum length of the tested strings:

import pandas as pd
from collections import Counter
from itertools import chain, combinations

# changed the first word from asdfg to asdfa
df = pd.DataFrame({'A': ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"],
                   'B': [1, 2, 3, 4, 5]})

def test_substrings(input_string, N):
    # note: combinations() also yields non-contiguous character selections
    s = chain(*[combinations(input_string, x) for x in range(N, len(input_string) + 1)])
    # True if any selection of at least N characters occurs more than once
    return any(x > 1 for x in Counter(s).values())

df['C'] = df['A'].apply(lambda x: test_substrings(x, 1))
df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))
print(df)
          A  B      C      D
0     asdfa  1   True  False
1    abcdef  2  False  False
2    ababab  3   True   True
3  ghhzgghz  4   True   True
4    qwerty  5  False  False
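To make the difference between the two approaches concrete, here is a small sketch comparing what `itertools.combinations` produces with what slicing produces. The sample word `abxay` is an illustrative assumption and does not come from the question's data.

```python
from collections import Counter
from itertools import combinations

word = "abxay"

# combinations() picks characters by position, so gaps are allowed:
# ('a', 'y') is produced from positions (0, 4) and again from (3, 4).
subsequence_counts = Counter(combinations(word, 2))
print(subsequence_counts[("a", "y")])  # 2

# Slicing yields only contiguous runs, and here each one occurs once.
substring_counts = Counter(word[i:i + 2] for i in range(len(word) - 1))
print(substring_counts)
```

So a combinations-based test can report a repeat for a word that has no repeated contiguous substring of that length.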

edited Mar 20 '20 at 06:16

answered Mar 20 '20 at 05:55

jezrael


Are there any ways to reduce the running time? Combinations follow the nCr rule, so running this against a database with, say, close to 1M entries would take forever. – Mar 20 '20 at 06:53
@PukarAcharya - Unfortunately this operation is really complicated and needs many combinations of substrings, so it is slow. Btw, do you need to test only substrings with length > 1, like `df['D'] = df['A'].apply(lambda x: test_substrings(x, 2))`, so that `asdfa` is `False`? Or should `asdfa` be `True` because of the repeated `a`? – jezrael Mar 20 '20 at 07:23
It should be False. If that case were True, it would be like matching a subsequence. For a substring, the characters need to be consecutive. – Mar 20 '20 at 07:52
@PukarAcharya - So you mean the substring length has to be 2, 3, ...? Not just one letter? – jezrael Mar 20 '20 at 07:53
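One possible way to address both points in this thread (contiguous substrings only, and running time) is sketched below. It is not the answer's method. The function name `has_repeated_substring` and the parameter `min_len` are hypothetical, and the sketch assumes `min_len >= 1` and that overlapping occurrences count as repeats. It relies on the observation that if a substring of at least `min_len` characters starts at two different positions, its first `min_len` characters do too, so only windows of exactly `min_len` characters need to be checked.

```python
def has_repeated_substring(s: str, min_len: int = 2) -> bool:
    """Return True if some contiguous substring of at least min_len
    characters starts at two different positions in s (min_len >= 1)."""
    seen = set()
    # If a longer substring repeats, its first min_len characters repeat
    # too, so checking windows of exactly min_len characters is enough.
    for i in range(len(s) - min_len + 1):
        window = s[i:i + min_len]
        if window in seen:
            return True
        seen.add(window)
    return False


for value in ["asdfa", "abcdef", "ababab", "ghhzgghz", "qwerty"]:
    print(value, has_repeated_substring(value, 1), has_repeated_substring(value, 2))
```

Each string needs only one pass over its windows rather than an enumeration of all combinations, and the function could be applied to the column with something like `df['A'].apply(has_repeated_substring)`.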
