I have a string with repeated characters. My job is to find the starting index and ending index of each unique character in that string. Below is my code.

import re
x = "aaabbbbcc"
xs = set(x)
for item in xs:
    mo = re.search(item, x)
    flag = item
    m = mo.start()
    n = mo.end()
    print(flag, m, n)

Output:

a 0 1
b 3 4
c 7 8

Here the end indexes of the characters are not correct. I understand why it's happening, but how can I pass the character to be matched dynamically to the regex search function? For instance, if I hardcode the character in the search function, it produces the desired output:

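x = 'aabbbbccc'
xs = set(x)
mo = re.search("[b]+", x)
flag = "b"  # hardcoded here; `item` is not defined outside the loop
m = mo.start()
n = mo.end() - 1  # end() is exclusive, so subtract 1 for the inclusive index
print(flag, m, n)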

Output:

b 2 5

The code above gives the correct result, but here I can't pass the character to be matched dynamically. It would really help if someone could let me know how to achieve this; any hint will do. Thanks in advance.

Toto
Saikat
  • I have found a solution to my specific problem, but it would still be really helpful if anyone can solve it with regex. My code is below:
    ```
    x = "aaabbbbccc"
    xs = set(x)
    for item in xs:
        start = x.index(item)
        end = x.rindex(item)
        print(start, end)
    ```
    – Saikat Sep 21 '19 at 15:27
  • If you are using a PySpark DataFrame with Spark 2.4.0+, you can solve this directly with Spark SQL built-in functions; there is no need to write a UDF. – jxc Sep 21 '19 at 15:38
  • What if `x = "aaabbbbcccaa"`? How do you want to count `a`? – jxc Sep 21 '19 at 18:29
  • @jxc thanks for pointing it out. Logical error on my end. The output in that case should be
    ```
    a 0 2
    b 3 6
    c 7 9
    a 10 11
    ```
    Can you please point out the function to be used in this case in PySpark? I just started with PySpark, so I tried solving it using Python. – Saikat Sep 21 '19 at 18:52
  • Not a single function. I would use *regexp_replace* + *split* to create an array, then use *transform* and *aggregate* to calculate the accumulated length up to each array element, and derive the required numbers from that. – jxc Sep 21 '19 at 23:13
  • @jxc thanks for the tip. will try – Saikat Sep 21 '19 at 23:16
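
The multi-step idea from the comments (mark the run boundaries, split, then accumulate the lengths) can be mirrored in plain Python. This is only a sketch of the same logic, not Spark code: the function name `run_spans_split` and the separator `SEP` are made up for illustration, and it assumes `SEP` never occurs in the input.

```python
import re
from itertools import accumulate

SEP = "\x00"  # hypothetical separator; assumed not to occur in the input


def run_spans_split(text):
    """Return (char, start, inclusive_end) for each run of equal characters."""
    if not text:
        return []
    # Insert SEP after every character that is followed by a different one.
    marked = re.sub(
        r"(.)(?=.)(?!\1)",
        lambda m: m.group(1) + SEP,
        text,
        flags=re.DOTALL,
    )
    runs = marked.split(SEP)
    # Running totals of the run lengths are the exclusive end offsets.
    ends = accumulate(len(run) for run in runs)
    return [(run[0], end - len(run), end - 1) for run, end in zip(runs, ends)]


for ch, start, end in run_spans_split("aaabbbbcccaa"):
    print(ch, start, end)
```

For `"aaabbbbcccaa"` this reports the second run of `a` separately, which is the behaviour asked for in the follow-up comment.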

2 Answers

String literal formatting to the rescue:

import re

x = "aaabbbbcc"
xs = set(x)
for item in xs:
    # raw f-string: the letter is formatted into the pattern, "+" matches the whole run
    mo = re.search(fr"{item}+", x)  # the prefixes fr and rf both work
    flag = item
    m = mo.start()
    n = mo.end()  # exclusive end index
    print(flag, m, n)  # print n - 1 instead to get the inclusive end index

Output:

a 0 3   # you do see that the upper limit is off by 1?
b 3 7   # see above for fix
c 7 9

Your pattern does not need the [] around the letter - you are matching just one anyhow.
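
One caveat when formatting the character straight into the pattern: if the string contains regex metacharacters such as `.` or `+`, the pattern changes meaning or fails to compile. A minimal sketch, assuming you want the inclusive end index the question asks for, wraps the character in `re.escape`; the helper name `first_run` is hypothetical.

```python
import re


def first_run(ch, text):
    """Return (start, inclusive_end) of the first run of ch in text, or None."""
    # re.escape makes metacharacters such as "." or "+" match literally.
    mo = re.search(f"{re.escape(ch)}+", text)
    if mo is None:
        return None
    # mo.end() is exclusive, so subtract 1 to get the index of the last match.
    return mo.start(), mo.end() - 1


x = "aaabbbbcc"
for item in sorted(set(x)):  # sorted only to make the output order stable
    print(item, *first_run(item, x))
```

Like `re.search` in the loop above, this only finds the first run of each character.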


Without regex [1]:

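x = "aaabbbbcc"
last_ch = x[0]  # assumes x is not empty
start_idx = 0
# process the remainder, starting at index 1
for idx, ch in enumerate(x[1:], 1):
    if ch == last_ch:
        continue
    # the run ended at the previous index
    print(last_ch, start_idx, idx - 1)
    last_ch = ch
    start_idx = idx
# emit the final run; len(x) - 1 also works for a single-character string
print(last_ch, start_idx, len(x) - 1)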

output:

a 0 2   # not off by 1
b 3 6
c 7 8
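
The same run-by-run scan can also be written with `itertools.groupby` from the standard library. This is a sketch of an alternative, not part of the loop above; the function name `run_spans_groupby` is made up for illustration.

```python
from itertools import groupby


def run_spans_groupby(text):
    """Return (char, start, inclusive_end) for each run of equal characters."""
    spans = []
    start = 0
    # groupby yields one group per run of consecutive equal characters.
    for ch, group in groupby(text):
        length = sum(1 for _ in group)
        spans.append((ch, start, start + length - 1))
        start += length
    return spans


for ch, start, end in run_spans_groupby("aaabbbbcc"):
    print(ch, start, end)
```

Unlike the loop above, this also copes with an empty string, for which it returns an empty list.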

[1] RegEx: And now you have 2 problems...

Patrick Artner

Looking at the output, I'm guessing that another option would be:

import re
x = "aaabbbbcc"
xs = re.findall(r"((.)\2*)", x)  # each item is (whole run, repeated character)

start = 0
output = ''
for item in xs:
    end = start + len(item[0])  # exclusive end of the current run
    output += f"{item[1]} {start} {end}\n"
    start = end

print(output)

Output

a 0 3
b 3 7
c 7 9
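
If the manual length bookkeeping is not needed, `re.finditer` exposes the offsets of every run directly on the match objects. A minimal sketch with a slightly simpler pattern; the function name `run_spans_finditer` is made up for illustration, and the end index stays exclusive as in the output above.

```python
import re


def run_spans_finditer(text):
    """Return (char, start, exclusive_end) for every run of equal characters."""
    # (.)\1* matches one character followed by any number of repeats of it.
    return [
        (m.group(1), m.start(), m.end())
        for m in re.finditer(r"(.)\1*", text, flags=re.DOTALL)
    ]


for ch, start, end in run_spans_finditer("aaabbbbcc"):
    print(ch, start, end)
```

Subtract 1 from the last value if the inclusive end index is wanted instead.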

I think it runs in O(N) time; you can benchmark it, though, if you like.

import re, time

timer_on = time.time()

for i in range(10000000):
    x = "aabbbbccc"
    xs = re.findall(r"((.)\2*)", x)

    start = 0
    output = '' 
    for item in xs:
        end = start + len(item[0])
        output += (f"{item[1]} {start} {end}\n")
        start = end

timer_off = time.time()

timer_total = timer_off - timer_on

print(timer_total)
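
To get a rough feel for how the running time grows with the input length, the same approach can be timed on inputs of increasing size with the standard `timeit` module. This is only a sketch: the helper names `run_offsets` and `time_for_length`, the input sizes, and the repeat count are all arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import re
import timeit

RUN_PATTERN = re.compile(r"((.)\2*)")


def run_offsets(text):
    """Return (char, start, exclusive_end) tuples using the findall approach."""
    spans = []
    start = 0
    for run, ch in RUN_PATTERN.findall(text):
        end = start + len(run)
        spans.append((ch, start, end))
        start = end
    return spans


def time_for_length(n, repeats=20):
    """Time run_offsets on a made-up input of roughly n characters."""
    text = "aabbbbccc" * (n // 9)
    return timeit.timeit(lambda: run_offsets(text), number=repeats)


for n in (9_000, 18_000, 36_000):
    print(n, f"{time_for_length(n):.4f}s")
```

If the time roughly doubles whenever the input length doubles, that is consistent with linear behaviour on this kind of input.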
Emma