
I want to find substrings in strings written in Urdu. For example, suppose I have the following string and substrings:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"

substring1 = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
substring2 = "Urdu English Translator حاصل کریں - Microsoft Store ur-PK"
substring3 = "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا"
substring4 = "اقوام متحدہ - ویکیپیڈیا"
substring5 = "واقعہ کربلا - آزاد دائرۃ المعارف"
substring6 = "Inaugural Address - Urdu | JFK Library"
substring7 = "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ..."
substring8 = "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ..."

The goal is to check, for each substring, whether its words appear in fullstring, and to select the matching substrings for further processing. In particular, the minimum text that must be present in a selected substring is "آزاد دائرۃ".

In the examples above, substring1, substring3, substring4, and substring5 should be selected (True), whereas the rest should not be selected (False).

I have written the following code to achieve the above given task:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."

# extract the part after the "-" part
s = substring.split("-")[1]
# remove any spaces if they are present
s = s.strip()

if s in fullstring:
   print("Found!")
else:
   print("Not found!")

The code prints Not found! for every substring. It should print Found! for substring1, substring3, substring4, and substring5, and Not found! for all the others.

Please help me achieve the substring search described above.

user.1234
  • please [edit] your question and add expected output for each substring, like "for substring1 - Found for substring2 - ......" I have almost got the solution but don't know that is your expected output or not! – imxitiz Jul 31 '21 at 11:07
  • Edited as per your comment. – user.1234 Jul 31 '21 at 14:20
  • solved problem completely. Inform me if it is working for you or not. :) – imxitiz Jul 31 '21 at 16:11
  • Excellent. Can you please explain the code in the exception handler (try-except) part? Also, an up vote to the question will be greatly appreciated. :-) – user.1234 Jul 31 '21 at 17:26

2 Answers


You should try this:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."

# extract the part after the "-" part
s = substring.split("-")[1]
# remove the trailing dots first, then any surrounding spaces
s = s.replace(".", "").strip()

if s in fullstring:
   print("Found!")
else:
   print("Not found!")

After strip(), s is still "آزاد دائرۃ ...", and fullstring does not contain the trailing "...", so you get Not found!. Removing the dots fixes that.

Alternatively, you can use the .find() method like this:
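To make the cause visible, here is a minimal sketch that compares the value with only whitespace stripped against the value with the dots stripped as well. The variable names only_spaces_stripped and dots_stripped_too are made up for illustration; the strings are the ones from the question.

```python
fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."

# Part after the first "-" (the en dash inside the parentheses is not a hyphen).
tail = substring.split("-")[1]

# strip() with no argument removes whitespace only, so the dots survive.
only_spaces_stripped = tail.strip()

# strip(". ") removes any mix of dots and spaces from both ends.
dots_stripped_too = tail.strip(". ")

print(repr(only_spaces_stripped))
print(only_spaces_stripped in fullstring)
print(dots_stripped_too in fullstring)
```

Printing with repr() is a quick way to spot leftover characters such as dots or spaces before running the membership test.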

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
substring = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."

# extract the part after the "-" part
s = substring.split("-")[1]
# remove the trailing dots and any surrounding spaces
s = s.strip(". ")

if fullstring.find(s)!=-1:
   print("Found!")
else:
   print("Not found!")

For all substrings, you can try this:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"

substring1 = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
substring2 = "Urdu English Translator حاصل کریں - Microsoft Store ur-PK"
substring3 = "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا"
substring4 = "اقوام متحدہ - ویکیپیڈیا"
substring5 = "واقعہ کربلا - آزاد دائرۃ المعارف"
substring6 = "Inaugural Address - Urdu | JFK Library"
substring7 = "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ..."
substring8 = "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ..."
allsub=[substring1,substring2,substring3,substring4,substring5,substring6,substring7,substring8]

for a in allsub:
    try:
        s=a.split("-")[1].strip(". ").strip()
    except IndexError:
        s=a.split("-")[0].strip(". ").strip()
    if fullstring.find(s)!=-1:
        print("Found!")
    else:
        print("Not found!")

Output :

Found!
Not found!
Found!
Found!
Found!
Not found!
Not found!
Not found!

I put all the substrings in a list, allsub, and run the same check on each one. The try-except is there because some substrings contain no "-", so split("-") returns a single-element list and indexing [1] raises an IndexError. With try-except, the except branch runs instead of the error being raised, and the whole string is used.
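If you would rather avoid the exception handler, one possible variation (a sketch, not the code above) uses str.partition, which always returns three parts and reports whether the separator was found. The helper name title_matches is hypothetical. Note two differences from the loop above: partition keeps everything after the first "-" rather than only the piece before the next one, and an empty candidate is rejected, since an empty string is "in" every string.

```python
def title_matches(title, fullstring):
    """Hypothetical helper: True if the text after the first "-" occurs in fullstring."""
    head, sep, tail = title.partition("-")
    # sep is "" when there is no "-", in which case fall back to the whole title.
    candidate = tail if sep else head
    candidate = candidate.strip(". ")
    # An empty candidate would match any string, so reject it explicitly.
    return bool(candidate) and candidate in fullstring


fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"
titles = [
    "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ...",
    "Urdu English Translator حاصل کریں - Microsoft Store ur-PK",
    "واقعہ کربلا - آزاد دائرۃ المعارف",
    "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ...",
]

results = [title_matches(t, fullstring) for t in titles]
print(results)
```

Returning a boolean instead of printing makes it easy to filter the list, for example with [t for t in titles if title_matches(t, fullstring)].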

imxitiz
  • Unfortunately, same answer "Not found!". – user.1234 Jul 31 '21 at 06:27
  • I have tried the code that you had suggested. BTW, I cannot find the .find() function in your suggestion? – user.1234 Jul 31 '21 at 06:35
  • @user.1234 Sorry! my bad, I edited check it. – imxitiz Jul 31 '21 at 06:37
  • AttributeError: 'list' object has no attribute 'find' – user.1234 Jul 31 '21 at 06:41
  • `fullstring.find(s)` is this what you have tried? I think you are confused. I have added complete code in last edit check it. If you are getting that error then, you must have changed `fullstring` into list by doing split or somethings... @user.1234 if you're getting error then just do copy/paste my code and test. – imxitiz Jul 31 '21 at 06:41
  • Inconsistent output. Please check your code against each substring that I have given in the question. – user.1234 Jul 31 '21 at 08:04
  • Are you asking for all subsubstring? Sorry, I have thought for just one! Okay I will check and edited my question. @user.1234 it is working fine for that one substring right? – imxitiz Jul 31 '21 at 08:05
  • @user.1234 and additionally there is some language/character barrier, I am confused, I personally haven't found `آزاد دائرۃ` this in full string also. – imxitiz Jul 31 '21 at 08:15
  • I think there should be something related to character encoding? utf8? I don't know much about how to do that part. – user.1234 Jul 31 '21 at 08:31
  • Your characters are already in unicode so you don't have to do that! – imxitiz Jul 31 '21 at 10:50
  • `s = substring.split("-")[1]` you don't have `-` in your `substring8` then how can you select second element of that `substring8.split("-")`? – imxitiz Jul 31 '21 at 10:52
  • @user.1234 please post your expected output! For each substring. – imxitiz Jul 31 '21 at 11:04
  • Edited as per your comment. – user.1234 Jul 31 '21 at 14:30
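On the encoding question raised in the comments: Python 3 strings are already Unicode, but two Urdu strings can look identical on screen and still differ in code points. This was not confirmed to be the problem in this thread (the trailing dots were), so treat the following only as a diagnostic sketch. The helper name describe is hypothetical, and the two letters are an illustrative pair, not taken from the question.

```python
import unicodedata


def describe(text):
    """Hypothetical helper: list each character's code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


a = "\u06cc"  # ARABIC LETTER FARSI YEH
b = "\u064a"  # ARABIC LETTER YEH, which can render very similarly

print(a == b)
print(describe(a))
print(describe(b))
```

Running describe() on s and on fullstring shows whether a failed match comes from different characters rather than from leftover punctuation.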

Try it:

fullstring = "آزاد دائرۃ المعارف، ویکیپیڈیا"

substring1 = "افریقی نژاد امریکی شہری حقوق کی تحریک (1955–1968) - آزاد دائرۃ ..."
substring2 = "Urdu English Translator حاصل کریں - Microsoft Store ur-PK"
substring3 = "ببر شیر - آزاد دائرۃ المعارف، ویکیپیڈیا"
substring4 = "اقوام متحدہ - ویکیپیڈیا"
substring5 = "واقعہ کربلا - آزاد دائرۃ المعارف"
substring6 = "Inaugural Address - Urdu | JFK Library"
substring7 = "دنیا میں امریکہ کے مقام کے بارے میں صدر بائیڈن کا خطاب - United ..."
substring8 = "ایران امریکہ کشیدگی: امریکی صدور اور جنگوں کی مبہم قانونی ..."
allstrings = (substring1, substring2, substring3, substring4, substring5, substring6, substring7, substring8)
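for a in allstrings:
    try:
        # take the part after the first "-"
        s = a.split("-")[1]
    except IndexError:
        # no "-" in this string, so use the whole string
        s = a
    # remove the dots first, then any surrounding spaces
    s = s.replace(".", "").strip()
    if s in fullstring:
        print("Found!")
    else:
        print("Not found!")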

Output:

Found!
Not found!
Found!
Found!
Found!
Not found!
Not found!
Not found!
wjandrea
Radek Rojík