How can I split Arabic words into their connected ligatures in SQL Server? For example:

أخبارى

أ - خبا - ر - ى

أخذتهم

أ -  خذ - تهم

I have tried many solutions, but they all rely on spaces or some other delimiter. In my case the words contain no spaces.

iamdave
Asjal Rana
  • 1
    Put your sample data together, and then an expected output of that sample data to visualise your logic please. – Matt Oct 10 '17 at 13:21
  • @Matt Are the examples in the question not both the sample data and expected output? – iamdave Oct 10 '17 at 13:48
  • @iamdave I did think that, but wanted OP to confirm – Matt Oct 10 '17 at 13:50
  • I have updated questions with sample data and its output in next column. – Asjal Rana Oct 10 '17 at 13:52
  • Out of curiosity, why do you need to split the words up? – Mazhar Oct 10 '17 at 13:55
  • Dear @Cool_Br33ze, it's a university project. I need this data, and I have 1 million words to split. – Asjal Rana Oct 10 '17 at 14:08
  • 1
    Do you have a list of all the possible Ligatures that you could search for within the words you need to split? – iamdave Oct 10 '17 at 14:11
  • I don't have a separate list of ligatures. That's why I am seeking help from experts. – Asjal Rana Oct 10 '17 at 14:24
  • Text processing is *not* one of T-SQL's particular strengths. Any reason why you're doing this, specifically, down in the database? – Damien_The_Unbeliever Oct 10 '17 at 14:27
  • @Damien_The_Unbeliever I have to store the parts in separate columns to reduce the dictionary size from more than a million entries to a few thousand. – Asjal Rana Oct 10 '17 at 15:22
  • @iamdave No, I do not have one; otherwise I could easily split the connected words. – Asjal Rana Oct 10 '17 at 15:24
  • @AsjalRana What is stopping you getting such a list? – iamdave Oct 10 '17 at 15:30
  • @iamdave Can you prepare such a list? How can I do that? I downloaded a complete dictionary, and it doesn't have the words divided into parts. Arabic has connected segments within words, which are in turn made up of characters. For example, أ خبر ه can be further divided into أ خ ب ر ه, the individual characters. – Asjal Rana Oct 11 '17 at 06:24
  • @AsjalRana Can I prepare it for you? No. Unless you're willing to pay me (or these people: https://link.springer.com/article/10.1007/s13735-017-0127-x) to do so. If you can't find your list of possible Ligatures this seems an impossible task without further delving into how they are handled within the Unicode spec – iamdave Oct 11 '17 at 08:05
  • @AsjalRana Also, per the question here: https://stackoverflow.com/questions/7803960/arabic-source-unicode-to-final-display-unicode there is no way to actually detect the joining of the characters from the data, as this is handled by the rendering engine that displays the characters. Within Unicode, both your joined and separated strings are made up of the exact same Unicode characters; they are just rendered differently. It seems the only way you could achieve what you want is with a database of all possible combinations and a very slow lookup function. – iamdave Oct 11 '17 at 08:13
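
The point in the last comment, that the stored string holds only plain letters and no information about their joined forms, can be checked by listing the code points of a word. This is a minimal sketch, not a definitive method: `sys.all_objects` is used here only as a convenient row source for generating positions, and the column aliases are my own.

```sql
-- Sample word from the question
DECLARE @word NVARCHAR(100) = N'أخبارى';

-- Generate one row per character position (sys.all_objects is just a row source)
WITH positions AS (
    SELECT TOP (LEN(@word))
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS pos
    FROM sys.all_objects
)
-- Show each stored character next to its Unicode code point
SELECT pos,
       SUBSTRING(@word, pos, 1)          AS letter,
       UNICODE(SUBSTRING(@word, pos, 1)) AS code_point
FROM positions
ORDER BY pos;
```

Each row shows a base letter; the initial, medial, and final shapes seen on screen are chosen at display time and do not appear in the data.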

1 Answer

This is very rudimentary and should only be used as a starting point.

It searches for each ligature and replaces it with the same ligature followed by a space.

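-- Sample word from the question
DECLARE @word NVARCHAR(100) = N'أخبارى';
-- Length and value before splitting
SELECT LEN(@word), @word;
-- Append a space after each listed ligature so the word breaks at those points
SELECT REPLACE(REPLACE(REPLACE(REPLACE(@word, N'أ', N'أ '), N'ى', N'ى '), N'ر', N'ر '), N'خب', N'خب ');
-- Length after splitting (note that LEN ignores trailing spaces)
SELECT LEN(REPLACE(REPLACE(REPLACE(REPLACE(@word, N'أ', N'أ '), N'ى', N'ى '), N'ر', N'ر '), N'خب', N'خب '));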

You can create a table with all possible ligatures and query it using dynamic SQL, following the pattern above. I will provide an example to show what I mean.
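
The table-driven idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's own example: the table variable `@breakers` and its column `letter` are hypothetical names, the nine letters are the ones the asker lists in the comments below, and a cursor loop is swapped in for dynamic SQL to keep the sketch short. The `Latin1_General_BIN2` collation is my choice so that letters such as أ, آ and ا are matched as distinct code points regardless of the database collation.

```sql
-- Hypothetical lookup table: letters after which a connected segment ends
DECLARE @breakers TABLE (
    letter NCHAR(1) COLLATE Latin1_General_BIN2 NOT NULL PRIMARY KEY
);
INSERT INTO @breakers (letter)
VALUES (N'أ'), (N'ى'), (N'ر'), (N'آ'), (N'ا'), (N'و'), (N'ذ'), (N'د'), (N'ز');

DECLARE @word   NVARCHAR(100) = N'أخبارى';
DECLARE @letter NCHAR(1);

-- Walk the lookup table and append a space after every occurrence of each letter
DECLARE breaker_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT letter FROM @breakers;
OPEN breaker_cursor;
FETCH NEXT FROM breaker_cursor INTO @letter;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Binary collation keeps the match exact, code point by code point
    SET @word = REPLACE(@word COLLATE Latin1_General_BIN2, @letter, @letter + N' ');
    FETCH NEXT FROM breaker_cursor INTO @letter;
END
CLOSE breaker_cursor;
DEALLOCATE breaker_cursor;

-- Drop the trailing space left when the word ends in a listed letter
SELECT RTRIM(@word) AS split_word;
```

For the first word in the question this is intended to give the segments أ, خبا, ر and ى separated by spaces. The letter list is only as complete as the table; other letters that do not join to the following letter, such as إ, ؤ and ة, would need to be added for full coverage.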

Mazhar
  • I think I get your idea, but I do not have the Arabic words broken into parts to use in the replacements and get the desired result. – Asjal Rana Oct 10 '17 at 15:23
  • 3
    Code-only answers are discouraged because they do not explain how they resolve the issue in the question. Please update your answer to explain what this does and how it addresses the problem. See [How do I write a good answer](https://stackoverflow.com/help/how-to-answer) – FluffyKitten Oct 11 '17 at 10:14
  • 1
    Thank you for this code snippet, which might provide some limited, immediate help. A proper explanation [would greatly improve](//meta.stackexchange.com/q/114762) its long-term value by showing *why* this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please [edit] your answer to add some explanation, including the assumptions you've made. – Toby Speight Oct 11 '17 at 12:09
  • @Cool_Br33ze Within Arabic words, a connected segment ends only with the following letters, so I was able to solve it using your logic: select REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@words, N'أ', N'أ '), N'ى', N'ى '), N'ر', N'ر ' ), N'آ', N'آ '),N'ا',N'ا '),N'و',N'و '),N'ذ',N'ذ '),N'د',N'د '),N'ز',N'ز ') – Asjal Rana Oct 11 '17 at 12:52
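
The asker's nested `REPLACE` can also be run over a whole column rather than a single variable. The following is a hedged sketch of that: the table variable `@words` and the alias `segments` are hypothetical stand-ins for the real dictionary table, and the `COLLATE Latin1_General_BIN2` clause and the `RTRIM` are my additions (exact code point matching, and removal of the trailing space).

```sql
-- Hypothetical sample table; the real data would be the dictionary table
DECLARE @words TABLE (word NVARCHAR(100) NOT NULL);
INSERT INTO @words (word) VALUES (N'أخبارى'), (N'أخذتهم');

-- Append a space after each letter from the asker's list, then trim the end
SELECT
    w.word,
    RTRIM(
        REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
            w.word COLLATE Latin1_General_BIN2,
            N'أ', N'أ '), N'ى', N'ى '), N'ر', N'ر '), N'آ', N'آ '), N'ا', N'ا '),
            N'و', N'و '), N'ذ', N'ذ '), N'د', N'د '), N'ز', N'ز ')
    ) AS segments
FROM @words AS w;
```

Because each pattern is a single letter, the order of the replacements does not matter, and one pass over the table handles every row without a loop.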