I am building a language analysis program. I have a program that counts the words in a text and outputs the ratio of each word in the text, but it does not work on files containing Urdu text. How can I make it work?

1 Answer

Encoding

Urdu may be presented in two¹ forms: Unicode and Code Page 868. This is convenient for you because the two ranges do not overlap. It is inconvenient because the Unicode code range is U+0600 – U+06FF, so how each character is stored depends on the encoding:

  • CP-868 will encode each one as a single-byte value in the range 128–252
  • UTF-8 will encode each one as a two-byte sequence with bits 110x xxxx and 10xx xxxx
  • UTF-16 will encode each one as a single two-byte code unit
  • UTF-32 will encode each one as a single four-byte code unit

This means that you should be aware of encoding issues, and for an easy life, use UTF-16 internally (std::u16string), and accept files as (default) UTF-8 / CP-868, or as UTF-16/32 if there is a BOM indicating such.
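As a rough sketch of the "check for a BOM, otherwise assume UTF-8" idea, something like the following could work. The names `Encoding`, `detect_bom` and `utf8_to_utf16` are made up for this example, the decoder does not reject overlong forms, and CP-868 input would still need its own conversion table, which is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encodings that can be told apart by a byte-order mark (BOM).
enum class Encoding { Unknown, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Look at the first bytes of the file. Unknown means "no BOM found", in which
// case the caller falls back to its default (UTF-8 or CP-868).
Encoding detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) -> int {
        return i < bytes.size() ? static_cast<unsigned char>(bytes[i]) : -1;
    };
    // The UTF-32 LE mark starts with the UTF-16 LE mark, so test it first.
    if (b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return Encoding::Utf32LE;
    if (b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return Encoding::Utf32BE;
    if (b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Encoding::Utf8;
    if (b(0) == 0xFF && b(1) == 0xFE) return Encoding::Utf16LE;
    if (b(0) == 0xFE && b(1) == 0xFF) return Encoding::Utf16BE;
    return Encoding::Unknown;
}

// Decode UTF-8 bytes into UTF-16 code units. Malformed or truncated
// sequences are replaced with U+FFFD, one byte at a time.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        bool ok = true;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110x xxxx
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else ok = false;
        if (ok && i + extra >= in.size()) ok = false;  // truncated sequence
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;        // must be 10xx xxxx
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(u'\uFFFD'); ++i; continue; }
        i += extra + 1;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the two bytes 0xD8 0xA7 decode to the single code unit U+0627. If `detect_bom` reports UTF-8, skip the three BOM bytes before calling `utf8_to_utf16`.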

Your other option is to simply require all input to be UTF-8 / CP-868.

¹ AFAIK. Correction: there are at least three forms. See the comments below.

Word separation

As you know, the end of a word is generally marked with a special letter form.

So, all you need is a table of end-of-word letters listing letters in both the CP-868 range and the Unicode Arabic text range.

Then, every time you find a space or a letter in that table you know you have found the end of a word.
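A minimal sketch of that loop, assuming the text has already been converted to UTF-16. The function name `split_words` and the `word_final` parameter are invented for this example. The table itself is left to the caller, since which code units belong in it depends on the encoding you started from.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Split text into words. A word ends at a space-like code unit (which is
// dropped) or right after a code unit listed in `word_final` (which is kept
// as the last letter of the word).
// `word_final` is a hypothetical, caller-supplied table of end-of-word letters.
std::vector<std::u16string> split_words(
    const std::u16string& text,
    const std::unordered_set<char16_t>& word_final) {
    std::vector<std::u16string> words;
    std::u16string current;
    auto flush = [&] {
        if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    };
    for (char16_t c : text) {
        bool space = (c == u' ' || c == u'\t' || c == u'\n' || c == u'\r');
        if (space) { flush(); continue; }
        current.push_back(c);
        if (word_final.count(c)) flush();  // end-of-word letter closes the word
    }
    flush();  // last word, if the text does not end with a separator
    return words;
}
```

With a placeholder table containing only `Z`, the input `abZcd ef` splits into `abZ`, `cd` and `ef`.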

Histogram

As you read words, store them in a histogram. For C++ a std::map<std::u16string, std::size_t> will do. The actual content of each word does not matter, only whether two words compare equal.

After that you have all the information necessary to print stats about the text.
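Something along these lines should be enough for the counting and the ratios asked about in the question. `Histogram`, `count_words` and `word_ratios` are names chosen for this sketch, and "ratio" is assumed to mean a word's count divided by the total number of words.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Histogram = std::map<std::u16string, std::size_t>;

// Count how often each word occurs.
Histogram count_words(const std::vector<std::u16string>& words) {
    Histogram h;
    for (const auto& w : words) ++h[w];  // operator[] starts new words at 0
    return h;
}

// Ratio of each word: its count divided by the total number of words.
// Returns an empty map for an empty histogram.
std::map<std::u16string, double> word_ratios(const Histogram& h) {
    std::size_t total = 0;
    for (const auto& kv : h) total += kv.second;
    std::map<std::u16string, double> ratios;
    if (total == 0) return ratios;
    for (const auto& kv : h)
        ratios[kv.first] = static_cast<double>(kv.second) / static_cast<double>(total);
    return ratios;
}
```

To print the result you will need to convert each word back to whatever encoding your terminal or output file expects.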


Edit

The approach presented above is designed to be simple at the cost of some correctness. If you are doing something for the workplace, for example, and assuming it matters, you should also consider:

Normalizing word forms

For example, the same word may be presented in standard Arabic text codes or using the Urdu-specific codes. If you do not convert to the Urdu equivalent characters then you will have two words that should compare equal but do not.

Use something internally consistent. I recommend UZT, as it is the most complete Urdu text representation. You will also need an additional lookup for the original text representation from the UZT representation.
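To illustrate the idea only, here is a sketch that stays within Unicode instead of converting to UZT (an assumption made to keep the example short). The function name `normalize_urdu` is made up, and the table holds just two pairs. A real table would need many more entries.

```cpp
#include <string>
#include <unordered_map>

// Replace standard Arabic letters with the code points Urdu text normally
// uses, so that both spellings of a word compare equal.
// The table is illustrative and deliberately incomplete.
std::u16string normalize_urdu(std::u16string word) {
    static const std::unordered_map<char16_t, char16_t> table = {
        {u'\u064A', u'\u06CC'},  // ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
        {u'\u0643', u'\u06A9'},  // ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    };
    for (char16_t& c : word) {
        auto it = table.find(c);
        if (it != table.end()) c = it->second;
    }
    return word;
}
```

Normalize every word before it goes into the histogram, otherwise the two spellings are counted separately.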

Dictionaries

Get as complete a dictionary of Urdu words as you can, and store it as a std::unordered_set<std::u16string>.

This is how it is done with languages like Japanese, for example, to find breaks between words.

Then use the dictionary to find all the words you can, and fall back on letterform recognition and/or spaces for what remains.
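One simple way to do the dictionary pass is greedy longest match. That is my choice for this sketch, not the only option. The function name `segment` and the `max_word_len` parameter are invented here. Anything the dictionary cannot match is collected into a separate chunk, which you can then hand to the letterform/space fallback.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match segmentation of a run of text.
// At each position, take the longest dictionary word that starts there.
// Code units that start no dictionary word are gathered into "unknown"
// chunks, emitted in order, for a fallback splitter to handle.
std::vector<std::u16string> segment(
    const std::u16string& run,
    const std::unordered_set<std::u16string>& dictionary,
    std::size_t max_word_len) {
    std::vector<std::u16string> out;
    std::u16string unknown;
    std::size_t i = 0;
    while (i < run.size()) {
        std::size_t len = std::min(max_word_len, run.size() - i);
        for (; len > 0; --len)
            if (dictionary.count(run.substr(i, len))) break;
        if (len == 0) {  // nothing matched at this position
            unknown.push_back(run[i]);
            ++i;
            continue;
        }
        if (!unknown.empty()) {
            out.push_back(unknown);
            unknown.clear();
        }
        out.push_back(run.substr(i, len));
        i += len;
    }
    if (!unknown.empty()) out.push_back(unknown);
    return out;
}
```

Greedy matching can pick the wrong split when a long word happens to start with a shorter one, so treat the result as a first approximation.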

Dúthomhas
  • https://en.wikipedia.org/wiki/Urdu_alphabet#Computers_and_the_Urdu_alphabet lists some more encodings for Urdu, but I have no idea what the common ones are. – user17732522 Feb 24 '22 at 07:05
  • Well, crud. That would make some analysis of the text required to distinguish UZT from CP-868. – Dúthomhas Feb 24 '22 at 07:16