6

In C++, I have a text file that contains Arabic text like:

شكلك بتعرف تقرأ عربي يا ابن الذين

and I want to read each line of this file into a string, use string functions on it (substr, length, at, etc.), and then print parts of it to an output file.

I tried, but it prints garbage characters such as "\'c7\'e1\'de\'d1\". Is there a library that supports Arabic characters?

Edit: adding the code:

#include <iostream>
#include <fstream>
using namespace std;
int main(){
  ifstream ip;
  ip.open("d.rtf");
  if(ip.is_open() != true){
    cout<<"open failed"<<endl;
    return 0;
  }
  string l;
  while(!ip.eof()){
    getline(ip, l);
    cout<<l<<endl;
  }

  return 0;
}

Note: I still need to add some processing code like

if(l == "كلام بالعربي"){
    string s = l.substr(0, 4);       
    cout<<s<<" is what you are looking for"<<endl;
 }
CSawy
  • 1
    You're asking the wrong question. C++ (and any other language) couldn't care less whether the file contains Arabic, English, French, or Klingon; what matters is the character set. You need to figure out which charset your file is in (e.g. UTF-8?) – Marc B May 26 '14 at 14:19
  • 1
    Not really answering your question but still: do **not** use `eof()` to detect the end of the loop! You **always** need to test **after** reading if the read was successful: `while (std::getline(ip, l)) { ... }` – Dietmar Kühl May 26 '14 at 14:43

4 Answers

2

You need to find out which text encoding the file uses. For example, to read a UTF-8 file into wide characters (wchar_t), you can do the following (C++11):

std::wifstream fin("text.txt");
fin.imbue(std::locale("en_US.UTF-8"));
std::wstring line;
std::getline(fin, line);
std::wcout << line << std::endl;
vz0
  • 1
    Isn't `imbue()`ing the locale after the file is opened too late? After all, the open may already fill the buffer which would use the wrong `std::codecvt<...>` facet. Also, who knows what [unchangable] encoding `std::cout` or `std::wcout` use... – Dietmar Kühl May 26 '14 at 14:45
0

The best way to deal with this, in my opinion, is to use a Unicode helper library. Strings in C, and even in C++, are just arrays of bytes. When you call, for example, strlen() [C] or somestring.length() [C++], you get the number of bytes in the string, not the number of characters.
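To make the byte-versus-character gap concrete, here is a minimal sketch, assuming the string holds well-formed UTF-8. `count_code_points` and `demo_length` are hypothetical names used only for illustration, and the sample bytes spell the first three letters of the question's line (شكل).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumption: the input is well-formed
// UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Illustration only: three Arabic letters, two bytes each in UTF-8.
void demo_length()
{
    const std::string word = "\xD8\xB4\xD9\x83\xD9\x84";
    assert(word.length() == 6);            // bytes
    assert(count_code_points(word) == 3);  // code points
}
```

Note that a code point is still not always what a reader would call a character: Arabic diacritics, for example, are separate code points that combine with the preceding letter.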

Some auxiliary functions, such as mbstowcs(), can help with this, but in my opinion they are rather old and hard to use.

Another way is to use C++11, which in theory supports many things related to UTF-8. But I have never seen it work perfectly, at least when you need to be multi-platform.
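One piece of C++11 that does not depend on the platform's locale setup is std::u32string. As a rough sketch, assuming well-formed UTF-8 input and skipping all validation, the bytes can be decoded by hand into code points, sliced with the usual substr, and encoded back. `decode_utf8`, `encode_utf8`, `first_code_points`, and `demo_prefix` are hypothetical names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into code points. Assumption: well-formed input; there
// is no check for bad continuation bytes, overlong forms, or surrogates.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // continuation bytes following the lead byte
        char32_t cp = lead;     // single-byte (ASCII) case
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        ++i;
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Encodes code points back into UTF-8 bytes.
std::string encode_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Returns the first `n` code points of a UTF-8 string, still as UTF-8.
std::string first_code_points(const std::string& utf8, std::size_t n)
{
    return encode_utf8(decode_utf8(utf8).substr(0, n));
}

// Illustration only: the bytes spell "شكلك ب" (six code points).
void demo_prefix()
{
    const std::string line = "\xD8\xB4\xD9\x83\xD9\x84\xD9\x83\x20\xD8\xA8";
    assert(decode_utf8(line).size() == 6);
    assert(first_code_points(line, 4) == "\xD8\xB4\xD9\x83\xD9\x84\xD9\x83");
}
```

This covers length, substr, and at on code points only. Anything beyond that (normalization, case mapping, collation) is where a dedicated Unicode library earns its keep.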

The best solution I have found is the ICU library. With it I can work with UTF-8 strings easily, with the same "charm" as a regular std::string. You get a string class with methods for length, substrings, and so on, and it is very portable. I use it on Windows, Mac, and Linux.

Wagner Patriota
  • Thanks, I will try it now. I can't find a download for Mac OS X 10.8. Is it under Red Hat Linux? http://site.icu-project.org/download/53#TOC-ICU4C-Download – CSawy May 26 '14 at 16:51
  • You are looking at the binaries. Get the source code instead: http://download.icu-project.org/files/icu4c/53.1/icu4c-53_1-src.tgz – Wagner Patriota May 26 '14 at 17:48
0

You can use Qt too.

A simple example:

#include <QDebug>
#include <QTextStream>
#include <QFile>
int main()
{
   QFile file("test.txt");
   if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
       return 1; // could not open the file
   QTextStream stream(&file);
   QString text=stream.readAll();
   if(text == "شكلك بتعرف تقرأ عربي يا ابن الذين")
       qDebug()<<",,,, ";
}
uchar
0

It is better to process Arabic text line by line. To read all lines of Arabic text from a file, try this:

std::wifstream fin("arabictext.txt");
fin.imbue(std::locale("en_US.UTF-8"));
std::wstring line;
std::wstring text;

while (std::getline(fin, line))
{
    text += line + L"\n";
}
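The loop above relies on the en_US.UTF-8 locale being installed; std::locale throws std::runtime_error when the named locale is not available. If the lines only need to be compared and copied, not sliced by character, a narrow-stream variant that treats UTF-8 as opaque bytes may be enough. The following is a sketch under two assumptions: the input is UTF-8, and the text being searched for is stored as the same UTF-8 byte sequence (same normalization). `copy_matching_lines` and `demo_filter` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies every line of `in` that equals `wanted`, byte for byte, to `out`
// and returns the number of matching lines. The streams never interpret
// the bytes, so UTF-8 text passes through unchanged.
std::size_t copy_matching_lines(std::istream& in, std::ostream& out,
                                const std::string& wanted)
{
    std::size_t matches = 0;
    std::string line;
    while (std::getline(in, line)) {  // test the read itself, not eof()
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();          // tolerate Windows line endings
        }
        if (line == wanted) {
            out << line << '\n';
            ++matches;
        }
    }
    return matches;
}

// Illustration only: the escaped bytes spell "كلام".
void demo_filter()
{
    const std::string wanted = "\xD9\x83\xD9\x84\xD8\xA7\xD9\x85";
    std::istringstream in("first\n" + wanted + "\nlast\n");
    std::ostringstream out;
    assert(copy_matching_lines(in, out, wanted) == 1);
    assert(out.str() == wanted + "\n");
}
```

The same function works with std::ifstream and std::ofstream, since both derive from the stream types used in the signature.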