
I have text containing words that are enclosed by two spaces on each side, like this:

"my_text_is__separated__like_this__example__"

So I want to retrieve 'separated' and 'example'.

I implemented it this way:

    String pattern = "\\s{2}(\\w+)\\s{2}";

    String t = getText();
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(t);
    StringBuilder b = new StringBuilder();
    while (m.find()) {
        b.append(m.group(1) + "xxx\n");
    }
    Log.d("hmmmmm", b.toString());

But it doesn't work (m.find() returns false).


edit: here's my text:

‏‏حَدَّثَنَا ‏ ‏الْحُمَيْدِيُّ عَبْدُ اللَّهِ بْنُ الزُّبَيْرِ ‏ ‏قَالَ حَدَّثَنَا ‏ ‏سُفْيَانُ ‏ ‏قَالَ حَدَّثَنَا ‏ ‏يَحْيَى بْنُ سَعِيدٍ الْأَنْصَارِيُّ ‏ ‏قَالَ أَخْبَرَنِي ‏ ‏مُحَمَّدُ بْنُ إِبْرَاهِيمَ التَّيْمِيُّ ‏ ‏أَنَّهُ سَمِعَ ‏ ‏عَلْقَمَةَ بْنَ وَقَّاصٍ اللَّيْثِيَّ ‏ ‏يَقُولُ سَمِعْتُ ‏ ‏عُمَرَ بْنَ الْخَطَّابِ ‏ ‏رَضِيَ اللَّهُ عَنْهُ ‏ ‏عَلَى الْمِنْبَرِ ‏ ‏قَالَ سَمِعْتُ رَسُولَ اللَّهِ ‏ ‏صَلَّى اللَّهُ عَلَيْهِ وَسَلَّمَ ‏ ‏يَقُولُ ‏ ‏إِنَّمَا الْأَعْمَالُ ‏ ‏بِالنِّيَّاتِ ‏ ‏وَإِنَّمَا لِكُلِّ امْرِئٍ مَا نَوَى فَمَنْ كَانَتْ هِجْرَتُهُ إِلَى دُنْيَا ‏ ‏يُصِيبُهَا ‏ ‏أَوْ إِلَى امْرَأَةٍ يَنْكِحُهَا فَهِجْرَتُهُ إِلَى مَا هَاجَرَ إِلَيْهِ‏.

'سُفْيَانُ' and '‏بِالنِّيَّاتِ' for example should be among the outputs


note: in the example, I replaced the spaces with (_) so it becomes more visible.

note: my text is in Arabic.

edit: turns out it was not separated with double spaces, see the answer below.

mhashim6

1 Answer


Java's Pattern defines the "word character" class \w as [a-zA-Z_0-9] by default, so Arabic text won't match (side note: European accented letters do not match either, e.g. "éèö").

According to this answer you can use [\u0600-\u06FF] for Arabic instead of \w.
According to that answer you can use \p{InArabic}, which seems better.
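
To make the two alternatives concrete, here is a minimal sketch comparing the explicit range with the named block on one word from the question. The class and method names are made up for illustration, and the sample word is written with Unicode escapes so that no invisible characters sneak into the source.

```java
import java.util.regex.Pattern;

public class ArabicClassDemo {
    // Explicit range covering the Arabic block, U+0600 through U+06FF.
    static final Pattern RANGE = Pattern.compile("[\u0600-\u06FF]+");
    // Named Unicode block; in Java it covers the same range of code points.
    static final Pattern BLOCK = Pattern.compile("\\p{InArabic}+");

    // True only if the whole string matches both patterns.
    static boolean matchesBoth(String s) {
        return RANGE.matcher(s).matches() && BLOCK.matcher(s).matches();
    }

    public static void main(String[] args) {
        // One of the expected words from the question, letters plus diacritics.
        String word = "\u0633\u064F\u0641\u0652\u064A\u064E\u0627\u0646\u064F";
        System.out.println(matchesBoth(word));  // true
        System.out.println(matchesBoth("hey")); // false
    }
}
```

Both forms accept the diacritics as well, since those marks sit inside the same block as the letters.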

In addition, your text snippet does not contain 2 consecutive whitespace characters, so \s{2} won't find any match. What appears to the eye as "double spaces" is actually a space followed by the Unicode right-to-left mark (U+200F). That can be matched with \\s\\x{200f}(\\p{InArabic}+)\\s\\x{200f}. Example:

    System.out.println(Arrays.toString(new boolean[] {
            "###  hey  ###".matches(".*\\s{2}\\w+\\s{2}.*"),
            "###  tût  ###".matches(".*\\s{2}\\w+\\s{2}.*"),
            "###  لتَّيْم  ###".matches(".*\\s{2}\\w+\\s{2}.*"),
            "###  لتَّيْم  ###".matches(".*\\s{2}\\p{InArabic}+\\s{2}.*")
    }));
    Matcher matcher = Pattern.compile("\\s\\x{200f}(\\p{InArabic}+)\\s\\x{200f}").matcher(getText());
    while (matcher.find()) {
        System.out.println(matcher.group(1));
    }

With getText() returning your text snippet, this prints:

[true, false, false, true]
سُفْيَانُ
يَقُولُ
بِالنِّيَّاتِ
يُصِيبُهَا

Now I'm not sure if it's a good thing to expect your text to contain such markers around specific words, and to explicitly match for that :-/
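
Before relying on such markers, it may help to check which invisible characters actually surround the words. The following is a rough sketch that prints every code point of a string in U+XXXX form. The class name, the method name and the sample string are hypothetical, and String.codePoints() assumes Java 8 or later (on Android, API level 24 or later).

```java
public class CodePointDump {
    // Returns each code point of s as U+XXXX, separated by single spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: a space, a right-to-left mark, then one Arabic letter.
        String sample = " \u200F\u0633";
        System.out.println(dump(sample)); // U+0020 U+200F U+0633
    }
}
```

Running dump() on a short slice of the real text, for example the characters between two words, shows whether the separator is a plain double space or a space followed by U+200F.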

I also don't know how regexr.com works: I thought \w in JavaScript meant the same as in Java (and I see no network round trip, so it must be implemented in JS, though probably with some layer of transformation). Even their own embedded doc says this about \w:

Matches any word character (alphanumeric & underscore). Only matches low-ascii characters (no accented or non-roman characters). Equivalent to [A-Za-z0-9_]

Hugues M.
  • (sorry there was a problem with my connection) so I did the following: String pattern = "\\s{2}([\\u0600-\\u06FF]+)\\s{2}"; it didn't work either, ALSO, regexr.com didn't even accept my Arabic text, but it worked fine with english text. – mhashim6 Jul 10 '17 at 18:06
  • See my edit, for unicode character class you need a single backslash before `u`, so `\u` – Hugues M. Jul 10 '17 at 18:09
  • it didn't work, is it because almost every word starts/ends with a diacritic letter? – mhashim6 Jul 10 '17 at 18:16
  • Interesting, I see that too. Your text does not contain 2 subsequent spaces, there are other invisible characters in between. See [this runnable example](http://ideone.com/ohtJ1Y), where I added more spaces manually, it has only 1 match. Click on "edit", see the weird dots in place of spaces. If I save your text to a file, then `grep -E "\s\s"` finds no match. Those are probably the diacritic combination characters you speak of. No idea, sorry. – Hugues M. Jul 10 '17 at 19:45
  • @mh6 Please check new edit, for 2 new things: a) there is a nicer way with `\p{InArabic}` ---- b) maybe what you are after is `\\b\\s(\\p{InArabic}+)\\s\\b`, see updated example. – Hugues M. Jul 11 '17 at 09:42
  • it kinda returns the opposite output! :) though i'm starting to doubt that these are double spaces, as Ideone is displaying them as weird dots [too](https://ideone.com/wJTjDf) (in edit mode/?). – mhashim6 Jul 11 '17 at 15:01
  • I'm pretty sure those are not double spaces, if you want to isolate those you'll need to figure out what exactly they are, and then figure out an appropriate filter. Sorry my knowledge of Arabic is limited to pretty much 0 :) – Hugues M. Jul 11 '17 at 15:04
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148914/discussion-between-mh6-and-hugues-m). – mhashim6 Jul 11 '17 at 15:17