You need to include the specific character ranges for Arabic and Persian characters. \w can be expressed as [A-Za-z0-9_]. You can include any character range in that same character class.
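As a small illustration of that idea, the sketch below writes \w out as an explicit character class and appends the basic Arabic block to it. The name `wordWithArabic` is made up for this example, and JavaScript regex syntax is assumed because the pattern further down uses it.

```javascript
// \w written out explicitly, plus the basic Arabic block (U+0600–U+06FF).
// `wordWithArabic` is a hypothetical name used only for this sketch.
const wordWithArabic = /^[A-Za-z0-9_\u0600-\u06FF]+$/;

console.log(wordWithArabic.test("hello_123")); // true
console.log(wordWithArabic.test("سلام"));      // true
console.log(/^\w+$/.test("سلام"));             // false: plain \w is ASCII-only in JavaScript
```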
From Arabic script in Unicode:
1. Arabic (0600–06FF, 255 characters)
   1.1. Arabic-Indic Digits (0660–0669)
   1.2. Extended Arabic-Indic Digits (06F0–06F9)
2. Arabic Supplement (0750–077F, 48 characters)
3. Arabic Extended-A (08A0–08FF, 50 characters)
4. Arabic Presentation Forms-A (FB50–FDFF, 611 characters)
5. Arabic Presentation Forms-B (FE70–FEFF, 140 characters)
6. Rumi Numeral Symbols (10E60–10E7F, 31 characters)
7. Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)
The basic Arabic range encodes the standard letters, the most common
diacritics, and the Arabic-Indic digits, but does not encode contextual
forms (U+0621–U+0652 are directly based on ISO 8859-6).
The Arabic Supplement range encodes letter
variants mostly used for writing African (non-Arabic) languages. The
Arabic Extended-A range encodes additional Qur'anic annotations and
letter variants used for various non-Arabic languages. The Arabic
Presentation Forms-A range encodes contextual forms and ligatures of
letter variants needed for Persian, Urdu, Sindhi and Central Asian
languages. The Arabic Presentation Forms-B range encodes spacing forms
of Arabic diacritics, and more contextual letter forms. The
presentation forms are present only for compatibility with older
standards, and are not currently needed for coding text. The Arabic
Mathematical Alphabetic Symbols block encodes characters used in
Arabic mathematical expressions.
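To make the ranges concrete, here is a hedged sketch of how they could be written in a JavaScript character class. The two blocks above U+FFFF (Rumi Numeral Symbols and Arabic Mathematical Alphabetic Symbols) cannot use the four-digit \uXXXX escape; they need the \u{...} form together with the `u` flag. The constant names are invented for this example.

```javascript
// Blocks in the Basic Multilingual Plane fit the four-digit \uXXXX escape.
// The presentation-form blocks are left out, since the text above says
// they are not needed for coding text.
const arabicBmp = /[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]/;

// Code points above U+FFFF need \u{...} and the `u` flag.
const arabicAstral = /[\u{10E60}-\u{10E7F}\u{1EE00}-\u{1EEFF}]/u;

console.log(arabicBmp.test("\u0628"));       // true: ARABIC LETTER BEH
console.log(arabicAstral.test("\u{1EE00}")); // true: ARABIC MATHEMATICAL ALEF
console.log(arabicBmp.test("\u{1EE00}"));    // false: outside the BMP ranges
```

Without the `u` flag, an escape such as \u10E60 would be read as \u10E6 followed by a literal 0, so the astral ranges would silently match the wrong characters.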
I think you should include:
- In \w: Arabic (0600–06FF) and Arabic Extended-A (08A0–08FF)
- In \d: Arabic-Indic Digits (0660–0669)
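Spelled out as character classes, that suggestion might look like the sketch below. The names `wordChars` and `digitChars` are hypothetical; the ranges are taken from the block list above.

```javascript
// \w plus Arabic (U+0600–U+06FF) and Arabic Extended-A (U+08A0–U+08FF).
const wordChars = /^[A-Za-z0-9_\u0600-\u06FF\u08A0-\u08FF]+$/;

// \d plus Arabic-Indic Digits (U+0660–U+0669).
const digitChars = /^[0-9\u0660-\u0669]+$/;

console.log(wordChars.test("کتاب")); // true: Persian word, all letters in U+0600–U+06FF
console.log(digitChars.test("١٢٣")); // true: Arabic-Indic digits
console.log(digitChars.test("۱۲۳")); // false: Extended Arabic-Indic digits are U+06F0–U+06F9
```

If Persian-style digits should count as digits too, the Extended Arabic-Indic range (06F0–06F9) would have to be added to the digit class as well.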
I believe this would include English, Arabic and Persian:
/(\w+:\/\/)?([-.a-z0-9_\u0600-\u06FF\u08A0-\u08FF]+)(\.\w+)(:\d{1,5})?(\/\S*)?/i
- I am assuming you can't have Arabic characters in the protocol, the extension and the port number, only in the domain.
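A quick way to sanity-check the pattern is to run it against a couple of sample URLs. The hosts below are made up for illustration, and `urlPattern` is just a name I am giving the regex above.

```javascript
// The pattern from this answer, unchanged.
const urlPattern = /(\w+:\/\/)?([-.a-z0-9_\u0600-\u06FF\u08A0-\u08FF]+)(\.\w+)(:\d{1,5})?(\/\S*)?/i;

// Hypothetical sample URLs: one ASCII, one with an Arabic-script domain label.
const samples = ["http://example.com:8080/path", "http://مثال.com/صفحة"];

for (const url of samples) {
  const match = url.match(urlPattern);
  // Group 2 is the domain, group 3 the extension, group 4 the optional port.
  console.log(match && { domain: match[2], extension: match[3], port: match[4] });
}
```

Because the extension group is (\.\w+), an Arabic-script extension would not be captured as the extension, which is consistent with the assumption stated above.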