Regex failing to match the punycode url

Question

I have a URL whose Punycode form contains a label starting with xn---- (four hyphens), and the regexes in the Ruby libraries I have tried fail to match it. I am currently using the validates_url_format_of Ruby library. Example URL: "https://www.θεραπευτικη-κανναβη.com.gr" Punycode URL: "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr"

Can you suggest whether the problem lies in the library's regex or in the conversion to Punycode?

As far as I know, the Punycode prefix is always xn--. What do the two extra hyphens mean here?

score 0 · Answer 1 · answered Jun 10 '19 at 19:28


"https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr".match(/https?:\/\/w*\.xn----.*/)
=> #<MatchData "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr">

Note that this URL matcher is not perfect: it only checks the scheme, a run of w characters, and the xn---- prefix.
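Since a single pattern over the whole URL is easy to get wrong, one alternative is to let Ruby's standard URI library extract the host and then inspect its labels. This is a minimal sketch, assuming the URL is already in Punycode form; the method name punycode_labels is made up for illustration and is not part of any library:

```ruby
require "uri"

# Hypothetical helper: returns the labels of the URL's host that carry
# the "xn--" prefix, however many hyphens follow it.
def punycode_labels(url)
  host = URI.parse(url).host.to_s
  host.split(".").select { |label| label.downcase.start_with?("xn--") }
end

p punycode_labels("https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr")
```

This only detects the prefix; it does not check that the rest of the label is valid Punycode.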

answered Jun 10 '19 at 19:28

dileep nandanam


score 0 · Answer 2 · edited Aug 09 '22 at 12:36

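The extra hyphens are not part of the prefix. Punycode first copies the ASCII characters of the label (here the -), then appends a - as a delimiter, and then appends the encoded non-ASCII characters. So a single - in the original label shows up as two hyphens right after xn--.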

For example:

áéíóú.com -> xn--1caqmy9a.com
á-é-í-ó-ú.com -> xn-------4na3c3a3cwd.com

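In the second example the label contains four hyphens, so the encoded form has xn-- followed by those four hyphens plus one more hyphen that Punycode inserts as a delimiter (RFC 3492), seven in total.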
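To see where the hyphens in the examples above come from, you can split an encoded label at its last hyphen: in Punycode (RFC 3492) everything before the last - is the ASCII part copied from the original label, and everything after it is the encoded non-ASCII part. A rough sketch; the method name split_punycode_label is hypothetical and this does not decode anything:

```ruby
# Hypothetical helper: splits an "xn--" label into
# [ASCII part copied from the original label, encoded non-ASCII part].
def split_punycode_label(label)
  body = label.delete_prefix("xn--")
  idx = body.rindex("-")            # the last hyphen is the delimiter
  return ["", body] if idx.nil?     # no ASCII characters in the original
  [body[0...idx], body[(idx + 1)..-1]]
end

%w[
  xn--1caqmy9a
  xn----ylbbafnbqebomc7ba3bp1ds
  xn-------4na3c3a3cwd
].each { |label| p split_punycode_label(label) }
```

For xn----ylbbafnbqebomc7ba3bp1ds the ASCII part is a single -, which is the hyphen from the original domain name.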

This one should work for you:

(xn--)(--)*[a-z0-9]+\.com\.gr

The beginning of the code: (xn--)
Zero or more pairs of hyphens: (--)*
The domain chars/numbers: ([a-z0-9]+)
The TLD of the domain: \.com\.gr

You can add http/https if you wish

Update:

After adding numbers to the URL I found that the regex needs a fix:

(xn--)(-[-0-9])*[a-z0-9]+\.com\.gr

á-1é-2í-3ó-4ú.com.gr -> xn---1-2-3-4-7ya6f1b6dve.com.gr
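As a quick check, here is the updated pattern run against the encoded labels from this thread in Ruby. This is only a sketch: the constant names are made up, the anchors \A and \z and the escaped dots are my additions, and every host is given the .com.gr suffix so that only the label varies:

```ruby
# Hypothetical constant; the updated pattern from above, anchored,
# with the dots escaped.
PUNYCODE_HOST = /\Axn--(-[-0-9])*[a-z0-9]+\.com\.gr\z/

HOSTS = %w[
  xn--1caqmy9a.com.gr
  xn----ylbbafnbqebomc7ba3bp1ds.com.gr
  xn-------4na3c3a3cwd.com.gr
  xn---1-2-3-4-7ya6f1b6dve.com.gr
].freeze

HOSTS.each { |host| puts "#{host}: #{PUNYCODE_HOST.match?(host)}" }
```

Keep in mind that the pattern only allows hyphens and digits in front of the encoded part, so a label whose original name has ASCII letters next to a hyphen would not match.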
