
I have a URL that, when converted to Punycode, contains a label with the prefix xn----, which none of the regexes in the Ruby libraries I tried will match. I am currently using the validates_url_format_of Ruby library. Example URL: "https://www.θεραπευτικη-κανναβη.com.gr" Punycode URL: "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr"

Is the problem in the library's regex, or in the conversion to Punycode?

As far as I understand the Punycode conversion rules, the prefix is always xn--. What do the two extra hyphens (--) mean here?

Sumit Sharma

2 Answers

"https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr".match(/https?:\/\/w*\.xn----.*/)
=> #<MatchData "https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr">

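Note that this URL matcher is not perfect: it only checks the scheme, an optional run of w characters, and the xn---- prefix, and accepts anything after that.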

dileep nandanam

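When a label contains ASCII characters such as -, Punycode copies them unchanged to the start of the encoded label and then appends one more - as a delimiter before the encoded non-ASCII part. A single - in the name therefore shows up as -- right after the xn-- prefix.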

For example:

áéíóú.com -> xn--1caqmy9a.com
á-é-í-ó-ú.com -> xn-------4na3c3a3cwd.com

So the extra hyphens come from the Punycode algorithm itself, not from a conversion error.
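Splitting the labels at their last hyphen makes the layout visible: whatever sits before the last hyphen is ASCII copied from the original name, and whatever follows it is the encoded part. This is only a sketch; the helper name `split_punycode_label` is made up for illustration, and it splits the string without decoding anything.

```ruby
# Hypothetical helper: splits an "xn--" label into the ASCII part that was
# copied verbatim and the encoded remainder. It does not decode the label.
def split_punycode_label(label)
  body = label.delete_prefix("xn--")
  # The last "-" in the body is the delimiter. When the body has no "-",
  # rpartition returns an empty copied part and the whole body as encoded.
  copied, _delimiter, encoded = body.rpartition("-")
  [copied, encoded]
end

%w[
  xn--1caqmy9a
  xn----ylbbafnbqebomc7ba3bp1ds
  xn-------4na3c3a3cwd
].each do |label|
  copied, encoded = split_punycode_label(label)
  puts "#{label}: copied=#{copied.inspect} encoded=#{encoded.inspect}"
end
```

For the label from the question the copied part is a single "-", which together with the delimiter gives the two extra hyphens.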

This one should work for you:

(xn--)(--)*[a-z0-9]+\.com\.gr

The beginning of the code: (xn--)
Zero or more pairs of hyphens: (--)*
The domain letters/digits: [a-z0-9]+
The TLD of the domain, with the dots escaped so that they match literally: \.com\.gr

You can add http/https if you wish
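If the goal is to validate the whole URL rather than one specific domain, one option is to let Ruby's standard `URI` parser extract the host and then check each label against the usual letters-digits-hyphens shape. The sketch below is an assumption about what "valid" should mean here, not what validates_url_format_of does, and the method name `plausible_http_url?` is hypothetical.

```ruby
require "uri"

# Hypothetical check: true when the URL parses as http/https and every host
# label is 1 to 63 letters, digits or hyphens, without a leading or trailing
# hyphen. Hyphens inside a label, including runs like "xn----", are allowed.
def plausible_http_url?(url)
  label = /[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?/i
  hostname = /\A(?:#{label}\.)+#{label}\z/

  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  hostname.match?(uri.host)
rescue URI::InvalidURIError
  false
end

puts plausible_http_url?("https://www.xn----ylbbafnbqebomc7ba3bp1ds.com.gr")
puts plausible_http_url?("https://www.-bad-.com.gr")
```

This does not depend on how many hyphens follow xn--, so it should not need adjusting for each new domain.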


Update:

After adding numbers to the URL I found that the regex needs a fix:

(xn--)(-[-0-9])*[a-z0-9]+\.com\.gr

á-1é-2í-3ó-4ú.gr.com -> xn---1-2-3-4-7ya6f1b6dve.gr.com
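A quick way to see what the updated pattern covers is to run it against a couple of hosts. In the sketch below the dots are escaped and the `\A`/`\z` anchors are added for the test; xn--bcher-kva is the well-known encoding of "bücher", and attaching it to .com.gr is purely an illustration.

```ruby
# Updated pattern with escaped dots; the anchors are an addition for this sketch.
updated = /\Axn--(?:-[-0-9])*[a-z0-9]+\.com\.gr\z/

hosts = %w[
  xn----ylbbafnbqebomc7ba3bp1ds.com.gr
  xn--bcher-kva.com.gr
]

# Map each host to whether the pattern accepts it.
results = hosts.map { |host| [host, updated.match?(host)] }.to_h
results.each { |host, ok| puts "#{host}: #{ok}" }
```

The second host is rejected because ASCII letters were copied in front of the delimiter hyphen, a case the pattern does not allow for.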
Panciz