
I need to make a URL pattern which could work with this URL:

mysite.com/blog/12/بلاگ-مثال

It contains utf-8 characters so I tried using \X:

re_path(r'^blog/?P<blog_id>[\d+]+/(?P<slug>[\X.*]+)/$', views.single_blog, name='single_blog')

But it didn't work, and I don't know why; maybe it's just because I'm not good at regex. So I tried a different pattern using just .* to accept anything:

re_path(r'^blog/?P<blog_id>[\d+]+/(?P<slug>[.*]+)/$', views.single_blog, name='single_blog')

But this also doesn't work and I get:

The current path, blog/12/بلاگ-مثال, didn't match any of these.

As I mentioned, I'm not good at regex, so what's the right way to fix this?

Is it the right time to say now I have two problems or regex is the only way?

meshy
Ghasem
  • 1
    It didn't work because `\X` is not supported by Python `re` and `[.*]+` matches 1+ dots or asterisks, but not any chars. I guess you need `r'^blog/(?P<blog_id>\d+)/(?P<slug>[^/]+)/$'`. If the `/` is optional, add `?` after it, `r'^blog/(?P<blog_id>\d+)/(?P<slug>[^/]+)/?$'`. – Wiktor Stribiżew Jun 29 '18 at 20:34

2 Answers


Your approach to match something did not work since \X is not supported by Python re and [.*]+ matches 1+ dots or asterisks, but not any chars (because you put .* into [...] character class where they denote literal symbols, not special chars).

Besides, [\d+]+ is also a character class matching any digit or +, 1 or more times, so there is also a problem.
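To make the character-class point concrete, here is a small sketch using Python's re module directly, outside Django. The variable names are only illustrative.

```python
import re

# Inside [...] the characters '.', '*' and '+' are literals, not operators.
dots_or_stars = re.compile(r'[.*]+')
digits_or_plus = re.compile(r'[\d+]+')

# Matches: the string consists only of dots and asterisks.
only_punct = dots_or_stars.fullmatch('..*')

# No match: ordinary letters are not in the class [.*].
plain_word = dots_or_stars.fullmatch('abc')

# Matches: '+' is accepted as a literal next to the digits.
plus_accepted = digits_or_plus.fullmatch('1+2')

print(only_punct is not None, plain_word is None, plus_accepted is not None)
```

So [.*]+ cannot match a slug made of letters, and [\d+]+ is looser than intended because it also accepts a literal +.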

You may use a [^/] negated character class to match any char but /:

r'^blog/(?P<blog_id>\d+)/(?P<slug>[^/]+)/?$'

Details

  • ^ - start of input
  • blog/ - a literal substring
  • (?P<blog_id>\d+) - Group "blog_id": 1+ digits
  • / - a /
  • (?P<slug>[^/]+) - Group "slug": 1+ chars other than /
  • /? - an optional /
  • $ - end of string.
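As a quick check outside Django, the pattern above can be tried with plain re against the path from the question. This is only a sketch: parse_blog_path is a hypothetical helper, not part of Django, and it assumes the path arrives without a leading slash, as in the error message quoted in the question.

```python
import re

# The pattern from the answer, compiled once.
BLOG_URL = re.compile(r'^blog/(?P<blog_id>\d+)/(?P<slug>[^/]+)/?$')


def parse_blog_path(path):
    """Return (blog_id, slug) for a matching path, or None otherwise."""
    m = BLOG_URL.match(path)
    if m is None:
        return None
    return int(m.group('blog_id')), m.group('slug')


# Non-ASCII slugs are fine: [^/] accepts any character except '/'.
result = parse_blog_path('blog/12/بلاگ-مثال')

# The trailing slash is optional because of '/?'.
with_slash = parse_blog_path('blog/12/بلاگ-مثال/')

# An extra path segment is rejected, since the slug may not contain '/'.
nested = parse_blog_path('blog/12/a/b')

print(result, with_slash, nested)
```

In a Django URLconf the same raw string would be passed as the first argument to re_path, with views.single_blog receiving blog_id and slug as keyword arguments.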

Here is a regex demo (note that highlighting of characters from the Arabic script does not work there).

Wiktor Stribiżew
  • 2
    FWIW when I put it into Google translate it detects Persian. I believe there's a difference between Persian and Arabic, but not sure how pertinent it is for this solution. They both use the same alphabet https://ru.lingualinx.com/blog/farsi-vs-arabic-a-look-at-how-the-two-languages-match-up/ –  Jun 30 '18 at 11:08
  • @YvetteColomb Yes, they have letters in common but don't share the same alphabet. Besides, we can't conclude the language from a set of letters. – revo Jun 30 '18 at 11:28
  • 1
    @revo ah, sorry if I've offended you, I know little about the languages. Hm, I'm glad you edited, I was thinking of doing the same. –  Jun 30 '18 at 11:29
  • @YvetteColomb I wouldn't up-vote your comment if I felt offended. – revo Jun 30 '18 at 11:31
  • I expect the answerer to rollback this edit but I hope it doesn't happen. @YvetteColomb – revo Jun 30 '18 at 11:34
  • 1
    @YvetteColomb I haven't tried the answer yet. But the language is indeed Persian. Sure, they share the same alphabet except `گ پ ژ چ` which is used in my sample. So it's definitely Persian. Just wanted to clear things up between you two before I test it. *giggles* – Ghasem Jun 30 '18 at 16:11
  • 2
    @AlexJolig Regex is language-unaware. It can only match *characters* that various languages may share. If my name is written as `Стрибижев`, I do not say it is composed of Russian chars, I'd say *Cyrillic*, because these are characters from the Cyrillic script. Here, the characters you have in the post are from the Arabic script, hence, I used "Arabic". – Wiktor Stribiżew Jun 30 '18 at 17:22

Is it the right time to say now I have two problems ...

In fact, you have chosen the right tool for this job.

The other answer seems valid but can't tolerate having the word Persian in it. I'm posting this answer to point out why your own regex doesn't work as expected.

  1. ?P<blog_id>[\d+]+

You probably meant a named group here, the same as the one you used later in the regex, but you left out the opening and closing parentheses: (?P<blog_id>[\d+]+). Also, [\d+] is a character class consisting of digits and +. You need to remove the +: (?P<blog_id>[0-9]+)

  2. (?P<slug>[\X.*]+)

The group construction is fine, but the character class is not. \X has no special meaning inside a character class, and Python's re module doesn't support it at all. .* is no exception: inside a character class almost all special tokens are treated literally.

So, in regex flavors that tolerate \X there, [\X.*] matches an X, a . or an asterisk *. You need to change it to something more general like [^/]+, which matches up to the first slash (that is, any character except a forward slash).
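Both points can be checked with plain re. This is a sketch under one assumption: that you are on a recent Python 3, where, as far as I know (3.6 onward), an unknown escape such as \X is rejected when the pattern is compiled. The test strings are made up for illustration.

```python
import re

# Without the parentheses, '/?' is just an optional slash and
# 'P<blog_id>' is plain literal text that must appear in the path.
original_prefix = re.compile(r'^blog/?P<blog_id>[\d+]+/$')

# Matches only when the path literally contains "P<blog_id>".
literal_match = original_prefix.match('blog/P<blog_id>12/')

# A real blog path does not match.
real_path = original_prefix.match('blog/12/')

# Python's re does not know \X; compiling the class is expected to fail.
try:
    re.compile(r'[\X.*]+')
    x_error = None
except re.error as exc:
    x_error = str(exc)

print(literal_match is not None, real_path is None, x_error)
```

So the first pattern from the question should fail as soon as it is compiled, and the second one can never match because of the missing parentheses and the [.*] class.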

revo