0

The main question is a bit short, so I'll elaborate. I'm building an app for Twitter with which you can do the basic actions (get posts, post, reply, etc.).

Now I figured it would be a good idea to check the 140-character limit in my app. So far so good, but then someone asked if I could also handle the URL-shortener thing.

So at the moment I have a regex that picks up most (in fact too many) URLs, takes their length, and either adds or deducts the difference from the 140 max. It's still a bit buggy, but I can manage that.
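
The counting approach described above could be sketched roughly as follows. This is only a minimal sketch, not the app's actual code: `countTweetLength`, `simpleUrlPattern`, and the `shortUrlLength` parameter are hypothetical names, and the deliberately naive pattern only catches explicit http(s):// links.

```javascript
// Hypothetical helper: length of a tweet once every matched URL has been
// replaced by a short link of a fixed length.
function countTweetLength(text, urlPattern, shortUrlLength) {
  let length = text.length;
  // String.prototype.matchAll requires a regex with the global flag.
  for (const match of text.matchAll(urlPattern)) {
    // Swap the URL's real length for the shortened length. This can make
    // the count go up (very short URLs) or down (long URLs).
    length += shortUrlLength - match[0].length;
  }
  return length;
}

// Assumed, intentionally naive pattern: only explicit http(s) links.
const simpleUrlPattern = /https?:\/\/\S+/g;

// Characters left out of the 140 maximum, assuming 20-character short links.
const tweet = "Read this: http://example.com/a/very/long/path/to/some/article";
const remaining = 140 - countTweetLength(tweet, simpleUrlPattern, 20);
```

Passing the pattern and the short-link length in as arguments keeps the arithmetic separate from the harder question of what counts as a URL.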

Now my problem....

It seems Twitter is quite picky about what it considers a URL. I've got the most basic ones (starting with http(s):// and such), but Twitter also readily replaces some TLDs ((www.)google.com and [whatever].net/.biz/.info are just a few of them), but not .nl, .de, or .tk.

Now I was wondering if perhaps someone has found out which ones they do and which ones they don't 'shorten'.

Because I'm pretty sure my regex isn't the best either, I'll drop it here as well:

((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)|([\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)
Sjaak van der Heide

3 Answers

1

http://support.twitter.com/articles/78124-how-to-shorten-links-urls# indicates that all URLs posted to Twitter will be rewritten to be exactly 19 characters long.

tripleee
  • I already found that page. At the moment it's 20 (21 for https). "Until recently, all HTTP-based t.co links on Twitter have been only 19 characters long. Part of the August 15th phasing is to increase t.co URL's default length to 20 characters. While we don't anticipate the length of t.co links to change frequently, we wanted to remind you to check the fields short_url_length and short_url_length_https from GET help/configuration daily, rather than relying on a hard-coded value." From the twitter dev page. – Sjaak van der Heide May 09 '12 at 08:37
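
Following the quoted advice, the per-link length can be read from the configuration values instead of being hard-coded. A minimal sketch, assuming `config` is the already-parsed JSON object from GET help/configuration (fetching and authentication are left out). `shortLengthFor` and `exampleConfig` are hypothetical names, and treating scheme-less links like http links is an assumption.

```javascript
// Hypothetical helper: pick the shortened length for one link from the
// short_url_length / short_url_length_https fields named in the comment above.
function shortLengthFor(url, config) {
  // Assumption: anything that is not explicitly https counts as http.
  return /^https:\/\//i.test(url)
    ? config.short_url_length_https
    : config.short_url_length;
}

// Placeholder using the values quoted in the comment above; in a real app
// this object would come from the API response and be refreshed daily.
const exampleConfig = { short_url_length: 20, short_url_length_https: 21 };
```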
0

I am using this:

var url_expression = /[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?/gi;

Nobody has complained :)

izeed
0

I figured it out: I found a pretty important line on the TLD wiki page. It states that all country TLDs are two characters long, and also the other way around: all two-character TLDs are country TLDs. With that in mind, I started testing a bunch of them with Twitter, and I'm pretty sure I now know which URLs Twitter shortens and which ones it doesn't.

  • All URLs starting with http:// or https://
  • All URLs like [something].[non-country TLD], e.g. .com, .biz, .mobi (except .arpa and .aero)
  • All URLs like [something].[something].[valid TLD], including country TLDs

  • Links like http://[user]:[pass]@[something].[tld] will NOT be shortened
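
The rules above can be sketched as a small classifier that runs on a single URL candidate. It is only a sketch under stated assumptions: `isLikelyShortened`, `GENERIC_TLDS`, and `THREE_LABEL_ONLY_TLDS` are hypothetical names, the generic TLD list is an incomplete 2012-era sample, every two-letter TLD is treated as a country TLD without checking that it is actually assigned, and non-ASCII (IDN) domains are not handled.

```javascript
// Assumed sample of generic TLDs (2012-era, not exhaustive).
const GENERIC_TLDS = new Set([
  "com", "asia", "cat", "coop", "edu", "int", "tel", "pro", "org", "net",
  "gov", "mil", "biz", "info", "mobi", "name", "jobs", "museum", "travel",
]);
// Valid TLDs that, per the list above, are not shortened on a bare
// [something].[tld]. Accepting them with three or more labels is a literal
// reading of the third rule, not something that was tested separately.
const THREE_LABEL_ONLY_TLDS = new Set(["arpa", "aero"]);

function isLikelyShortened(candidate) {
  const schemeMatch = /^https?:\/\/(.*)$/i.exec(candidate);
  if (schemeMatch) {
    // Explicit http(s) links are shortened unless they carry user:pass@.
    const authority = schemeMatch[1].split(/[\/?#]/)[0];
    return !/^[^@:]+:[^@]*@/.test(authority);
  }

  // No scheme: only the host part (before any port, path, query, fragment).
  const host = candidate.split(/[\/?#:]/)[0].toLowerCase();
  const labels = host.split(".");
  const allLabelsValid = labels.every((label) => /^[a-z0-9-]+$/.test(label));
  if (labels.length < 2 || !allLabelsValid) return false;

  const tld = labels[labels.length - 1];
  const isGeneric = GENERIC_TLDS.has(tld);
  const isCountry = /^[a-z]{2}$/.test(tld);

  if (labels.length === 2) {
    // [something].[tld]: generic TLDs only; country TLDs, .arpa, .aero are out.
    return isGeneric;
  }
  // [something].[something].[valid tld]: country TLDs count as well.
  return isGeneric || isCountry || THREE_LABEL_ONLY_TLDS.has(tld);
}
```

For example, `isLikelyShortened("google.com")` and `isLikelyShortened("www.example.nl")` return `true`, while `isLikelyShortened("example.nl")` and `isLikelyShortened("http://user:pass@example.com")` return `false`. Splitting candidate detection (a permissive regex) from this yes/no decision is one way to avoid packing every rule into a single pattern.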

Now to build a regex for it; I'll post it here as soon as I think I have it :D

This is what I have so far:

/(^(?:(?:ht|f)tp(?:s?)\:\/\/|~\/|\/)?(?:(?:[-\w]+\.)+(?:com|asia|cat|coop|edu|int|tel|pro|org|net|gov|mil|biz|info|mobi|name|jobs|museum|travel|([a-z]{2})))(?::[\d]{1,5})?(?:(?:(?:\/(?:[-\w~!$+|.,=\(\)]|%[a-f\d]{2})+)+|\/)+|\?|#)?(?:(?:\?(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)(?:&(?:[-\w~!$+|.,*:]|%[a-f\d]{2})+=?(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)*)*(?:#(?:[-\w~!$+|.,*:=]|%[a-f\d]{2})*)?)/gim;

There is still one major flaw in it: it also accepts a bare [domain].[tld], which Twitter doesn't shorten (going by the list above, that applies to country TLDs).

I hope this will help someone in the future. I'm pretty sure there isn't much easy-to-find info about this on the web (or at least I couldn't find it).

Sjaak van der Heide
  • What do you mean by "tld wikipage"? Have you tried IDNA domain names like http://en.wikipedia.org/wiki/.rf too? – tripleee May 09 '12 at 09:22
  • with tld wikipage I mean [link](http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains) and I just checked the link you provided, and yes Twitter also shortens that. – Sjaak van der Heide May 09 '12 at 11:44