
I have an AutoHotkey script that looks up a word in a bilingual dictionary when I double-click any word on a webpage. If I click on something like "l'homme", the l' is copied to the clipboard along with the homme. I want the script to strip out everything up to and including the apostrophe.

I can't get AutoHotkey to match the apostrophe. Below is a sample script that prints the character codes of the first four characters. If I double-click "l'homme" on this page, it prints: 108, 8217, 104, 111. The second value is clearly not the ASCII code for an apostrophe (39). I think it most probably has something to do with the HTML representation of an apostrophe, but I haven't been able to get to the bottom of it. I've tried AutoHotkey's Transform, HTML command without any luck.

I've tried both the Unicode and non-Unicode versions of AutoHotkey. I've saved the script in UTF-8.

#Persistent
return
OnClipboardChange:
;debugging info:
c1 := Asc(SubStr(clipboard,1,1))
c2 := Asc(SubStr(clipboard,2,1))
c3 := Asc(SubStr(clipboard,3,1))
c4 := Asc(SubStr(clipboard,4,1))
Msgbox 0,info, char1: %c1% `nchar2: %c2% `nchar3: %c3% `nchar4: %c4%

;the line below is what I want to use, but it doesn't find a match
 stripToApostrophe:= RegExReplace(clipboard,".*’")
keith.uk

1 Answer

There is the standard quote ' and there is the "curling" quote ’.

Your regex might have to be

.*['’]

to cover both cases.

Maybe you'd like to make it non-greedy, too, if a word can have more than one apostrophe and you only want to remove the first:

.*?['’]

EDIT:

Interesting. I tried this:

w1 := "l’homme"
w2 := "l'homme"
c1 := Asc(SubStr(w1,2,1))
c2 := Asc(SubStr(w2,2,1))
v1 := RegExReplace(w1, ".*?['’]")
v2 := RegExReplace(w2, ".*?['’]")
MsgBox 0,info, %c1% - %c2% - %v1% - %v2%
return

And got back 146 - 39 - homme - homme. I'm editing from Notepad. Is it possible that our regex, while we think we're typing 8217, actually has 146 upon our pasting?
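One way to check that hunch outside AutoHotkey is a small Python sketch. Python is used here only for illustration, and the sketch assumes Notepad saved the script in the Windows-1252 "ANSI" code page. It suggests that the curly quote is code point 8217 in Unicode but is stored as the single byte 146 in Windows-1252, which would account for both numbers.

```python
# Assumption: the script file was saved by Notepad as Windows-1252 ("ANSI").
curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK, the "curling" quote

unicode_code_point = ord(curly)             # the value a Unicode-aware lookup reports
ansi_byte = curly.encode("cp1252")[0]       # the byte a Windows-1252 file stores for it
round_trip = bytes([146]).decode("cp1252")  # byte 146 read back as Windows-1252

print(unicode_code_point, ansi_byte, round_trip == curly)
```

If that assumption holds, a pattern typed with a literal ’ in an ANSI-saved script would be looking for 146, while the clipboard text contains 8217.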

EDIT:

Apparently Unicode support was added only in AutoHotkey_L. Using it, I believe the correct regex should be either

".*?[\x{0027}\x{0092}\x{2019}]"

or

".*?(" Chr(0x0027) "|" Chr(0x0092) "|" Chr(0x2019) ")"
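The idea behind both forms, naming the apostrophe variants by code point instead of typing them, can be tried out in a short Python sketch. This is only an illustration under stated assumptions: Python's `re` module is not AutoHotkey's regex engine, it spells the escapes as `\x27`, `\x92` and `\u2019`, and the helper name `strip_to_apostrophe` is made up for this example.

```python
import re

# Code points are written as escapes so the file's encoding cannot alter them:
# U+0027 straight quote, U+0092 (the Windows-1252 byte value), U+2019 curly quote.
APOSTROPHES = re.compile("^.*?[\x27\x92\u2019]")

def strip_to_apostrophe(word: str) -> str:
    """Hypothetical helper: drop everything up to and including the first apostrophe."""
    return APOSTROPHES.sub("", word, count=1)

print(strip_to_apostrophe("l\u2019homme"))   # curly apostrophe
print(strip_to_apostrophe("l'homme"))        # straight apostrophe
print(strip_to_apostrophe("\u00e9t\u00e9"))  # accented word without an apostrophe
```

Because the class lists only apostrophe characters, a word with accented letters and no apostrophe should pass through unchanged.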
Andrew Cheong
  • I've tried it with both types of quote, and neither worked. The curling quote in the script above is copied directly from the article in the link. I'm sure the answer lies in understanding why AHK prints 8217 as the ascii code for the second character of l'homme. The other characters are correct 108:l 104:h 111:o – keith.uk Sep 07 '12 at 14:34
  • Yes, I was planning to look into making it non-greedy, but first I want to get the basics working. I think the server is probably sending something beginning with & as an HTML escape code. – keith.uk Sep 07 '12 at 14:45
  • Well, 8217 is the proper apostrophe, which I called the curling quote, ``’``. Maybe AHK is buggy regarding the character. Can you try escaping it? ``.*\’`` – Andrew Cheong Sep 07 '12 at 14:47
  • Ah, so clipboard might contain "’" you're saying? That's tricky. Your regex then may need to be ``.*?(?:'|’|’|’|'|’)`` to cover all the bases (hah). – Andrew Cheong Sep 07 '12 at 14:50
  • I've tried the version below, but still no good. stripToApostrophe:= RegExReplace(clipboard,".*?(?:'|’|’|’|'|’)") – keith.uk Sep 07 '12 at 14:58
  • Try two more things. First, ``.*?[^a-zA-Z0-9]``, as that should definitely work. Second, ``.*?('|’|\u2019)``. – Andrew Cheong Sep 07 '12 at 15:09
  • The first one is working--I'd wondered about trying that one myself. The second one isn't working - it seems to be returning an empty string. Thanks a lot for your help on this. I can go with the first solution if needs be. It would be interesting to understand why the basic version isn't working, but the main thing is that I now have a working script. – keith.uk Sep 07 '12 at 15:21
  • Regarding the notepad version in your edit, above, I don't know whether Notepad supports Unicode. If not, it might have silently converted it to a different character. – keith.uk Sep 07 '12 at 15:30
  • Ah, that would make sense, re: Notepad. I made one more edit that may be relevant, but glad the other solution works for now. Good luck. – Andrew Cheong Sep 07 '12 at 15:49
  • Thanks. The examples you gave with 0027, 0092, and 2019 are the best solution. The earlier examples, which used [A-Za-z], didn't exclude accented characters, such as éèê, etc. – keith.uk Sep 07 '12 at 17:36