notepad++ multipart regex replace on over 2000 files - confirmation request

Question

I need to apply a Notepad++ "Find in Files" replace operation against over 2000 files in 15 directories. I am new to this, so I'm hoping someone with experience can help me make sure I'm not making a mistake or missing something small. We do have a backup of the files, but I don't want to create small issues that creep up down the road because I formed this incorrectly or didn't account for something:

The following will be run (one folder at a time) on *.htm / *.html files in 2nd-level folders. These files currently reference same-level files/folders relatively (e.g. "filename.htm"), and we need those references to use a literal path from root (e.g. "/folder-name/filename.htm"). Of course, we don't want to affect any other type of URL reference.

So, the regex/replace should find all href="whatever" and src="whatever" statements, EXCLUDING those:

- starting with "http:
- starting with "/
- starting with "..
- including www. 
- starting with "#

...and prepend /folder-name/ to the URL.

Filter: *.htm;*.html
Find: (href|src)="((?!http:|\/|\.\.|.*www\.|#)[^"]+)"
Replace: \1="/folder-name/\2"

This seems to test out OK, but is this a proper regex for this operation (with Notepad++)? Is there anything this pattern isn't taking into account? Is there a better/safer way to do this? Thanks for any feedback.

For the reason seen below... right now I know just enough to be dangerous - hopefully that will change soon... ;-) — soyo, Sep 28 '14 at 08:30

score 3 · Accepted Answer · answered Sep 28 '14 at 07:34

Nope, there is a minor problem here:

Find: (href|src)="((?!http:|\/|\.\.|.*www\.|#)[^"]+)"
                                    ^^

.* matches everything except newlines, so the engine first runs to the end of the line and then backtracks to find www\., which makes the match inefficient. You can change it to [^"]*?www\. so that the search stops at the closing quote.
You would also want to combine \/|# into a character class, [\/#], because a character class is checked in a single step, whereas alternation with | tries each branch in turn.
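
The difference between .* and [^"]*? is not only about speed: .* is free to look past the closing quote of the attribute. Here is a minimal sketch in Python, whose `re` module is a different engine from the one in Notepad++, so treat it as an illustration rather than a guarantee. The sample line is hypothetical.

```python
import re

# The question's original pattern, and the same pattern with .* replaced
# by [^"]*? as suggested above.
ORIGINAL = re.compile(r'(href|src)="((?!http:|\/|\.\.|.*www\.|#)[^"]+)"')
BOUNDED = re.compile(r'(href|src)="((?!http:|\/|\.\.|[^"]*?www\.|#)[^"]+)"')

# Hypothetical line: a relative link, followed by unrelated text that
# happens to contain "www." later on the same line.
line = '<a href="page.htm">see www.example.com</a>'

# .* scans to the end of the line, finds "www." outside the attribute,
# and the negative lookahead rejects the relative link.
print(ORIGINAL.search(line))          # None

# [^"]*? cannot cross the closing quote, so the link is still matched.
print(BOUNDED.search(line).group(2))  # page.htm
```

With the original pattern, such a relative link would be skipped without any warning, which is the kind of small issue that is hard to spot across 2000 files.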

From there this replacement will work:

Find: (href|src)="(?!http:|[\/#]|\.\.|[^"]*?www\.)([^"]+)"
Replace: \1="/folder-name/\2"
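
Before running this on all folders, it can help to check the pattern against one sample of each case from the exclusion list. The following is a rough sketch in Python, assuming its `re` module behaves like the Notepad++ engine for this particular pattern (the two are different engines). The sample attribute values and the `rewrite` helper are made up for illustration.

```python
import re

# Find/Replace pair from the answer above; Python accepts \1 and \2
# in the replacement string.
PATTERN = re.compile(r'(href|src)="(?!http:|[\/#]|\.\.|[^"]*?www\.)([^"]+)"')
REPLACEMENT = r'\1="/folder-name/\2"'


def rewrite(html: str) -> str:
    """Prefix relative href/src values with /folder-name/."""
    return PATTERN.sub(REPLACEMENT, html)


samples = [
    # Relative references: these should be prefixed.
    'href="filename.htm"',
    'src="images/logo.gif"',
    # Cases from the exclusion list: these should stay as they are.
    'href="http://example.com/"',
    'href="/other/file.htm"',
    'href="../up.htm"',
    'href="www.example.com"',
    'href="#top"',
    # Cases NOT in the exclusion list: the pattern prefixes these too.
    'href="https://example.com/"',
    'href="mailto:someone@example.com"',
]

for sample in samples:
    print(f"{sample}  ->  {rewrite(sample)}")
```

Running it also surfaces cases the exclusion list does not mention: https: and mailto: values are not excluded by the lookahead, so they would be prefixed as well. Whether such links occur in the files is not stated in the question, so it is worth searching for them first.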

Unihedron

It's better to use `$1` in the replacement part instead of `\1`, which is reserved for the regex part, and there's no need to escape the slashes. – Toto Sep 28 '14 at 08:11
@M42 "It's better to use `$1` in the replacement part" Is this so? Does it improve readability, or anything else? I'm curious whether this is true, as most regex implementations (Ruby, PHP) deal with both `$0` and `\0`. – Unihedron Sep 28 '14 at 08:13
++ Exactly the kind of feedback I was looking for, thanks much! I will look closer at how that [^"]*?www combo works. Forced to learn this stuff on the fly so the help here is INVALUABLE... – soyo Sep 28 '14 at 08:13
@soyo Try out a regex demo here: http://regex101.com/r/qD0gB4/1. This uses PCRE, so the outcome will be slightly different from that of the Notepad++ engine. – Unihedron Sep 28 '14 at 08:15
You may find information here: http://perldoc.perl.org/perlretut.html#Backreferences and http://stackoverflow.com/a/7559780 – Toto Sep 28 '14 at 08:31
@M42 Thanks for the insightful references to relevant threads, but they are about Perl regex engines. Notepad++ uses [PCRE](http://pcre.org) since version 6. – Unihedron Sep 28 '14 at 08:33
PCRE comes from perl regex (Perl Compatible Regular Expression). – Toto Sep 28 '14 at 08:36
As stated in http://pcre.org/pcre.txt, the notation `\nnn` is a backreference; it is not to be used in the replacement part. – Toto Sep 28 '14 at 08:45
