I need to check some HTML files and extract the URLs that do not point to either of two websites.

After many tests I came up with this:

/(http|https)?:?(\/\/)\w*\.*\-*[^(mysite.com)]\w*\.?\S*/igm

It works reasonably well, but not perfectly.

For example, as you can see HERE on regexr.com, it matches

// End

but not

www.demo.com

when it should be the opposite. But if I add a ? after (\/\/), it becomes a useless catch-all.

Also, if the URL has a " at the beginning and at the end, which clearly happens often, the regex correctly leaves out the opening " but wrongly captures the closing one.

Finally, it should also not match theothermysite.net, but I have not understood how to combine OR with negation :-(

Can anyone help, please?

Joe


1 Answer

Like this?

/((http|https):(\/\/)|www\.)\w*\.*\-*[^(mysite.com)(theothermysite.net)]\w*\.?[^\s\t\r\n\"]*/igm

I just added an "or www." alternative, replaced \S with its components plus \" so the closing quote is no longer captured, and added theothermysite.net to the negated character class, the same way you already did with mysite.com.
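One caveat: a construct like [^(mysite.com)] is a negated character class, so it rules out single characters (m, y, s, and so on) rather than a whole host name. A common alternative for excluding whole domains is a negative lookahead. The following is only a sketch under assumptions: it uses JavaScript regex syntax (as regexr.com does by default), and the names BLOCKED_HOSTS, escapeRegExp and extractExternalUrls are hypothetical helpers invented for illustration. It also requires a scheme or a leading www., so protocol-relative //host URLs are not covered.

```javascript
// Assumption: the two sites to skip; adjust the list as needed.
const BLOCKED_HOSTS = ["mysite.com", "theothermysite.net"];

// Escape regex metacharacters so each host name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns every URL in `html` whose host is not
// a blocked host or a subdomain of one.
function extractExternalUrls(html) {
  const blocked = BLOCKED_HOSTS.map(escapeRegExp).join("|");
  const pattern = new RegExp(
    // A URL starts with http://, https:// or a bare www.
    "(?:https?:\\/\\/|www\\.)" +
    // Negative lookahead: reject if optional subdomains are followed by a
    // blocked host that ends there (so mysite.com.example.org still passes).
    "(?!(?:[\\w-]+\\.)*(?:" + blocked + ")(?![\\w.-]))" +
    // Consume the rest, stopping at whitespace, quotes or angle brackets.
    "[^\\s\"'<>]+",
    "gi"
  );
  return html.match(pattern) || [];
}

const sample =
  '<a href="http://www.demo.com/page">a</a> ' +
  '<a href="https://mysite.com/x">b</a> www.other.org';
console.log(extractExternalUrls(sample));
// → [ 'http://www.demo.com/page', 'www.other.org' ]
```

The OR lives inside the lookahead as a plain alternation (mysite\.com|theothermysite\.net), which is how "OR with negation" can be expressed without a character class.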

Fabian N.