9

I'm trying to locate some bad data that has been inserted into a table. Likely by someone doing a copy/paste from Word then inserting into the database.

I have seen the similar questions like Query for finding rows with special characters

but this doesn't quite work for what I'm needing. Essentially I want to only return back a data set not including any standard characters and catch things such as an endash (just one example).

I have tried using something like this

SELECT * FROM mytable WHERE email LIKE '%[^0-9a-zA-Z \-@\.]%'

but it returns back every single single record.


EDIT

In case it is of benefit for anyone else that comes along later. Ultimately the issue I was having was due to the placement of the hyphen (-) character as was also noted by sgmoore below. I moved this to the beginning of my range just following the not (^) character.

Also, based on the info provided by gbn that LIKE is not actually using regexes I revisited the Microsoft documentation here SQL Server LIKE Statement. I was using the backslash character unnecessarily as an escape character due to my assumption it was the same as a regex. These were unnecessary, and apparently escape characters are only needed with wildcard characters. The doc I linked also mentions using an ESCAPE clause following the LIKE range to specify what character is to be used as an escape character e.g. WHERE percent_complete LIKE '%50!%' ESCAPE '!' would match a string that actually ends in 50% (50%, 150%).

Here is what I ended up using to screen my email data for bad characters; for me it works, but it may not be complete for all cases.

SELECT * FROM mytable WHERE email LIKE '%[^-0-9a-zA-Z_@.]%'

also if it is helpful, I needed to do something similar on a couple of other generic text fields; this far from comprehensive, but it narrowed my result set down to just a handful of records that I was then able to visually determine what I was looking for.

SELECT * from mytable WHERE text_field LIKE '%[^-0-9a-zA-Z @.''?:/,+&();_]%'

Community
  • 1
  • 1
bigtunacan
  • 4,873
  • 8
  • 40
  • 73
  • 2
    +1 seems ok to me this question - not sure why it is marked -1 – whytheq Apr 03 '13 at 16:45
  • @whytheq, Thanks I'm not quite sure why I'm getting so many down votes on this question. Apparently none of the down voters cares to comment as to why either :( – bigtunacan Apr 03 '13 at 16:57

3 Answers3

7

Try

SELECT * FROM mytable WHERE email LIKE '%[^0-9a-zA-Z @\.\-]%'

It would look like the position of the - sign on your version is causing problems.

sgmoore
  • 15,694
  • 5
  • 43
  • 67
  • Just saw your post. Indeed the positioning of the hyphen was an issue. I moved it to the front of my range. I also removed the backslashes as they were apparently unnecessary as well. – bigtunacan Apr 03 '13 at 17:58
2

Use double negatives

... WHERE email NOT LIKE '%[^0-9a-zA-Z ,-@\.]%'

Sample data would be useful too

gbn
  • 422,506
  • 82
  • 585
  • 676
  • That is not the issue the ^ character at the beginning of my regex match set means NOT contained in the following. When I run for example ... WHERE name LIKE '%[^0-9a-zA-Z ]%' I get an empty result set as I would expect. Here I am also trying to screen out hypens, periods, and the @ character, so something in my syntax is not quite right. – bigtunacan Apr 03 '13 at 16:00
  • @bigtunacan: did you try it? Also, LIKE is not regex. You only need to escape opening brackets http://stackoverflow.com/questions/3661125/sql-server-like-containing-bracket-characters/3661145#3661145 – gbn Apr 03 '13 at 16:08
  • Sorry, you are correct that it is not regex, but acts similarly to a limited subset of a regex range. The ^ character at the beginning of the range indicates match only that which is NOT contained in the range as seen here in the official Microsoft docs http://msdn.microsoft.com/en-us/library/ms179859.aspx – bigtunacan Apr 03 '13 at 16:14
  • @bigtunacan: did you try it? – gbn Apr 03 '13 at 16:28
  • Sorry; yes I tried dropping out the ^ and using NOT, the results were the same. I noticed on the Microsoft documentation an example where they are using the hyphen in the range and it is at the beginning. So I tried changing my range from '%[^0-9a-zA-Z -@.]%' to '%[^-0-9a-zA-Z @.]%'. This gave me the results I was expecting. Maybe this is documented somewhere that the hyphen must be at the beginning; if so I missed, but anyway it works. – bigtunacan Apr 03 '13 at 16:51
  • Also, thanks for pointing out that it not actually a regex; I was originally also trying to escape characters using \ like I would with a regex such as the hyphen and period, but these are apparently not special characters in the SQL range. – bigtunacan Apr 03 '13 at 16:53
1

Presumably, every email has a @ character as well as .. You might try:

SELECT * FROM mytable WHERE email LIKE '%[^0-9a-zA-Z ,\]%'

If your original list is what you really want, then you need to escape -:

SELECT * FROM mytable WHERE email LIKE '%[^0-9a-zA-Z ,\-@\.]%'
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Sorry; forgot to escape the hyphen when typing it in the original question. Even with the hyphen character escaped it is still returning all results. I have also dropped the comma as was unnecessary for the email field. I've updated my question above; issue is still the same. – bigtunacan Apr 03 '13 at 16:06