0

(I'm using SQLite3)

My table has fields like

Name
Zipcode
State
City
State
StreetAddress

The issue is that in the name field, people entered data that is inconsistently formatted

For example,

Name (single column)
====================
John Smith MD
Jason Young M.D
Charlie Horse MD.
Kim Powell PH.d
Sandy Fox Phd M.D.

I want to clean up the MD to be saved as M.D. and Phd/PH.d to be PhD (please remember that the name is 1 field).

The street address also has issues like multiple spaces between words instead of 1.

Can someone please explain what I can do to format the data? I could do it manually but it would take too long!

RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
Cocoa Dev
  • 9,361
  • 31
  • 109
  • 177

6 Answers6

2

You shouldn't simply accept data from a user interface and immediately persist it in a database without validating and scrubbing. You're safer from SQL injection that way, too.

I don't know what object(s) you have between your view and persistence tiers, but I'd suggesting adding a validation and binding layer in your controller or service layer, preferably the controller.

duffymo
  • 305,152
  • 44
  • 369
  • 561
  • 1
    +1. I would also add that if you tend to get bad quality input, the UI can often be adjusted to make it easier for users to do the right thing. – Mark Thomas Jan 23 '11 at 19:17
  • Exactly. Where else will you route errors to users to tell them what to do next? I missed spelling out that detail, but you caught it. – duffymo Jan 23 '11 at 19:18
  • The users enter data in a webpage. The data is stored on a 3rd party site. (which requires a username/password to access). I'm given the task of making a Desktop application which will contain all the data from the site. Someone will copy/paste data from the 3rd party site into the application. So I want to run some SQL query (if possible) to clean it up – Cocoa Dev Jan 23 '11 at 19:44
2

You can try using Google Refine (a free downloadable program) http://code.google.com/p/google-refine/ Great for all sorts of data cleaning...

dartdog
  • 10,432
  • 21
  • 72
  • 121
1

This is a series of string translations. Most of them are easy, and regex's may not be required for many of them.

Simple algorithm:

1) Split the name on spaces

2) discard empty values, these are the results of two or more spaces

3) Take each value, convert to upper case, remove periods, dashes or anything else and look for it in a table of known suffixes, like "MD", "PHD", and replace it with the desired value.

4) From step 3, anything that does not match, leave as-is

Obviously this requires you to build a list of known suffixes like MD, etc. This is not that bad. Large professional packages that do this in Direct Marketing handle it that way.

5) Recombine the resulting values

There is no definition of correct for formatting names, it is a question of how far you want to go for how much uniformity you require.

Adding to that, Duffy's advice is sound.

Ken Downs
  • 4,707
  • 1
  • 22
  • 20
1

For the address, you may want to validate it against your region/country specific address database.

I want to clean up the MD to be saved as M.D. and Phd/PH.d to be PHd (please remember that the name is 1 field).

Find all the variations of MD and PHD and run a series of updates (replace function):

update tbl set name = replace(name, 'M.D.', 'MD')
update tbl set name = replace(name, 'MD.', 'MD')
update tbl set name = replace(name, 'M.D', 'MD')
update tbl set name = replace(name, 'M.D', 'MD')
update tbl set name = replace(name, ' MD', ' MD')  -- fix case, hope no name starts with "MD"
update tbl set name = replace(name, 'PHD', 'PhD')  -- fix case
update tbl set name = replace(name, 'PH.D', 'PhD')  -- fix case
update tbl set name = replace(name, 'PH-D', 'PhD')  -- fix case
etc for any other variants you can think of or encounter

The street address also has issues like multiple spaces between words instead of 1

Replace two spaces with one, multiple times.. after removing linebreaks. To more quickly collapse series of spaces, we use 8->1 twice, then 4->1, then 2->1

update tbl set address = replace(address, '\r', ' ')
update tbl set address = replace(address, '\n', ' ')
update tbl set address = replace(address, '        ', ' ')
update tbl set address = replace(address, '        ', ' ')
update tbl set address = replace(address, '    ', ' ')
update tbl set address = replace(address, '    ', ' ')
update tbl set address = replace(address, '  ', ' ')
update tbl set address = replace(address, '  ', ' ')
Community
  • 1
  • 1
RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
0

All of the above are sound and probably necessary. I'd like to add that it is also preferable to limit the input on fields as much as possible. For example the suffixes field should be limited to pre-validated values by using a drop-down list. This simplifies the data entry and data verification.

Dave
  • 1,234
  • 13
  • 24
-1

Regular expressions and string operations will may work for finite subsets of address data but there are lots of exception cases in the address industry. The USPS is the authority here (at least for US addresses), so why not utilize a service that is certified with the USPS?

For the addresses, try out this batch processing service:

http://www.smartystreets.com/products/cass-certified-scrubbing

SmartyStreets also offers an API with a free subscription for low-usage users. I'm a software developer at SmartyStreets and have helped with both of these services.

Michael Whatcott
  • 5,603
  • 6
  • 36
  • 50
  • Thanks for the suggestion but I don't think StackOverflow is meant to promote services for money. – Cocoa Dev Jan 27 '12 at 16:46
  • I certainly didn't answer this question because I was trying to land a sale. I'm not a salesman. The purpose of SO is to find solutions to problems. It just so happens that some services which are targeted to solving complex, industry-specific problems also cost money. Most people with address data don't have any idea how inconsistent their data is and it's painful to see people trying to solve the problem the wrong way. It would be irresponsible of me to just ignore the fact that this service does cost money or to not respond at all. – Michael Whatcott Jan 27 '12 at 19:46
  • Hm, @CocoaDev, according to the StackOverflow FAQ, http://stackoverflow.com/faq, mdwhatcott's answer is totally valid and useful: "Post good, relevant answers, and if some (but not all) happen to be about your product or website, so be it." He has also disclosed his affiliation. I see nothing wrong with his answer. – Matt Feb 07 '12 at 19:49