SQL query for entity names with small differences

Question

I have to query a very large .csv dataset from a large city that contains thousands of dirty data entries like the following example.

How would I make an SQL query that would catch all instances of records matching the following permutations of a company's name? Do I have to clean the .csv data somehow before exporting to SQL? Thanks in advance.

'UFS INDUSTRIES INC. DBA SALLY SHERMAN FOODS',
'UFS INDUSTRIES INC. DBA SALLY SHERMAN FOODS.',
'UFS INDUSTRIES INC., DBA SALLY SHERMAN FOODS',
'UFS INDUSTRIES INC., DBA, SALLY SHERMAN FOODS',
'UFS INDUSTRIES INCORPORATED',
'UFS INDUSTRIES INCORPORATED DBA SALLY SHERMAN FOOD',
'UFS INDUSTRIES INCORPORATED DBA SALLY SHERMAN FOODS',

SQL might not be the best place to do this analysis. Rather, maybe something like R or MATLAB would work better. You're going to need to define the scope of the problem, what constitutes the actual reference name, etc. — Tim Biegeleisen, Jul 24 '18 at 06:15
If you are looking at only the provided data you could use WHERE LEFT(the-string, 18) = 'UFS INDUSTRIES INC' — Sjoerd, Jul 24 '18 at 06:26
Do you know the company list in advance? If so, maybe select distinct using like expression? — Y.S, Jul 24 '18 at 06:26
You know the name of the companies and then: SELECT * FROM table WHERE ColumnName LIKE 'UFS%INDUSTRIES%INC%DBA%SALLY% SHERMAN%FOODS' — 2SRTVF, Jul 24 '18 at 06:34

score 0 · Accepted Answer · answered Jul 24 '18 at 07:14

0

I am adding this here as I can't put a comment because of reputation and I think you will find it important.

Cleaning Messy data in SQL

answered Jul 24 '18 at 07:14

Vishesh

30
5

SQL query for entity names with small differences

1 Answers1