1

I have one table with every Fortune 1000 company name:

FortuneList:

------------------------------------------------
|fid        |  coname                          |
------------------------------------------------
| 1         | 3m                               |
| 2         | Amazon                           |
| 3         | Bank of America                  |
| 999       | Xerox                            |
------------------------------------------------

I have a 2nd table with every user on my newsletter:
MyUsers:

------------------------------------------------
|uid    |  name      | companyname             |
------------------------------------------------
| 1350  | John Smith |  my own Co              |
| 2731  | Greg Jones |  Amazon.com, Inc        |
| 3899  | Mike Mars  |  Bank of America, Inc   |
| 6493  | Alex Smith |  Handyman America       |
------------------------------------------------

How do I pull out every one of my newsletter subscribers that works for a Fortune 1000 company? (By scanning my entire MyUsers table for every record that has any of the coname's from the FortuneList table)

I would want output to pull:

------------------------------------------------
|uid   |  name       | companyname             |
------------------------------------------------
| 2731  | Greg Jones |  Amazon.com, Inc        |
| 3899  | Mike Mars  |  Bank of America, Inc   |
------------------------------------------------

(See how it finds "Amazon" in the middle of "Amazon.com, Inc")

David B
  • 11
  • 1
  • This seems to work for a few search terms. But I have 999, so I need a dynamic query: SELECT m.* FROM `myusers` m WHERE MATCH (`companyname`) AGAINST ('Amazon' IN BOOLEAN MODE) I need to make the 'AGAINST' part include the full 1000 names from FortuneList – David B Feb 08 '14 at 22:53
  • Would it be okay for the solution to not use `MATCH...AGAINST`? Or does it *have* to use the FullText Index? – hichris123 Feb 08 '14 at 23:05
  • @hichris123 I will take any answer that helps me solve the problem. I only used MATCH...AGAINST because that's the closest I got with my web research. I was originally going to write a PHP script w/ a bunch of loops & stuff, but thought it would be easier if I can do it in mySQL. – David B Feb 10 '14 at 03:52
  • @ShWivel I'm using PHPMyAdmin, so Oracle won't help me – David B Feb 10 '14 at 03:53

3 Answers3

1

Try using this, which uses an INNER JOIN, the LIKE operator, and CONCAT:

SELECT *
FROM MyUsers
INNER JOIN FortuneList
    ON FortuneList.coname LIKE CONCAT('%', MyUsers.companyname, '%)

(This wouldn't use your Full Text index, I'm trying to figure out how you could use a MATCH...AGAINST in a JOIN.)

hichris123
  • 10,145
  • 15
  • 56
  • 70
  • I'll give this a try Monday. I hope it works. :) I do not have to have a MATCH...AGAINST. :) I don't even mind if it takes a really long time to process. This isn't a web script or anything. I'm just trying to analyze my database. Thanks! – David B Feb 10 '14 at 03:55
1

If you were doing this in Oracle, this would yield your desired result (with the example data):

with    fortunelist as(
        select 1 as fid, '3m' as coname from dual union all
        select 2, 'Amazon' from dual union all
        select 3, 'Bank of America' from dual union all
        select 999, 'Xerox' from dual
        )
        , myusers as(
        select 1350 as usrid, 'John Smith' as name, 'my own Co' as companyname from dual union all
        select 2731, 'Greg Jones', 'Amazon.com, Inc.' from dual union all
        select 3899, 'Mike Mars', 'Bank of America, Inc' from dual union all
        select 6493, 'Alex Smith', 'Handyman America' from dual 
        )
select  utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as sim
        , myusers.companyname
        , fortunelist.coname
from    fortunelist
        , myusers
where   utl_match.jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80

The reason being, the Jaro Winkler result for the 2 you're after are 87 and 95 (Amazon and BOA, respectively). You can bump the 80 in the query up or down to make the matching threshold higher or lower. The higher you go, the fewer matches you'll have, but the more likely they will be. The lower you go, the more matches you'll have, but you risk getting matches back that aren't really matches. For instance, "Handyman America" vs. "Bank of America" = 73/100. So if you lowered it to 70, you would get a false positive, using your example data. Jaro Winkler is generally meant for people's names, not company names, however because company names are typically also very short strings, it may still be useful for you.

I know you tagged this as MySQL and while this function does not exist, from what I've read people have already done the work creating a custom function for it: http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/ http://dannykopping.com/blog/fuzzy-text-search-mysql-jaro-winkler

You could also try string replacements, ex. eliminating common reasons for a match not being found (such as there being an "Inc." on one table but not the other).

Edit 2/10/14:

You can do this in MySQL (via phpmyadmin) following these steps:

  1. Go into phpmyadmin then your database and paste the code from this URL link (below) into a SQL window and hit Go. This will create the custom function that you'll need to use in Step 2. I'm not going to paste the code for the function here because it's long, also it's not my work. It basically allows you to use the jaro winkler algorithm in MySQL, the same way you would with utl_match if you were using Oracle. http://androidaddicted.wordpress.com/2010/06/01/jaro-winkler-sql-code/

  2. After that function is created, run the following SQL:

-

select  jaro_winkler_similarity(myusers.companyname, fortunelist.coname) as similarity
        , myusers.uid
        , myusers.name
        , myusers.companyname as user_co
        , fortunelist.coname as matching_co
from    fortunelist
        , myusers
where   jaro_winkler_similarity(myusers.companyname, fortunelist.coname) >= 80

This should yield the exact result you're looking for, but like I said you'll want to play around with the 80 in that SQL and go up or down so that you have a good balance between avoiding false positives but also finding the matches that you want to find.

I don't have a MySQL database with which to test so if you run into an issue please let me know, but this should work.

Brian DeMilia
  • 13,103
  • 1
  • 23
  • 33
  • Thanks ShWiVel. I don't know Oracle & I'm not nearly that advanced. Sorry. – David B Feb 10 '14 at 03:56
  • Even though you're not using Oracle, someone has already created a custom function for MySQL (that you will be able to use with MySQL via phpmyadmin). I'll edit my post in a sec, but the first link I gave has the function that you would create, afterward you can use that function with syntax very similar to Oracle (and with the same functionality). – Brian DeMilia Feb 10 '14 at 22:38
  • On a side note, I don't know how large your tables are, but if that query takes a while to run, you can also join the first character of the company name between the 2 tables (and leave the query otherwise the same), that way you're filtering in on companies that start with the same letter right off the bat, and are only relying on that function to determine the similarity among however many rows between the 2 meet that condition for each company. – Brian DeMilia Feb 10 '14 at 23:14
0

Using LOCATE (no index thus):

select uid, name, companyname
from MyUsers JOIN FortuneList
WHERE LOCATE(coname, companyname) > 0
koriander
  • 3,110
  • 2
  • 15
  • 23