
My project is importing a sizable collection (500K+ rows) of data from flat Excel files, which are manually created by a team of people. The problem is that it all needs to be normalized for client searching. For example, the company field will have multiple company spellings and include branches, such as "IBM", then "IBM Inc.", "IBM Japan", etc. Additionally, I have product names that are alphanumeric, such as "A46-Rhizonme Pentahol", which SOUNDEX alone cannot handle.

I can solve the issue in the long term by routing all data input through a web form with an AJAX auto-suggest. Until then, however, I still need to deal with the massive collection of existing data. This brings me to what I believe is a good process, based on what I've read here:

http://msdn.microsoft.com/en-us/magazine/cc163731.aspx

Steps to create a custom fuzzy-logic lookup and fuzzy-logic grouping:

  1. tokenize strings into keywords
  2. calculate keyword TF-IDF (term frequency-inverse document frequency)
  3. calculate the Levenshtein distance between keywords
  4. calculate Soundex on the available alpha strings
  5. determine the context of keywords
  6. place keywords, based on context, into separate DB tables, such as "Companies", "Products", "Ingredients"
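The tokenize / TF-IDF / Levenshtein steps above can be prototyped outside the database before committing anything to tables. A minimal sketch in Python (illustrative only; the company list is just the example data from the question):

```python
import math
import re
from collections import Counter

def tokenize(s):
    # lowercase and split on non-alphanumerics, keeping codes like "a46" intact
    return [t for t in re.split(r"[^a-z0-9]+", s.lower()) if t]

def tf_idf(docs):
    # docs: list of token lists; returns one {token: tf-idf weight} dict per doc
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

def levenshtein(a, b):
    # classic dynamic-programming edit distance, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

companies = ["IBM", "IBM Inc.", "IBM Japan", "Microsoft"]
docs = [tokenize(c) for c in companies]
# a token shared across many spellings ("ibm") gets a low weight,
# while a distinguishing token ("japan") gets a high one
print(tf_idf(docs)[2])
# near-duplicate keywords have a small edit distance
print(levenshtein("rhizonme", "rhizonome"))  # 1
```

Tokens with low TF-IDF weight (like "inc" or the shared "ibm") are candidates to ignore when grouping, and a small Levenshtein distance between the remaining keywords flags likely misspellings of the same entity.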

I've been Googling, searching StackOverflow, reading over MySQL.com discussions, etc. about this issue, to attempt to find a prebuilt solution. Any ideas?

Apollo Clark

2 Answers


So, I gave up and just made a string-tokenizing function for MySQL. Here's the code:

CREATE DEFINER = `root`@`localhost` FUNCTION `strtok`(in_string VARCHAR(255), delims VARCHAR(255), str_replace VARCHAR(255))
 RETURNS varchar(255)
    DETERMINISTIC
BEGIN
    DECLARE str_len, delim_len, a, b, is_delim INT;
    DECLARE z, y VARBINARY(1);
    DECLARE str_out VARBINARY(256);
    SET str_len = CHAR_LENGTH(in_string), delim_len = CHAR_LENGTH(delims), a = 1, b = 1, is_delim = 0, str_out = '';

    -- walk the input one character at a time
    WHILE a <= str_len DO
        SET z = SUBSTRING(in_string, a, 1);
        -- loop through the delimiters
        WHILE b <= delim_len AND is_delim < 1 DO
            SET y = SUBSTRING(delims, b, 1);
            -- check the character against each delimiter
            IF z = y THEN
                SET is_delim = 1;
            END IF;
            SET b = b + 1;
        END WHILE;

        IF is_delim = 1 THEN
            SET str_out = CONCAT(str_out, str_replace);
        ELSE
            SET str_out = CONCAT(str_out, z);
        END IF;

        -- reset for the next character (MySQL strings are 1-indexed)
        SET b = 1;
        SET is_delim = 0;
        SET a = a + 1;
    END WHILE;
    RETURN str_out;
END;

It's called like this:

strtok("this.is.my.input.string",".,:;"," | ")

and will return

"this | is | my | input | string"

I hope someone else finds this useful. Cheers!
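For comparison (not part of the stored function itself), the same per-character replacement is a one-liner in application code, e.g. in Python:

```python
import re

def strtok(in_string, delims, str_replace):
    # build a character class from the delimiter string (escaped so
    # regex metacharacters like "." are treated literally) and replace
    # each matching character with str_replace
    return re.sub("[" + re.escape(delims) + "]", str_replace, in_string)

print(strtok("this.is.my.input.string", ".,:;", " | "))
# this | is | my | input | string
```

The application-code version is worth having when the rows are already being pulled out anyway; the stored function earns its keep when the tokenized result is consumed inside a larger query.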

Apollo Clark
  • I like this. The only way to tokenize strings in MySQL is with a stored function like yours. – Ben English Dec 16 '11 at 20:25
  • 1
    Yeah, it's a bit annoying, though MySQL amusingly has REGEX support. I'm a PHP developer, but I've been getting more into MySQL and trying to move more of the logic into the DB for optimization. – Apollo Clark Dec 20 '11 at 13:58
  • I think in case of doing string manipulation you're better off doing that in your application code. Of course there are some situations where performance-wise you'd be forced to use MySQL, but it's a task the RDBMS isn't particularly well suited for. I'm sure you saw REGEXP on MySQL and started salivating, only to realize it wasn't the swiss army knife it is in so many other languages. – Ben English Dec 20 '11 at 20:13
  • 1
    Interesting... may I ask whether you have done any timing comparisons between tokenization using PHP and using your MySQL function? – mike rodent Jun 15 '13 at 08:19
  • @mike - No, I haven't, since performance will depend on the use case. If you only want to tokenize a couple of rows, do it in PHP: query for the rows, tokenize in PHP, and either output to HTML / INSERT / UPDATE, etc. If, however, you are doing a moderately complex search or filter in MySQL, using this tokenization function beats pulling out every row to process it in PHP. http://www.onextrapixel.com/2010/06/23/mysql-has-functions-part-5-php-vs-mysql-performance/ – Apollo Clark Jun 17 '13 at 13:41

You should check out Google Refine.

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Chewie
  • Great resource, thanks! I'll have to see how well it integrates, since this is a highly manual solution. I'm curious to see if the process can be automated. – Apollo Clark Dec 16 '11 at 16:16