2

I want to create a function that will update a column of type varchar to a preferred string that is referenced in the column of another table to help me clean this column more iteratively.

CREATE TABLE big_table (
    mn_uid NUMERIC PRIMARY KEY,
    user_name VARCHAR
    );

INSERT INTO big_table VALUES
        (1, 'DAVE'),
        (2, 'Dave'),
        (3, 'david'),
        (4, 'Jak'),
        (5, 'jack'),
        (6, 'Jack'),
        (7, 'Grant'); 

CREATE TABLE nameKey_table (
    nk_uid NUMERIC PRIMARY KEY,
    correct VARCHAR,
    wrong VARCHAR
    );

INSERT INTO nameKey_table VALUES
        (1, 'David', 'Dave_DAVE_dave_DAVID_david'),
        (2, 'Jack', 'JACK_jack_Jak_jak');

I want to perform the following procedure:

UPDATE big_table
SET user_name = (SELECT correct
                 FROM nameKey_table 
                 WHERE wrong 
                 LIKE '%DAVE%')
WHERE user_name = 'DAVE';

but looped over each user_name in big_table so that I have a function that can do something like this:

UPDATE big_table SET user_name = corrected_name_fn();

Here is my attempt to do something like this but I can't seem to get it to work:

CREATE FUNCTION corrected_name_fn() RETURNS VARCHAR AS $$
DECLARE entry RECORD;
DECLARE correct_name VARCHAR;
BEGIN 
FOR entry IN SELECT DISTINCT user_name FROM big_table LOOP
     EXECUTE 'SELECT correct 
              FROM nameKey_table
              WHERE wrong 
              LIKE ''%$1%''' 
              INTO correct_name
              USING entry;
            RETURN correct_name;
            END LOOP;
END;    
$$ LANGUAGE plpgsql;

I want the final output in big_table to be:

| mn_uid |  user_name |
|   1    | 'David'    |
|   2    | 'David'    |
|   3    | 'David'    |
|   4    | 'Jack'     |
|   5    | 'Jack'     |
|   6    | 'Jack'     |
|   7    | 'Grant'    |

I realize rows 6 and 7 provide two unique cases that I want to build into the function with IF ELSE statements.

  1. If user_name is in nameKey_table.correct, go to next
  2. If user_name is not in nameKey_table.correct or does not match a string in nameKey_table.wrong, leave as is.

Thanks for any help on this!!

nmsindt
  • 21
  • 1
  • 2

2 Answers2

0

You don't need a function; you can just update one table from the contents of another table:


UPDATE big_table dst
SET user_name = src.correct
FROM nameKey_table src
WHERE src.wrong LIKE '%' || dst.user_name || '%'
AND dst.user_name <> src.correct -- avoid idempotent updates
        ;

And if you need performance, dont rely on the LIKE operator, it cannot use indexes for leading %. Instead, use a lookup-table with one entry per row:


CREATE TABLE bad_spell (
    correct VARCHAR,
    wrong VARCHAR PRIMARY KEY -- This will cause an unique index to be created.
    );

INSERT INTO bad_spell VALUES
        ('David', 'Dave')
        ,('David', 'DAVE')
        ,('David', 'dave')
        ,('David', 'DAVID')
        ,('David', 'david')
        ,('Jack', 'JACK')
        ,('Jack', 'jack')
        ,('Jack', 'Jak')
        ,('Jack', 'jak')
        ;
-- This indexes could be temporary
CREATE INDEX ON big_table(user_name);

-- EXPLAIN
UPDATE big_table dst
SET user_name = src.correct
FROM bad_spell src
WHERE dst.user_name = src.wrong
AND dst.user_name <> src.correct -- avoid idempotent updates
        ;

SELECT* FROM big_table
        ;
wildplasser
  • 43,142
  • 8
  • 66
  • 109
0

It sounds like you want a trigger on the table. Here is my suggestion:

CREATE OR REPLACE FUNCTION tf_fix_name() RETURNS TRIGGER AS
$$
DECLARE
    corrected_name TEXT;
BEGIN

    SELECT correct INTO corrected_name FROM nameKey_table WHERE expression ~* NEW.user_name;
    IF FOUND THEN
        NEW.user_name := corrected_name;
    END IF;

    RETURN NEW;
END;
$$
    LANGUAGE plpgsql;


CREATE TEMP TABLE big_table (
    mn_uid INT PRIMARY KEY,
    user_name TEXT NOT NULL
);

CREATE TRIGGER trigger_fix_name
    BEFORE INSERT
    ON big_table
    FOR EACH ROW
    EXECUTE PROCEDURE tf_fix_name();

CREATE TEMP TABLE nameKey_table (
    nk_uid INT PRIMARY KEY,
    correct TEXT NOT NULL,
    expression TEXT NOT NULL
    );

INSERT INTO nameKey_table VALUES
        (1, 'David', '(dave|david)'),
        (2, 'Jack', '(jack|jak)');

INSERT INTO big_table VALUES
        (1, 'DAVE'),
        (2, 'Dave'),
        (3, 'david'),
        (4, 'Jak'),
        (5, 'jack'),
        (6, 'Jack'),
        (7, 'Grant');

SELECT * FROM big_table;

+--------+-----------+
| mn_uid | user_name |
+--------+-----------+
|      1 | David     |
|      2 | David     |
|      3 | David     |
|      4 | Jack      |
|      5 | Jack      |
|      6 | Jack      |
|      7 | Grant     |
+--------+-----------+
(7 rows)

Note: I think you can do what you want a lot easier with a case insensitive regular expression. And I also changed your primary keys to INTs. Not sure why they are numerics, but it doesn't really change the solutions. My solution was developed and tested on PostgreSQL 9.6.

David S
  • 12,967
  • 12
  • 55
  • 93