3

I'm working on a script to anonymize a table with patient data. I generated a table containing 50,000 rows of anonymous data.

What I need is a number of columns in the patient table to be updated with data from the generated table.

Ofcourse I already read about updating tabels and how to select random row from another table. What I can't figure out is how to combine this in one query.

I've seen CTE as a possible solution, but I don't understand how it works. One of the main issues I have is that the generated table doesn't have any key in it and even if it did it shouldn't be relevant since I just want to iterate ALL rows of the patient table updating its values with a random row from the generated table.

I have the following:

Update Patients
Set Patients.pat_FirstName = fn.GivenName,
     pat_LastName = fn.SurName, 
     pat_StreetName = fn.StreetAddress,
     pat_PostalCode = fn.ZipCode,
     pat_City = fn.City,
     pat_DateOfBirth = fn.BirthDay,
from
     ( Select Top 1,
     GivenName, 
     SurName,
     StreetAddress,
     ZipCode,
     City,
     Birthday
from FakeNameGenerator tablesample(1000 rows)) as fn


but that executes the 'random' only once fill every row in the patient table with the same values. Like I said before, it can be (should be??) solved with CTE (tally?) tables, but how?

I'm close to grabbing C# and just code the darn thing...

Frank Kaaijk
  • 293
  • 1
  • 3
  • 12
  • 1
    @MahmoudGamal - There isn't one. It is supposed to be random. @ Frank you need to make sure that the sub query is correlated example here http://stackoverflow.com/a/12922951/73226 though that only updates one column. If you had supplied `CREATE TABLE` for both tables involved I might have looked at altering it. – Martin Smith Feb 06 '13 at 09:34
  • 1
    @MartinSmith - Very cool. It indeed only updates but one column. My interest has been sparked and although I'm quite happy calling this SQL for each column (it random data after all) I'm curious as to how to handle it in one row per query. Why do you need my create statements? I don't need to change the database. Do I? – Frank Kaaijk Feb 06 '13 at 10:17
  • 1
    What DBMS are you using? If you can use `CROSS APPLY` you can apply the same principle of @MartinSmith's answer in the link using something like `UPDATE Patients SET Patients.pat_FirstName = fn.GivenName,.... FROM Patients CROSS APPLY (SELECT TOP 1 .... ORDER BY NEWID(), Patients.PatientID) fn` – GarethD Feb 06 '13 at 10:40
  • @GarethD - From `tablesample` and `TOP` very likely SQL Server so your comment should probably be an answer. – Martin Smith Feb 06 '13 at 10:48
  • @GarethD - DBMS is MS SQL Server 2012. `Cross apply` almost does the trick, but for unknown reason (to me). It's reusing the 'random' query result for 19 rows and only then select a new 'random' row. Again for 19 rows. What does cross apply do? – Frank Kaaijk Feb 06 '13 at 10:54
  • Are you making sure that you are correlating on a unique column from `Patients`? It might be adding a spool and replaying the last result if the column is not unique and has the same value. – Martin Smith Feb 06 '13 at 11:01
  • @MartinSmith - Yes, I'm using the patientID which is the primary key. That should garantuee something. As matter of fact when the query is done `(445186 row(s) affected)` which is the number of records in patient. – Frank Kaaijk Feb 06 '13 at 11:06
  • 1
    Well there's nothing special about 19 rows. Is that behaviour consistently reproducible and occurs throughout all the table? If you post the XML for the actual execution plan I could take a look at it. – Martin Smith Feb 06 '13 at 11:09

3 Answers3

3

Another way of doing it would be to add a contiguous numeric column to the FakeNameGenerator table

ALTER TABLE FakeNameGenerator ADD ID INT NOT NULL IDENTITY(1,1)

CREATE UNIQUE NONCLUSTERED INDEX ix ON FakeNameGenerator(ID)

Then it just becomes a problem of generating a random number between 1 and 50,000

UPDATE P
SET P.pat_FirstName = F.GivenName /*...*/
FROM Patients P
INNER LOOP JOIN FakeNameGenerator F ON F.ID = (1 +  ABS(CRYPT_GEN_RANDOM(8)%50000))

The INNER LOOP JOIN hint enforces a nested loops join with Patients as the driving table. It seeks into FakeNameGenerator for each row re-evaluating the ID to seek on.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • 2
    I honestly think I learn at least one new thing every time I read one of your answers! – GarethD Feb 06 '13 at 12:34
  • I've adapted it to my tables like so: `UPDATE Patients SET pat_FirstName = fn.GivenName, pat_LastName = fn.SurName, pat_StreetName = fn.StreetAddress, pat_PostalCode = fn.ZipCode, pat_City = fn.City, pat_DateOfBirth = fn.BirthDay FROM Patients INNER LOOP JOIN FakeNameGenerator fn ON fn.ID = (1 + abs(CRYPT_GEN_RANDOM(8)%50000))`, but the result is the same a from where I started (the same value in all the rows). If you want to can I send the `CREATE SCRIPT` for Patient and FakeNameGenerator? I assume you don't need anything else. – Frank Kaaijk Feb 06 '13 at 13:09
  • @FrankKaaijk - The XML for the actual execution plan is the only thing I would be interested in seeing. – Martin Smith Feb 06 '13 at 13:33
  • @MartinSmith I tried this method to update the first/last names in a 500.000 rows table using a 4000 rows random name table. What happens is that about 7-8 names end up getting used 20.000 times, while the rest of the names get used 70-80 times. I re-run it for one of those names repeated 20.000 times with a `where P.FirstName = 'John'` and it selects another random name 20.000 times. All those John's become Mike. Only after repeating this a few times the query replaces Mike with random names. Then I have to repeat for the rest of names that were selected 20.000 each. Any idea what's going on? – artm Apr 21 '15 at 03:14
  • @artm not without seeing the exact queries and plans. Does the plan have a spool? – Martin Smith Apr 21 '15 at 08:23
  • @MartinSmith This is the query: `update P set FirstName = R2.FirstName, LastName = R.LastName from Person P inner loop join RandomMaleNames R on R.RandomMaleNamesID = (1 + ABS(CRYPT_GEN_RANDOM(8) % 4000)) inner loop join RandomMaleNames R2 on R2.RandomMaleNamesID = (1 + ABS(CRYPT_GEN_RANDOM(8) % 4000)) where P.PersonID not in (select PersonID from Users u where u.Internal = 1) AND P.Gender = 'MALE'` And yeah there're a few lazy spools in the plan. – artm Apr 24 '15 at 00:25
1

There is a much more detailed Q&A on using random sorting within subqueries here. But in short, I could not find a way to get this working by selecting a top 1 from random data. I had give both the source and the sample data a random id (using ROW_NUMBER) then join the two to get a random update:

WITH PatientCTE AS
(   SELECT  PatientID,
            pat_FirstName,
            pat_LastName,
            pat_StreetName,
            pat_PostalCode,
            pat_City,
            pat_DateOfBirth,
            rn = ROW_NUMBER() OVER(ORDER BY NEWID())
    FROM    Patients
), SampleData AS
(   SELECT  GivenName, 
            SurName, 
            StreetAddress, 
            ZipCode, 
            City, 
            Birthday,
            rn = ROW_NUMBER() OVER(ORDER BY NEWID())
    FROM    FakeNameGenerator
)
UPDATE  PatientCTE
SET     Patients.pat_FirstName = fn.GivenName,
        pat_LastName = fn.SurName, 
        pat_StreetName = fn.StreetAddress,
        pat_PostalCode = fn.ZipCode,
        pat_City = fn.City,
        pat_DateOfBirth = fn.BirthDay,
FROM    PatientCTE p
        INNER JOIN SampleData fn
            ON fn.rn = p.rn

EDIT

OK, so it appears my testing was not comparable and my original thought would work. With no DDL and data to test with I can't be sure, but this should work:

UPDATE  Patients
SET     Patients.pat_FirstName = fn.GivenName,
        pat_LastName = fn.SurName, 
        pat_StreetName = fn.StreetAddress,
        pat_PostalCode = fn.ZipCode,
        pat_City = fn.City,
        pat_DateOfBirth = fn.BirthDay,
FROM    Patients
        CROSS APPLY
        (   SELECT  TOP 1 
                    GivenName, 
                    SurName, 
                    StreetAddress, 
                    ZipCode, 
                    City, 
                    Birthday
            FROM    FakeNameGenerator TABLESAMPLE(1000 ROWS)
            ORDER BY NEWID(), Patients.Patient_ID
        ) fn
Community
  • 1
  • 1
GarethD
  • 68,045
  • 10
  • 83
  • 123
  • 1
    This assumes that `FakeNameGenerator` has `>=` the number of rows of `Patients` and is random in a different way (like shuffling cards rather than rolling a die) in that once picked a row from `FakeNameGenerator` cannot be picked again. – Martin Smith Feb 06 '13 at 11:24
  • @GarethD - The assumptions by MartinSmith are not applicable, the random list is almost 10x shorter than the actual number of patient rows. So when I use this answer it still produces the same result as my first attempt. It select the 'random' data once and fills the complete patients table with it. I'll go and read the Q&A on random subqueries. – Frank Kaaijk Feb 06 '13 at 11:45
  • @MartinSmith Fair point, I did countless testing and could not get the right behaviour with cross apply, but it turns out it was because I wasn't using a primary key from the the table to update within the order by. I've reverted to my original solution. – GarethD Feb 06 '13 at 11:57
  • @GarethD: This query doesn't run when I use `ORDER BY NEWID()`. SQL Server returns:`Msg 8152, Level 16, State 4, Line 1 String or binary data would be truncated. The statement has been terminated.` Why would you use order by newid()?? I thought that `TableSample` did that part for me. Doing seperate queries when the 'random' select returns a different record without `newid()` – Frank Kaaijk Feb 06 '13 at 12:00
  • How do I post the CREATE SCRIPT? – Frank Kaaijk Feb 06 '13 at 12:08
  • Here's a link on TableSample [link](http://stackoverflow.com/questions/10725839/tablesample-returns-wrong-number-of-rows) – Frank Kaaijk Feb 06 '13 at 12:16
  • 1
    I used order by newid to make the sorting random, although TABLESAMPLE does this it will only create one table sample for the whole update, using ORDER BY NEWID(), PatientID forces the subquery to reevaluate for every row. Are you sure NEWID is the cause of the error? The error doesn't sound like anything to do with NEWID, it sounds like you are trying to update a column with something too large for it's datatype. – GarethD Feb 06 '13 at 12:30
  • 1
    Using `newid()` with `tablesample` is a good idea anyway as otherwise `TOP 1` will always give you the first row on a page. It just picks random pages and doesn't randomise the results. – Martin Smith Feb 06 '13 at 12:33
  • @Gareth - You're right. The streetaddress/name is `nvarchar(75)` in the random table and `nvarchar(50)` in the patient table... – Frank Kaaijk Feb 06 '13 at 12:54
0

As an additional idea, the problem of GarethD's method is, that it requires more or an equal number of rows in the second table as in the first table.

So you can just do a cross join of the second table with the first one, and limit the results to the number of rows in the first table.

WITH PatientCTE AS
(
    SELECT  
         PatientID 
        ,pat_FirstName 
        ,pat_LastName 
        ,pat_StreetName 
        ,pat_PostalCode 
        ,pat_City 
        ,pat_DateOfBirth 
        ,rn = ROW_NUMBER() OVER(ORDER BY NEWID()) 
    FROM Patients
)
, SampleData AS
(
    SELECT TOP (SELECT COUNT(*) FROM PatientCTE )  
             GivenName 
            ,SurName 
            ,StreetAddress 
            ,ZipCode 
            ,City 
            ,Birthday 
            ,rn = ROW_NUMBER() OVER(ORDER BY NEWID())
    FROM FakeNameGenerator 

    CROSS JOIN PatientCTE 
)

UPDATE p
SET      p.pat_FirstName = fn.GivenName 
        ,p.pat_LastName = fn.SurName 
        ,p.pat_StreetName = fn.StreetAddress 
        ,p.pat_PostalCode = fn.ZipCode 
        ,p.pat_City = fn.City 
        ,p.pat_DateOfBirth = fn.BirthDay 
FROM PatientCTE AS p

INNER JOIN SampleData AS fn
    ON fn.rn = p.rn
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442