1

I would like to to shuffle some columns from a table in Postgres database. I have 2 millions rows. I need to update all not null values by another.

I need to keep the same dataset. It's not possible to have the same value two times. It's not possible to swap data with next values because, if I do the same process with another column I will keep the same link. It's to anonymise my database. Just need to shuffle data and keep the dataset.

exemple (change firstname and lastname):

id firstname lastname
1 albert einsten
2 isaac newton
3 curie
4 alexandre Graham Bell
5 thomas Edison

shuffle firstname column:

id firstname lastname
1 isaac Graham Bell
2 albert Edison
3 einsten
4 thomas newton
5 alexandre curie

How to do this with a speedy process?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
user3659832
  • 41
  • 1
  • 6
  • Are you just randomly shuffling the data in the column or is there an actual pattern? (Or, why did you swap 1/2 and 4/5, not, for example, 1/4 and 2/5)? – EdmCoff Oct 13 '22 at 22:21
  • Yes I want to randomly shuffling the data. Swap 1/2 or 1/4 or 1/5 is good, just I need to avoid to have isaac again or null value. My problem is how to avoid to have two time the same value in the colum. For example only one time isaac, albert, thomas, and alexandre but in random order, like my exemple – user3659832 Oct 13 '22 at 22:27

2 Answers2

0

Given the updated requirement, you might be better off using a strategy like Erwin Brandstetter's solution that can easily be applied to an arbitrary number of columns, however I will leave my original answer with an update for doing the second column.

Original answer (requirement to shuffle one column):

Given how general your requirements are about the order of the shuffle, I'm not sure how much this will help practically, but I think it answers your question:

update test
SET firstname = t2.firstname
FROM
(
 SELECT id, COALESCE(LAG(firstname, 1) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname
 FROM test t2
 WHERE firstname IS NOT NULL
) t2
WHERE test.id = t2.id

The idea here is that the inner query gets the values shifted by one (ignoring nulls). The COALESCE is used since the first one doesn't have a proceeding entry, so it falls back on some LAST_VALUE logic to get the last value (i.e. it behaves as if the shift loops around).

The surrounding UPDATE statement joins test to the subquery to actually update the data.

You can see it working in this Fiddle.

Updated (requirement to also shuffle a second column):

Given the updated requirement to also shuffle a second field, you could apply different logic there:

update test
SET firstname = t2.firstname,
  lastname = t2.lastname
FROM
(
 SELECT id, 
  COALESCE(LAG(firstname, 1) OVER (ORDER BY id), LAST_VALUE(firstname) OVER (ORDER BY id RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)) firstname,
  COALESCE(LEAD(lastname, 1) OVER (ORDER BY id), FIRST_VALUE(lastname) OVER (ORDER BY id RANGE UNBOUNDED PRECEDING)) lastname
 FROM test t2
 WHERE firstname IS NOT NULL AND lastname IS NOT NULL
) t2
WHERE test.id = t2.id

This simply shuffles the lastname in the opposite direction, so firstname is grabbed from the previous non-null row (wrapping around) and lastname is grabbed from the next non-null row (wrapping around). Both columns will be changed.

Here is a fiddle of it working

EdmCoff
  • 3,506
  • 1
  • 9
  • 9
  • Thank for your reply. I forgot to talk about a rule... It's not possible for my to have a rule to move data to one step. Because it's for anonymise data, I will random some colums. For example in my example I will be shuffle firstname and last name. So, if it's the same logic, I will have the same data. I will edit my question, sorry for forgetting this information – user3659832 Oct 13 '22 at 23:17
  • @user3659832 Is the problem that you don't want the last name to stay on the same id or that it's not random because it was only shifted by one? – EdmCoff Oct 13 '22 at 23:37
  • The problem is apply the same logic with another column for example lastname, the is no change only all id will change. lastname and firstname will keep the same combinaison. – user3659832 Oct 14 '22 at 05:48
  • @user3659832 I have updated my answer to shuffle the other column. – EdmCoff Oct 14 '22 at 14:29
0

This shuffles values in the column firstname in a perfectly random fashion:

UPDATE test t0
SET    firstname = t2.firstname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test WHERE firstname IS NOT NULL) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test WHERE firstname IS NOT NULL) t2 USING (rn)
WHERE  t0.id = t1.id
AND    t0.firstname IS NOT NULL;

"Perfectly random" includes the possibility that some columns might retain their original values. (The more rows the smaller the chance.) This is actually best for anonymizing data. Then values are truly random. If we force a switch, readers will get the minimum information that a different value was associated with a given ID.

It also observes your surprising rule to only shuffle not null values.

Repeat for every column you need to shuffle.

Without excluding NULL values, this single query works more cheaply for any number of columns:

UPDATE test t0
SET    firstname = t2.firstname
     , lastname  = t3.lastname
FROM  (SELECT row_number() OVER (ORDER BY random()) AS rn, id        FROM test) t1
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, firstname FROM test) t2 USING (rn)
JOIN  (SELECT row_number() OVER (ORDER BY random()) AS rn, lastname  FROM test) t3 USING (rn)
WHERE  t0.id = t1.id;

If id is a gap-less sequence we can remove t1 from the equation, and join t0.id to t2.rn. (Wouldn't work while excluding NULL values.)

fiddle

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • not possible to avoid having the same value after random ? – user3659832 Oct 14 '22 at 13:04
  • All possible. But a truly random distribution is the best solution to anonymize. Either way, the case is not clear enough for me to spend more time. Exact table definition, Postgres version. Why freeze NULL values? Role of ID values? – Erwin Brandstetter Oct 14 '22 at 23:23