1

Background

I've got a PostgreSQL table, db1, that's got several columns in it but no primary key. I need to add one. Postgres is version 13-point-something, running on a local server on a 64bit Windows PC. Here's a toy version of the table:

+-------------------+------------+
|member             |service_date|
+-------------------+------------+
|eof81j4            |2011-01-06  |
|eof81j4            |2010-06-03  |
|eof81j4            |2010-01-12  |
|eof81j4            |2011-05-21  |
|j42roit            |2015-11-29  |
|j42roit            |2015-11-29  |
|j42roit            |2015-11-29  |
|p8ur0fq            |2014-01-13  |
|p8ur0fq            |2016-04-04  |
|p8ur0fq            |2014-01-13  |
|vplhbun            |2019-08-15  |
|vplhbun            |2019-08-15  |
|vplhbun            |2019-08-15  |
|akj3vie            |2009-03-31  |
+-------------------+------------+

Note here that the table isn't ordered in any obvious way: the member column isn't in ABC order, say, and the service_date column isn't ordered chronologically or anything.

The Problem

Typically, I'd add a primary key to a table like so:

ALTER TABLE db1 ADD COLUMN id SERIAL PRIMARY KEY

And that works just fine. However, because of some queries I have to run later on, I'd like the primary key to be applied as if the table were ordered (sorted, in other words) first by member (ascending alphabetical order) and, within that, chronologically (from oldest to newest) by service date. In other words I'd like something like this:

+-------------------+------------+--+
|member             |service_date|id|
+-------------------+------------+--+
|akj3vie            |2009-03-31  |1 |
|eof81j4            |2010-01-12  |2 |
|eof81j4            |2010-06-03  |3 |
|eof81j4            |2011-01-06  |4 |
|eof81j4            |2011-05-21  |5 |
|j42roit            |2015-11-29  |6 |
|j42roit            |2015-11-29  |7 |
|j42roit            |2015-11-29  |8 |
|p8ur0fq            |2014-01-13  |9 |
|p8ur0fq            |2014-01-13  |10|
|p8ur0fq            |2016-04-04  |11|
|vplhbun            |2019-08-15  |12|
|vplhbun            |2019-08-15  |13|
|vplhbun            |2019-08-15  |14|
+-------------------+------------+--+

What I've Tried

Attempts at cracking this have gone something like this:

ALTER TABLE db1 ADD COLUMN id SERIAL PRIMARY KEY ORDER BY member, service_date 

But that yields [42601] ERROR: syntax error at or near "ORDER". I've looked for similar posts on SO, but solutions seem to involve creating a new table, and I'd prefer not to do that (as the real table I'm going to be doing this on is very very large).

EDIT

Upon reflection, and thanks in part to Gordon Linoff's answer below, I'm realizing that I don't actually need this new column to be a Primary Key in the formal sense that Postgres recognizes it as such. All I really need is a column with unique values for each row that are ordered in the way I described. Whether it's formally a PK or not doesn't actually matter. (In the original title and body of this post, I had asked for a Primary Key. I've edited things to reflect that that's not strictly necessary.)

logjammin
  • 1,121
  • 6
  • 21
  • 1
    There is no `ORDER BY` on a `PRIMARY KEY`. SQL data is unordered unless you use an `ORDER BY` in the selection query. – Adrian Klaver Sep 04 '21 at 19:21
  • So is what I'm asking for help with not possible, in your view? – logjammin Sep 04 '21 at 19:22
  • 1
    It is not possible. The closest you can come is [CLUSTER](https://www.postgresql.org/docs/12/sql-cluster.html) but that is basically a temporary thing for any table in which data is being updated. You can also set an `INDEX`, but first I would establish that it really necessary as an index has overhead. Right now I think you are in the realm of premature optimization. – Adrian Klaver Sep 04 '21 at 19:24
  • Can I order the table first somehow, and then apply the PK as I normally would? – logjammin Sep 04 '21 at 19:26
  • 1
    Never design a table where you need a surrogate key to have any property other than uniqueness. A huge list of events can cause gaps, out of order allocation, etc. Where order is relevent, use the `service_date` ***in your queries***, such as `ROW_NUMBER() OVER (ORDER BY member, service_date)`. *(To be more clear about your specific question; the answer is `don't try, you can't reliably achieve this, and should never need to`.)* – MatBailie Sep 04 '21 at 19:28
  • 1
    I repeat: 'SQL data is unordered'. And as of now you have not established there is a problem. – Adrian Klaver Sep 04 '21 at 19:28
  • 1
    Representing/managing a(n ordered) list in a relational/SQL DB is a faq. – philipxy Sep 04 '21 at 23:29

1 Answers1

2

One approach is to add the id column, set the value and then convert it to an identity. You have to be careful, because you want the numbering to start at 15 (for your sample data):

alter table db1 add column id int not null default 0;
update db1
    set id = d.new_id
    from (select db1.*, db1.ctid, row_number() over (order by member, service_date) as new_id
          from db1
         ) d
    where db1.ctid = d.ctid;

alter table db1 alter column id drop default;

alter table db1 alter column id add generated always as identity  (start with 15) ;

alter table db1 add primary key (id);

Here is a db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Just know an `IDENITY` is backed by `sequence` and a `sequence` is not guaranteed to be gap less or necessarily sequential. Per @MatBailie you are counting on something to happen that may very well not. – Adrian Klaver Sep 04 '21 at 19:40
  • @Gordon, this works just fine on my end. I'm not sure that I'm going to use it on my real dataset, if for no other reason that I won't be able to audit whether it works properly on all ~200m rows, but I'd say this is a great proof of concept. Reflecting now, I don't actually need the column to formally be PK -- I just needed a unique sequence of some kind according to the parameters I mentioned. I'll update my post to reflect this. – logjammin Sep 04 '21 at 20:03
  • 1
    @logjammin . . . You can use the same code, just remove the primary key constraint. However a serial/identity column would normally be the primary key anyway. – Gordon Linoff Sep 04 '21 at 20:34