Iterate through table, perform calculation on each row

Question

I would like to preface this by saying I am VERY new to SQL, but my work now requires that I work in it.

I have a dataset containing topographical point data (x,y,z). I am trying to build a KNN model based on this data. For every point 'P', I search for the 100 points in the data set nearest P (nearest meaning geographically nearest). I then average the values of these points (this average is known as a residual), and add this value to the table in the 'resid' column.

As a proof of concept, I am trying to simply iterate over the table, and set the value of the 'resid' column to 1.0 in every row.

My query is this:

CREATE OR REPLACE FUNCTION LoopThroughTable() RETURNS VOID AS '
DECLARE row table%rowtype;
BEGIN
    FOR row in SELECT * FROM table LOOP
        SET row.resid = 1.0;
    END LOOP;
END

' LANGUAGE 'plpgsql';

SELECT LoopThroughTable() as output;

This code executes and returns successfully, but when I check the table, no alterations have been made. What is my error?

So if I execute the UPDATE query after the loop has finished, will that commit the changes to the table? — Canadian_Marine, Mar 08 '16 at 20:08

score 23 · Accepted Answer · answered Mar 08 '16 at 22:09

Doing updates row-by-row in a loop is almost always a bad idea and will be extremely slow and won't scale. You should really find a way to avoid that.

After having said that:

All your function is doing is to change the value of the column value in memory - you are just modifying the contents of a variable. If you want to update the data you need an update statement:

You need to use an UPDATE inside the loop:

CREATE OR REPLACE FUNCTION LoopThroughTable() 
  RETURNS VOID 
AS
$$
DECLARE 
   t_row the_table%rowtype;
BEGIN
    FOR t_row in SELECT * FROM the_table LOOP
        update the_table
            set resid = 1.0
        where pk_column = t_row.pk_column; --<<< !!! important !!!
    END LOOP;
END;
$$ 
LANGUAGE plpgsql;

Note that you have to add a where condition on the primary key to the update statement otherwise you would update all rows for each iteration of the loop.

A slightly more efficient solution is to use a cursor, and then do the update using where current of

CREATE OR REPLACE FUNCTION LoopThroughTable() 
  RETURNS VOID 
AS $$
DECLARE 
   t_curs cursor for 
      select * from the_table;
   t_row the_table%rowtype;
BEGIN
    FOR t_row in t_curs LOOP
        update the_table
            set resid = 1.0
        where current of t_curs;
    END LOOP;
END;
$$ 
LANGUAGE plpgsql;

So if I execute the UPDATE query after the loop has finished, will that commit the changes to the table?

No. The call to the function runs in the context of the calling transaction. So you need to commit after running SELECT LoopThroughTable() if you have disabled auto commit in your SQL client.

Note that the language name is an identifier, do not use single quotes around it. You should also avoid using keywords like row as variable names.

Using dollar quoting (as I did) also makes writing the function body easier

Excellent. This was the answer I was looking for. Thanks very much for your time. — Canadian_Marine, Mar 09 '16 at 22:48
What about if you have to alter multiple tables based on that row? — swade, Dec 20 '19 at 15:09

score 2 · Answer 2 · answered Mar 08 '16 at 21:27

I'm not sure if the proof of concept example does what you want. In general, with SQL, you almost never need a FOR loop. While you can use a function, if you have PostgreSQL 9.3 or later, you can use a LATERAL subquery to perform subqueries for each row.

For example, create 10,000 random 3D points with a random value column:

CREATE TABLE points(
  gid serial primary key,
  geom geometry(PointZ),
  value numeric
);
CREATE INDEX points_geom_gist ON points USING gist (geom);
INSERT INTO points(geom, value)
SELECT ST_SetSRID(ST_MakePoint(random()*1000, random()*1000, random()*100), 0), random()
FROM generate_series(1, 10000);

For each point, search for the 100 nearest points (except the point in question), and find the residual between the points' value and the average of the 100 nearest:

SELECT p.gid, p.value - avg(l.value) residual
FROM points p,
  LATERAL (
    SELECT value
    FROM points j
    WHERE j.gid <> p.gid
    ORDER BY p.geom <-> j.geom
    LIMIT 100
) l
GROUP BY p.gid
ORDER BY p.gid;

Does a lateral subquery need to be a select then? The only examples I've ever seen is a select since it is basically another FROM. Can you actually do updates to multiple tables in this way? — swade, Dec 20 '19 at 15:13

score -1 · Answer 3 · answered Apr 15 '20 at 07:06

Following is a simple example to update rows in a table:

Assuming the row id field id

Update all rows:

UPDATE my_table SET field1='some value'
WHERE id IN (SELECT id FROM staff)

Selective row update

UPDATE my_table SET field1='some value'
WHERE id IN (SELECT id FROM staff WHERE field2='same value')

score -2 · Answer 4 · answered Mar 08 '16 at 07:51

You don't need a function for that. All you need is to run this query:

UPDATE table SET resid = 1.0;

if you want to do it with a function you can use SQL function:

CREATE OR REPLACE FUNCTION LoopThroughTable()
  RETURNS VOID AS
$BODY$
UPDATE table SET resid = 1.0;
$BODY$
  LANGUAGE sql VOLATILE

if you want to use plpgsql then function would be:

CREATE OR REPLACE FUNCTION LoopThroughTable()
  RETURNS void AS
$BODY$
begin
       UPDATE table SET resid = 1.0;
end;
$BODY$
  LANGUAGE plpgsql VOLATILE

Note that it is not recommended to use plpgsql functions for tasks that can be done with Sql functions.

Again, I realize that for this particular functionality I do not need a for loop. All I am doing is trying to determine the proper syntax for iterating over every row in the table, because the ultimate goal is to perform a calculation on each row. — Canadian_Marine, Mar 08 '16 at 20:10

Iterate through table, perform calculation on each row

4 Answers4

Linked