Recursively count occurrences with sqlite

Question

Consider the following example table

x_id	name_id1	name_id2
x1	John	Frank
x2	Frank	John
x3	Jack	John
x4	John	Jack
x5	Bob	Frank
x6	George	Bob
x7	Bob	Finn
x8	Mark	James
x9	James	Finn

The goal is to extract the following result

name	frequency
John	4
Bob	3
James	2

Conceptually this is the result of the following procedure:

Count the frequency of all names and pick the one that occurs most often, that would be John, which occurs 4 times (row x1 to x4).
Now remove all rows that contain John, which leaves you with row x5 to x9. Again determine the name that occurs most frequently. This gives you Bob, which occurs 3 times (row x5 to x7).
Now also remove row x5 to x7, which leaves us with row x8 to x9. Again determine the name that occurs most frequently. This gives us James which occurs 2 times.
Now also remove row x8 to x9, which leaves us with nothing, so we are done.

This data is stored in SQLite in a junction table as follows (in the real world case there can be more than two names per x_id)

id	x_id	name_id
1	x1	John
2	x1	Frank
3	x2	John
4	x2	Frank
5	x3	John
6	x3	Jack
7	x4	John
8	x4	Jack
9	x5	Bob
10	x5	Frank
11	x6	Bob
12	x6	George
13	x7	Bob
14	x7	Finn
13	x8	James
14	x8	Mark
13	x9	James
14	x9	Finn

What kind of procedure do we need to retrieve the described result? Considering that the above junction table is of variable length (just to make sure we do not come up with a fixed amount of joins as a valid answer).

I did consider using the WITH RECURSIVE approach however that will not allow us to

Execute the aggregate function COUNT in the recursive select which seems to be required when we want to calculate the occurrences.
Remove all previous x_ids, only those that are currently in queue.

What if there are 4 Johns and 4 Bobs? Which one should be picked first? — forpas, Aug 05 '21 at 10:31
Good question, in that case we may pick alphabetically, in reality there is no preference between the two. — N Meibergen, Aug 05 '21 at 10:57
@forpas For what it’s worth: it would also be acceptable to return all items that have a tie in the frequency, if that makes the query easier/feasible. — N Meibergen, Aug 05 '21 at 20:58
Why does Frank not get counted? If you didn't remove his name as well as John in the first cut, he would be joint second highest. — Nick, Aug 08 '21 at 12:09

score 0 · Answer 1 · answered Aug 07 '21 at 23:26

I have never used SQLite but recursive queries are available in many DBMSs, unfortunately I do not think they are powerful enough to do what you want. The problem is that you need to incrementally ignore more and more rows if they contain a name already present in the result table. I am not 100% sure, but I believe it is impossible to do this with a recursive CTE.

Considering that SQLite does not provide an extension language allowing loops (Here on SO you can find some way to replicate loops, like with recursive CTE but I am not sure it is possible in your case), one of the possible reasonable choices is partially handling this at application-side.

First of all, I would create a temporary table like this one:
```
CREATE TEMP TABLE visitedName (
  x_id VARCHAR(10),
  is_visited BOOLEAN
);
```
x_id is the same as the x_id column in your table (that I call mytable), is_visited is a boolean specifying whether the x_id must be ignored.

Then you fill visitedName with all the distinct values of x_id:

INSERT into visitedName
SELECT DISTINCT x_id, FALSE
FROM mytable

Now you need to find the most popular name ignoring the rows where is_visited is true:
```
SELECT mt.name_id, COUNT(mt.name_id) name_count
FROM mytable mt JOIN visitedName vn ON mt.x_id = vn.x_id
WHERE NOT vn.is_visited
GROUP BY mt.name_id
ORDER BY name_count DESC
LIMIT 1
```
At application-side you should retrieve the only row returned and extract the name_id, that is passed as parameter to the next query. If it has not returned any rows, then you are done and you can DROP TABLE visitedName.
In the end you update the visitedName table, marking as visited all the x_ids containing at least one occurence of the name_id got before.
```
UPDATE visitedName
SET is_visited = TRUE
WHERE x_id IN (
  SELECT x_id
  FROM mytable
  WHERE name_id = ? -- name_id got before
)
```
Restart from point 3.

If your connection latency is too high or the number of rows you expect in the resulting table are too many, the continuous communication between client-application and DBMS is a considerable overhead. In that case, you need something more advanced.

For those interested in replicating the table given by the OP, here is the script I have used:

CREATE TABLE mytable(
  id integer,
  x_id varchar(10),
  name_id varchar(50),
  PRIMARY KEY (id)
);

INSERT INTO mytable (id, x_id, name_id) VALUES
  (1, 'x1', 'John'),
  (2, 'x1', 'Frank'),
  (3, 'x2', 'John'),
  (4, 'x2', 'Frank'),
  (5, 'x3', 'John'),
  (6, 'x3', 'Jack'),
  (7, 'x4', 'John'),
  (8, 'x4', 'Jack'),
  (9, 'x5', 'Bob'),
  (10, 'x5', 'Frank'),
  (11, 'x6', 'Bob'),
  (12, 'x6', 'George'),
  (13, 'x7', 'Bob'),
  (14, 'x7', 'Finn'),
  (15, 'x8', 'James'),
  (16, 'x8', 'Mark'),
  (17, 'x9', 'James'),
  (18, 'x9', 'Finn');

I appreciate your answer. Indeed once such list becomes large and/or their is connection latency, it might create considerable overhead. I will leave the question unanswered until we have a fully SQLite based answer. — N Meibergen, Aug 20 '21 at 07:52

Recursively count occurrences with sqlite

1 Answers1