4

Consider the following example table

x_id name_id1 name_id2
x1 John Frank
x2 Frank John
x3 Jack John
x4 John Jack
x5 Bob Frank
x6 George Bob
x7 Bob Finn
x8 Mark James
x9 James Finn

The goal is to extract the following result

name frequency
John 4
Bob 3
James 2

Conceptually this is the result of the following procedure:

  1. Count the frequency of all names and pick the one that occurs most often, that would be John, which occurs 4 times (row x1 to x4).
  2. Now remove all rows that contain John, which leaves you with row x5 to x9. Again determine the name that occurs most frequently. This gives you Bob, which occurs 3 times (row x5 to x7).
  3. Now also remove row x5 to x7, which leaves us with row x8 to x9. Again determine the name that occurs most frequently. This gives us James which occurs 2 times.
  4. Now also remove row x8 to x9, which leaves us with nothing, so we are done.

This data is stored in SQLite in a junction table as follows (in the real world case there can be more than two names per x_id)

id x_id name_id
1 x1 John
2 x1 Frank
3 x2 John
4 x2 Frank
5 x3 John
6 x3 Jack
7 x4 John
8 x4 Jack
9 x5 Bob
10 x5 Frank
11 x6 Bob
12 x6 George
13 x7 Bob
14 x7 Finn
13 x8 James
14 x8 Mark
13 x9 James
14 x9 Finn

What kind of procedure do we need to retrieve the described result? Considering that the above junction table is of variable length (just to make sure we do not come up with a fixed amount of joins as a valid answer).

I did consider using the WITH RECURSIVE approach however that will not allow us to

  1. Execute the aggregate function COUNT in the recursive select which seems to be required when we want to calculate the occurrences.
  2. Remove all previous x_ids, only those that are currently in queue.
N Meibergen
  • 362
  • 2
  • 14

1 Answers1

0

I have never used SQLite but recursive queries are available in many DBMSs, unfortunately I do not think they are powerful enough to do what you want. The problem is that you need to incrementally ignore more and more rows if they contain a name already present in the result table. I am not 100% sure, but I believe it is impossible to do this with a recursive CTE.


Considering that SQLite does not provide an extension language allowing loops (Here on SO you can find some way to replicate loops, like with recursive CTE but I am not sure it is possible in your case), one of the possible reasonable choices is partially handling this at application-side.

  1. First of all, I would create a temporary table like this one:

    CREATE TEMP TABLE visitedName (
      x_id VARCHAR(10),
      is_visited BOOLEAN
    );
    

    x_id is the same as the x_id column in your table (that I call mytable), is_visited is a boolean specifying whether the x_id must be ignored.

  2. Then you fill visitedName with all the distinct values of x_id:

    INSERT into visitedName
    SELECT DISTINCT x_id, FALSE
    FROM mytable
    
  3. Now you need to find the most popular name ignoring the rows where is_visited is true:

    SELECT mt.name_id, COUNT(mt.name_id) name_count
    FROM mytable mt JOIN visitedName vn ON mt.x_id = vn.x_id
    WHERE NOT vn.is_visited
    GROUP BY mt.name_id
    ORDER BY name_count DESC
    LIMIT 1
    

    At application-side you should retrieve the only row returned and extract the name_id, that is passed as parameter to the next query. If it has not returned any rows, then you are done and you can DROP TABLE visitedName.

  4. In the end you update the visitedName table, marking as visited all the x_ids containing at least one occurence of the name_id got before.

    UPDATE visitedName
    SET is_visited = TRUE
    WHERE x_id IN (
      SELECT x_id
      FROM mytable
      WHERE name_id = ? -- name_id got before
    )
    
  5. Restart from point 3.

If your connection latency is too high or the number of rows you expect in the resulting table are too many, the continuous communication between client-application and DBMS is a considerable overhead. In that case, you need something more advanced.


For those interested in replicating the table given by the OP, here is the script I have used:

CREATE TABLE mytable(
  id integer,
  x_id varchar(10),
  name_id varchar(50),
  PRIMARY KEY (id)
);

INSERT INTO mytable (id, x_id, name_id) VALUES
  (1, 'x1', 'John'),
  (2, 'x1', 'Frank'),
  (3, 'x2', 'John'),
  (4, 'x2', 'Frank'),
  (5, 'x3', 'John'),
  (6, 'x3', 'Jack'),
  (7, 'x4', 'John'),
  (8, 'x4', 'Jack'),
  (9, 'x5', 'Bob'),
  (10, 'x5', 'Frank'),
  (11, 'x6', 'Bob'),
  (12, 'x6', 'George'),
  (13, 'x7', 'Bob'),
  (14, 'x7', 'Finn'),
  (15, 'x8', 'James'),
  (16, 'x8', 'Mark'),
  (17, 'x9', 'James'),
  (18, 'x9', 'Finn');
Marco Luzzara
  • 5,540
  • 3
  • 16
  • 42
  • I appreciate your answer. Indeed once such list becomes large and/or their is connection latency, it might create considerable overhead. I will leave the question unanswered until we have a fully SQLite based answer. – N Meibergen Aug 20 '21 at 07:52