
I'm developing a tagging system where each user has a list of tags associated with them, and I'm storing them in a text[] column. Here's an example:

user_id: int4    tags: text[]
-------------    ------------
1                [ 'apple', 'banana', 'carrot', 'jelly' ]
2                [ 'jelly', 'zebra' ]
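
In case it helps, a minimal sketch of the schema (the table name `users` is an assumption; the types and example rows are as above):

    CREATE TABLE users (
        user_id int4 PRIMARY KEY,
        tags    text[] NOT NULL DEFAULT '{}'
    );

    -- The two example rows from above.
    INSERT INTO users (user_id, tags) VALUES
        (1, ARRAY['apple', 'banana', 'carrot', 'jelly']),
        (2, ARRAY['jelly', 'zebra']);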

My server has a route called update-tags that replaces a user's tags with a new list.

Now, I would like to be able to efficiently query for the entire list of distinct tags and the count associated with each. The example above would return:

tag: text      count: int4
---------      -----------
'apple'        1
'banana'       1
'carrot'       1
'jelly'        2
'zebra'        1
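
The straightforward query for this unnests every user's array and aggregates (a sketch against the `users` table above):

    -- Full-table scan: unnest each user's tags and count per tag.
    SELECT t.tag, count(*) AS count
    FROM users u
    CROSS JOIN unnest(u.tags) AS t(tag)
    GROUP BY t.tag
    ORDER BY t.tag;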

Assuming there are 10 million users, each with around 100 tags, and 10,000 distinct tags in total (ballpark numbers), a full-table scan of all users that tallies up all their tags is cost-prohibitive.

Tags for users change very rarely, but they are queried very often. I would like to optimize for read-speed and not write-speed.

Also, the tag counts do not have to be 100% accurate; eventually consistent is fine.

Using triggers seems like the right path. My initial thought was to have a trigger increment / decrement the counts whenever a user's tags column changed, but what if many of the replaced tags were the same? Wouldn't the counts drift off over time? Is there a better way?

Ryan Peschel
  • Sounds like you want triggers when the tags are inserted, deleted, or updated. – Gordon Linoff Feb 14 '20 at 19:46
  • Learn about normalization and do not store values as CSV. – Jens Feb 14 '20 at 19:54
  • @Jens Are you sure about this? I did some research and it looks like a lot of SQL experts are recommending `JSONB / TEXT[]` columns for tagging: http://www.databasesoup.com/2015/01/tag-all-things.html?m=1 – Ryan Peschel Feb 14 '20 at 19:57
  • @Jens saving them as a separate relational table would do nothing to answer the question. It would still take a long time to tally them. – jjanes Feb 14 '20 at 21:03

1 Answer


Also, the tag counts do not have to be 100% accurate; eventually consistent is fine.

I don't think this leeway really gets you anything with PostgreSQL. Unless you set up logical replication, I don't see how it offers any advantage here.

I also think that insert, update, and delete triggers which maintain a side table of (tag pk, count) rows are the right path; a sketch follows below. The disadvantage is that if two sessions try to change the count of the same tag at the same time, one has to block until the other commits. But since you say such changes are very rare, I don't think this would be much of a problem. You could also get deadlocks if you don't update rows of the tag table in a consistent order.
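
A minimal sketch of that approach (the `tag_counts` table and the function name are placeholders, and I'm assuming the `users` table holds the `tags text[]` column):

    CREATE TABLE tag_counts (
        tag   text PRIMARY KEY,
        count int8 NOT NULL DEFAULT 0
    );

    CREATE OR REPLACE FUNCTION sync_tag_counts() RETURNS trigger AS $$
    BEGIN
        -- Decrement the old tags on UPDATE / DELETE.
        IF TG_OP IN ('UPDATE', 'DELETE') THEN
            UPDATE tag_counts tc
            SET count = tc.count - 1
            FROM unnest(OLD.tags) AS t(tag)
            WHERE tc.tag = t.tag;
        END IF;
        -- Increment the new tags on INSERT / UPDATE.
        IF TG_OP IN ('INSERT', 'UPDATE') THEN
            INSERT INTO tag_counts AS tc (tag, count)
            SELECT t.tag, 1
            FROM unnest(NEW.tags) AS t(tag)
            ON CONFLICT (tag) DO UPDATE SET count = tc.count + 1;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    -- PostgreSQL 11+; on older versions use EXECUTE PROCEDURE.
    CREATE TRIGGER users_tag_counts
    AFTER INSERT OR UPDATE OF tags OR DELETE ON users
    FOR EACH ROW EXECUTE FUNCTION sync_tag_counts();

This assumes a user's array never contains duplicate tags (otherwise the ON CONFLICT would try to touch the same row twice in one statement). And per the deadlock caveat above, you would want concurrent statements to touch tag_counts rows in a consistent (e.g. sorted) order.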

but what if many of the replacements were the same? Wouldn't it drift off over time?

I don't see where any drift would come from, if implemented correctly: decrementing every tag in the OLD list and incrementing every tag in the NEW list is a net no-op for any tag that appears in both lists.

jjanes
  • I've heard of HyperLogLog for the "not 100% accurate" aspect, but not sure if it applies here. – Ryan Peschel Feb 15 '20 at 01:13
  • hyperloglog works where you want a count of how many distinct things there are, like `count(distinct thing)`. But you want a count of how many of each distinct thing there are, like `thing, count(*)...group by thing`. So I don't think it would apply here. – jjanes Feb 15 '20 at 15:23