I'm developing a tagging system where each user has a list of tags associated with them, and I'm storing them in a text[]
column. Here's an example:
user_id: int4   tags: text[]
-------------   -----------------------------------------
1               [ 'apple', 'banana', 'carrot', 'jelly' ]
2               [ 'jelly', 'zebra' ]
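For reference, the table looks roughly like this (the table name, primary key, NOT NULL, and default are my additions for completeness; only user_id and tags come from the actual schema):

CREATE TABLE users (
    user_id int4   PRIMARY KEY,
    tags    text[] NOT NULL DEFAULT '{}'
);

-- sample rows from the example above
INSERT INTO users (user_id, tags) VALUES
    (1, ARRAY['apple', 'banana', 'carrot', 'jelly']),
    (2, ARRAY['jelly', 'zebra']);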
My server has a route called update-tags
that replaces a user's tags with a new list.
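All it does is overwrite the whole array in a single statement, roughly like this ($1 and $2 stand for the route's bound parameters):

UPDATE users
SET    tags = $2   -- the complete replacement list, e.g. ARRAY['apple', 'kiwi']
WHERE  user_id = $1;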
Now, I would like to efficiently query for the full list of distinct tags along with how many users have each one. The example above would return:
tag: text   count: int4
---------   -----------
'apple'     1
'banana'    1
'carrot'    1
'jelly'     2
'zebra'     1
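The only way I currently know to produce that result is an unnest-and-group query along these lines:

SELECT t.tag, count(*)::int4 AS count
FROM   users, unnest(tags) AS t(tag)   -- implicit LATERAL: one row per (user, tag) pair
GROUP  BY t.tag
ORDER  BY t.tag;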
With around 10 million users, roughly 100 tags per user, and about 10,000 distinct tags in total (ballpark numbers), running that kind of full-table scan and de-duplicating every user's tags on every read is cost-prohibitive.
Tags for users change very rarely but are queried very often, so I would like to optimize for read speed rather than write speed.
The tag counts also do not have to be 100% accurate; eventual consistency is fine.
Using triggers seems like the right path. My initial thought was to have a trigger increment / decrement the counts whenever a user's tags
column changes, but what if many of the tags in the replacement list are the same as the old ones? Wouldn't the counts drift over time? Is there a better way?
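To make the question concrete, here is roughly the trigger I was picturing (the tag_counts table, function, and trigger names are placeholders I made up, nothing that exists yet):

CREATE TABLE tag_counts (
    tag   text PRIMARY KEY,
    count int4 NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_tag_counts() RETURNS trigger AS $$
BEGIN
    -- decrement each tag that was in the old list
    UPDATE tag_counts tc
    SET    count = tc.count - 1
    FROM   unnest(OLD.tags) AS o(tag)
    WHERE  tc.tag = o.tag;

    -- increment each tag in the new list, inserting tags never seen before
    INSERT INTO tag_counts (tag, count)
    SELECT DISTINCT n.tag, 1
    FROM   unnest(NEW.tags) AS n(tag)
    ON CONFLICT (tag) DO UPDATE SET count = tag_counts.count + 1;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_tags_bump_counts
AFTER UPDATE OF tags ON users
FOR EACH ROW
EXECUTE FUNCTION bump_tag_counts();

This only handles UPDATE, since the update-tags route only ever replaces an existing user's list, and it is the version whose counts I worry would drift.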