9

In a word game similar to Ruzzle or Letterpress, where users have to construct words out of a given set of letters:

enter image description here

I keep my dictionary in a simple SQL table:

create table good_words (
        word varchar(16) primary key
);

Since the game duration is very short I do not want to check every entered word by calling a PHP script, which would look that word up in the good_words table.

Instead I'd like to download all possible words by one PHP script call before the round starts - since all letters are known.

My question is: if there is a nice SQLish way to find such words?

I.e. I could run a longer-taking script once to add a column to good_words table, which would have same letters as in the word columnt, but sorted alphabetically... But I still can't think of a way to match for it given a set of letters.

And doing the word matching inside of a PHP script (vs. inside the database) would probably take too long (because of bandwidth: would have to fetch every row from the database to the PHP script).

Any suggestions or insights please?

Using postgresql-8.4.13 with CentOS Linux 6.3.

UPDATE:

Other ideas I have:

  1. Create a constantly running script (cronjob or daemon) which would prefill an SQL table with precompiled letters board and possible words - but still feels like a waste of bandwidth and CPU, I would prefer to solve this inside the database
  2. Add integer columns a, b, ... , z and whenever I store a word into good_words, store the letter occurences there. I wonder if it is possible to create an insert trigger in Pl/PgSQL for that?
Alexander Farber
  • 21,519
  • 75
  • 241
  • 416
  • A) that's probably still going to be a *very long list* of words that needs to be downloaded there, b) that gives a technical user a great way to cheat. ;) – deceze Mar 05 '13 at 09:46
  • Actually not: Ruzzle reports number of possible words at the end of rounds and that number rarely exceeds 300. Even with assumed word-length of 10 letters that would be merely 3 kbyte - before gzipping. – Alexander Farber Mar 05 '13 at 10:23
  • Can you upload a CSV dump of `good_words` table somewhere to play with? Or provide another source, please? – vyegorov Mar 05 '13 at 11:00
  • 1
    This will be really inefficient. You should almost certainly cache the wordlist from the database as a compressed file that the client does a simple HTTP GET request for. You can set if-modified HTTP headers to avoid re-fetching it if it hasn't changed. If this wordlist becomes too big, what you do is create a subset of most commonly tried words (not necessarily correct words) that you put in the client cache, and if the client cache doesn't know the word it asks the server about it via a web service request. Client can add the downloaded wordlist to SQLite, BDB, or similar for fast lookup. – Craig Ringer Mar 05 '13 at 11:12
  • +1 for interesting idea about pre-caching common words in the app :-) – Alexander Farber Mar 05 '13 at 12:14

7 Answers7

4

Nice question, I upvoted.

What you're up to is a list of all possible permutations of the given letters of a given length. As described in the PostgreSQL wiki, you can create a function and call it like this (matches highlighted letters in your screenshot):

SELECT * FROM permute('{E,R,O,M}'::text[]);

Now, to query the good_words use something like:

SELECT gw.word, gw.stamp
  FROM good_words gw
  JOIN permute('{E,R,O,M}'::text[]) s(w) ON gw.word=array_to_string(s.w, '');
vyegorov
  • 21,787
  • 7
  • 59
  • 73
  • Like a join with a temp. table generated by that proc? Nice idea! However I think it is better to have the list of good words as *origin* vs. generating a list of all possible letter permutations - many of which won't be valid words... – Alexander Farber Mar 05 '13 at 10:24
2

This could be a start, except that it doesn't check if we have enough letters, only if he have the right letters.

SELECT word from
(select word,generate_series(0,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','h','e','l','e','t','t','e','r','s')
GROUP BY word
HAVING count(*)>=length(word);

http://sqlfiddle.com/#!1/2e3a2/3

EDIT:

This query select only the valid words though it seems a bit redundant. It's not perfect but certainly proves it can be done.

WITH words AS 
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'))

SELECT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM 
 (SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l) 
 as q) 
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
 GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word);

Fiddle with all possible 3+ letters words (485) for that image: http://sqlfiddle.com/#!1/2fc66/1 Fiddle with 699 words out of which 485 are correct: http://sqlfiddle.com/#!1/4f42e/1

Edit 2: We can use array operators like so to get a list of words that contain the letters we want:

SELECT word as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <@ ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'];

So we can use it to narrow down the list of words we need to check.

WITH words AS 
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from 
(
  SELECT word from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <@ ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s']
)as q) as q)
SELECT DISTINCT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM 
 (SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l) 
 as q) 
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
 GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word) ORDER BY w.word;

http://sqlfiddle.com/#!1/4f42e/44

We can use GIN indexes to work on arrays so we probably could create a table that would store the arrays of letters and make words point to it (act, cat and tact would all point to array [a,c,t]) so probably that would speed things up but that's up for testing.

Jakub Kania
  • 15,665
  • 2
  • 37
  • 47
  • Wow, trying to understand this by testing snippets in the SQL Fiddle... The SQL statement `with words as (select ....)` - does this create a temporary table called `words`? And uses it in a `join`? – Alexander Farber Mar 05 '13 at 14:08
  • 1
    @AlexanderFarber Yes it does. It's a CTE (http://www.postgresql.org/docs/8.4/static/queries-with.html). – Jakub Kania Mar 05 '13 at 14:21
1

Create a table that has entries (id, char), be n the number of characters you are querying for.

select id, count(char) AS count from chartable where (char = x or char = y or char = z ...) and count = n group by id;

OR (for partial matching)

select id, count(char) AS count from chartable where (char = x or char = y or char = z ...) group by id order by count;

The result of that query has all the word-id's that fit the specifications. Cache the result in a HashSet and simple do a lookup whenever a word is entered.

Sebastian van Wickern
  • 1,699
  • 3
  • 15
  • 31
1

Does not work in 8.4. Probably 9.1+ only. SQL Fidlle

select word
from (
    select unnest(string_to_array(word, null)) c, word from good_words
    intersect all
    select unnest(string_to_array('TESTREROREMASDSS', null)) c, word from good_words
) s
group by word
having
    array_agg(c order by c) = 
    (select array_agg(c order by c) from unnest(string_to_array(word, null)) a(c))
Clodoaldo Neto
  • 118,695
  • 26
  • 233
  • 260
1

You can add the column with sorterd letters formatted like '%a%c%t%'. Then use query:

 select * from table where 'abcttx' like sorted_letters

to find words that can be built from letters 'abcttx'. I don't know about performance, but simplicity probably can't be beaten :)

maniek
  • 7,087
  • 2
  • 20
  • 43
  • It would catch `act` but not `cta`. Try `select 'abcttx' like '%c%t%a%'` – Clodoaldo Neto Mar 05 '13 at 19:50
  • 1
    @ClodoaldoNeto yes, that's why You must sort the letters before storing them in the column (so You never store '%c%t%a' there). Also the letters in the query should be sorted – maniek Mar 05 '13 at 19:53
  • It is very simple when the word is already sorted. Did you try to sort a string in 8.4? And insert the `%`? I tried and I had to build a not that trivial plpgsql function. – Clodoaldo Neto Mar 06 '13 at 00:36
  • 1
    @ClodoaldoNeto a small hack: `select '%'||array_to_string(array(select regexp_split_to_table('cat','.*?') order by 1), '%')||'%'` – maniek Mar 06 '13 at 01:26
  • You should include that in your answer as the OP probably can't figure it out. About the performance the only thing I can guess is that a `like` expression starting with a `%` is bad. Only the OP can benchmark it all. I'm doing all these comments on your answer just because it caught my attention and indeed I like it very much. I just don't vote it up as my policy is not to vote on answers where I also answered. – Clodoaldo Neto Mar 06 '13 at 12:10
1

Here is a query that finds the answers that can be found by walking through adjacent fields.

with recursive
input as (select '{{"t","e","s","e"},{"r","e","r","o"},{"r","e","m","a"},{"s","d","s","s"}}'::text[] as inp),
dxdy as(select * from (values(-1,-1),(-1,0),(-1,1),(0,1),(0,-1),(1,-1),(1,0),(1,1)) as v(dx, dy)),
start_position as(select * from generate_series(1,4) x, generate_series(1,4) y),
work as(select x,y,inp[y][x] as word from start_position, input
union
select w.x + dx, w.y + dy, w.word || inp[w.y+dy][w.x+dx]   
   from dxdy cross join input cross join work w 
   inner join good_words gw on gw.word like w.word || '%'
)
select distinct word from work
where exists(select * from good_words gw where gw.word = work.word)

(other answers don't take this into account).

Sql fiddle link: http://sqlfiddle.com/#!1/013cc/14 (notice You need an index with varchar_pattern_ops for the query to be reasonably fast).

maniek
  • 7,087
  • 2
  • 20
  • 43
  • +1 Thank you! But I think it has same problem as vyegorov's suggestion: first you generate all possible letter combinations (which are a lot, esp. for larger boards - and many of them are not valid) and then match the `good_words`. It seems to me more effective to start from the other end: go through the `good_words` and (somehow, which is the subject of my question) try to match the letters in the board. – Alexander Farber Mar 05 '13 at 22:22
  • 1
    Notice there is pruning in that it discards generated words for which there is no prefix in good_words. But I just tried it on a real word list, and it is very slow, so not really usable. See my other answer :) – maniek Mar 05 '13 at 23:13
0

My own solution is to create an insert trigger, which writes letter frequencies into an array column:

create table good_words (
        word varchar(16) primary key,
        letters integer[26]
);

create or replace function count_letters() returns trigger as $body$
    declare
        alphabet varchar[];
        i integer;
    begin

        alphabet := regexp_split_to_array('abcdefghijklmnopqrstuvwxyz', '');
        new.word := lower(new.word);

        for i in 1 .. array_length(alphabet, 1)
        loop
                -- raise notice '%: %', i, alphabet[i];
                new.letters[i] := length(new.word) - length(replace(new.word, alphabet[i], ''));
        end loop;
        return new;
    end;
$body$ language plpgsql;

create trigger count_letters
    before insert on good_words
    for each row execute procedure count_letters();

Then I generate similar array for the random board string tesereroremasdss and compare both arrays using the array contains operator @>

Any new ideas or improvements are always welcome!

Alexander Farber
  • 21,519
  • 75
  • 241
  • 416