efficient storage of a chess position

Question

I've read tons of web hits related to this issue, and I still haven't come across any definitive answer.

What I'd like to do is to make a database of chess positions, capable of identifying transpositions (generally which pieces are on which squares).

EDIT: it should also be capable of identifying similar (but not exactly identical) positions.

This is a discussion from almost 20 years ago (when space was an issue): https://groups.google.com/forum/#!topic/rec.games.chess.computer/wVyS3tftZAA

One of the participants talks about encoding pieces on a square matrix, using 4 x 64 bits plus a few more bits for the additional information (castling, en passant, etc.): there are six piece types (Pawn, Rook, Knight, Bishop, Queen, King) plus an empty square, which takes 3 bits (2^3 = 8 values), and one more bit for the color of the piece.

In total, there would be 4 numbers of 64 bits each, plus some additional information.

Question: is there any other, more efficient way of storing a chess position?

I should probably mention this question is database centric, not game centric (i.e. my sole interest is to efficiently store and retrieve, not to create any AI or to generate any moves).

Thanks, Adrian

jlahd · Answer 1 · 2014-01-29T08:57:25.080

6

There are 32 pieces on the board, and 64 squares. Square index can be represented with a 6-bit number, so to represent the locations of each piece you need 32 six-bit numbers, or a total of 192 bits, which is less than 4x64.

You can do a bit better by realizing that not all positions are possible (e.g. a pawn cannot reach the end row of its own color) and using less than six bits for the position in these cases. Also, a position already occupied by another piece makes that position unavailable for other pieces.

As a piece may also be missing from the board altogether, you should start with the kings' positions, as they are always there. Encoding another piece's position as the same square as a king then means that the piece has been taken.

Edit:

A short analysis of the pieces' possible positions:

Kings, queens, knights and rooks can be anywhere on the board (64 positions)
Bishops are restricted to 32 positions each
Pawns are restricted to 21, 26, 30, 32, 32, 30, 26, and 21 positions (columns A-H).

Thus, this set of legal chess positions can be described trivially with an integer from zero up to (64^12 * 32^4 * 21^4 * 26^4 * 30^4 * 32^8)-1, or 391935874857773690005106949814449284944862535808450559999, which fits into 188 bits. Encoding and decoding a position to and from this is very straightforward - however, there are multiple numbers that decode into the same position (e.g. white knight 1 at B1 and white knight 2 at G1; and white knight 1 at G1 and white knight 2 at B1).

Due to the fact that no two pieces can occupy the same square, there is a tighter limit but it is a bit difficult to both encode and decode, so probably not useful in a real application. Also, the number shown above is pretty close to 2^188, so I don't think even this tighter encoding would fit into 187 bits.
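The "integer from zero up to N-1" idea above is a mixed-radix number: each piece contributes one digit whose base is the number of squares it may stand on. Here is a minimal Python sketch of that, where the names `RADICES`, `encode` and `decode` are hypothetical and each digit is assumed to be an index into that piece's list of allowed squares. As a side observation, the per-piece counts listed above multiply out to a 168-bit number, while the formula exactly as written (with both 32^4 and 32^8) gives the 188-bit figure.

```python
from math import prod

# One radix per piece, in a fixed order (an assumption of this sketch):
# 12 kings/queens/knights/rooks, 4 bishops, then 8 pawns for each color.
PAWN_RADICES = [21, 26, 30, 32, 32, 30, 26, 21]  # columns A-H
RADICES = [64] * 12 + [32] * 4 + PAWN_RADICES * 2


def encode(digits, radices=RADICES):
    """Fold per-piece square indices into a single integer."""
    n = 0
    for digit, radix in zip(digits, radices):
        if not 0 <= digit < radix:
            raise ValueError("digit out of range for its radix")
        n = n * radix + digit
    return n


def decode(n, radices=RADICES):
    """Recover the per-piece square indices from the integer."""
    digits = []
    for radix in reversed(radices):
        n, digit = divmod(n, radix)
        digits.append(digit)
    return digits[::-1]


# Bits needed for the radices listed above.
listed_bits = (prod(RADICES) - 1).bit_length()

# Bits needed for the formula exactly as written in the text.
written = 64**12 * 32**4 * 21**4 * 26**4 * 30**4 * 32**8
written_bits = (written - 1).bit_length()

print(listed_bits, written_bits)
```

Encoding and decoding are a single loop each, which matches the claim that this scheme is straightforward; the cost is arithmetic on integers wider than a machine word.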

edited Jan 29 '14 at 08:57

answered Jan 27 '14 at 07:18


Hm... that might be. Should I understand that piece-centric storage is always more efficient than board-centric? – Adrian Jan 27 '14 at 12:05
Could be - cannot prove that, though. The encoding described in the latest edit seems pretty tight to me, at least - that is, not much redundant information stored. – jlahd Jan 29 '14 at 08:58
The efficiency of piece-centric versus board-centric encoding depends on how many pieces are in play versus the board size. Chess has 32 pieces and the board has 64 locations. If I am not mistaken, piece-centric encoding is cheaper as long as this holds: SIZE * log PIECES > PIECES * log SIZE – Panu Jul 12 '14 at 18:13
Each of the 64 squares can only hold one of 13 possible values (not 32): black's "rhbqkp", white's "RHBQKP", or an empty square, each of which can be represented either as that character or as a bit value of at most "1100" (with 0 not yet included). A couple of exclusions were mentioned above for pawns. The bishop restriction does not matter, because any square can be occupied by a bishop: although each bishop is restricted to 32 squares, the two bishops together can occupy any square. – Ace Caserya Nov 10 '15 at 03:01
@AlvinCaseria: I am encoding the *positions* of the distinct *pieces*, not the pieces on each position - that's the whole point of my post. If you encode the 13 possible values for 48 squares and 12 for the first and last rows, you end up with 12^16*13^48 combinations, which is about 10^14 times what you get with the encoding I presented. – jlahd Nov 10 '15 at 11:30
If you are storing positions for a chess database (from the OP), you need to store whose turn it is, castling rights, and en passant file. Using a piece-centric representation you can store this info as "alternative pieces" instead of separately although I haven't calculated if this is better. – qwr Sep 11 '22 at 02:22

score 4 · Answer 2 · answered Jul 12 '14 at 17:53

If you do not need a decodable position representation for comparisons, then you could look at Zobrist hashing. This is used by chess engines to produce a 64-bit one-way hash of a position for spotting transpositions in search trees. As it is a one-way hash, you obviously cannot recover the position from the hash. The size of the hash is tunable, but 64 bits seems to be the accepted minimum size that results in few collisions. It would be ideal as a database index key with a fixed length of just 8 bytes. As collisions, though infrequent, are possible, you could do a second pass comparing the actual positions to filter out any positions that have hashed to the same value, if that is a concern. I use Zobrist hashes in one of my own applications (using SQLite) that I use to manage my openings, and it has no trouble finding transpositions.
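For readers who have not met Zobrist hashing, here is a minimal Python sketch of the idea: one random 64-bit key per (piece, square) pair, XORed together for every piece on the board. The names `make_keys` and `zobrist`, the fixed seed, and the dict-based board are assumptions made for illustration; a real engine would also mix in keys for castling rights and the en passant file.

```python
import random

PIECES = "PNBRQKpnbrqk"  # FEN letters: upper case white, lower case black


def make_keys(seed=2014):
    """Build the random key table; a fixed seed keeps hashes reproducible."""
    rng = random.Random(seed)
    table = {(p, sq): rng.getrandbits(64) for p in PIECES for sq in range(64)}
    side_key = rng.getrandbits(64)  # mixed in when Black is to move
    return table, side_key


def zobrist(board, white_to_move, table, side_key):
    """board maps square index (0 = a1 ... 63 = h8) to a piece letter."""
    h = 0
    for sq, piece in board.items():
        h ^= table[(piece, sq)]
    if not white_to_move:
        h ^= side_key
    return h


table, side_key = make_keys()

# The same position built in two different orders hashes identically,
# which is what makes transpositions easy to spot.
first = {4: "K", 60: "k", 12: "P", 52: "p"}
second = {52: "p", 12: "P", 60: "k", 4: "K"}
h1 = zobrist(first, True, table, side_key)
h2 = zobrist(second, True, table, side_key)

# Moving the pawn e2-e4 can be applied incrementally with two XORs.
moved = dict(first)
del moved[12]
moved[28] = "P"
h_incremental = h1 ^ table[("P", 12)] ^ table[("P", 28)] ^ side_key
h_recomputed = zobrist(moved, False, table, side_key)
```

The resulting 64-bit integer could serve as the 8-byte index key mentioned above, with the full position stored alongside it to resolve the rare collision.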

Hashing would be close to perfect if only transpositions would be needed... but yes indeed the position needs to be decoded. Similar, but not exactly identical position searching would be nice to have, but it seems harder and harder to achieve. — Adrian, Jul 13 '14 at 21:24

score 3 · Answer 3 · answered Jul 12 '14 at 18:15

Take a look at the Forsyth–Edwards Notation (FEN). It is described here. It is also well known and supported by many engines and chess programs.

Here is the FEN for the starting position:

rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1

FEN is separated into 6 segments.

Segment 1 contains the piece placement, rank by rank from rank 8 down to rank 1, with ranks separated by "/" and digits counting consecutive empty squares. Black pieces are in lower case, white pieces are in upper case.

Segment 2 states whose turn it is (w or b).

Segment 3 is for castling. KQkq means both sides can castle on both wings. K = white king side, q = black queen side.

Segment 4: En passant target square in algebraic notation. If there's no en passant target square, this is "-". If a pawn has just made a two-square move, this is the square "behind" the pawn. This is recorded regardless of whether there is a pawn in position to make an en passant capture.

Segment 5 Halfmove clock: This is the number of halfmoves since the last capture or pawn advance. This is used to determine if a draw can be claimed under the fifty-move rule.

Segment 6 Fullmove number: The number of the full move. It starts at 1, and is incremented after Black's move.
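As an illustration of the six segments, here is a small Python sketch that splits a FEN string and expands segment 1 into an 8x8 grid. The function name `parse_fen` and the returned dictionary layout are hypothetical, and no validation of the input is attempted.

```python
def parse_fen(fen):
    """Split a FEN string into its six segments and expand the board."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()

    board = []  # board[0] is rank 8, board[7] is rank 1
    for rank in placement.split("/"):
        row = []
        for ch in rank:
            if ch.isdigit():
                row.extend("." * int(ch))  # a digit is a run of empty squares
            else:
                row.append(ch)
        board.append(row)

    return {
        "board": board,
        "side": side,
        "castling": castling,
        "en_passant": None if en_passant == "-" else en_passant,
        "halfmove": int(halfmove),
        "fullmove": int(fullmove),
    }


start = parse_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
for row in start["board"]:
    print("".join(row))
```

Note that segment 1 is itself a simple run-length encoding of the empty squares, which is why FEN strings vary in length from position to position.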

I know of course about FEN, but this surely isn't an efficient way to store a position. Imagine what it would cost to search a database of tens of millions of FEN positions for a specific similar (but not exactly identical) position...! — Adrian, Jul 13 '14 at 20:59
I actually like the FEN concept, except for segment 1. Revised: Segment 1 (8*8*13) + Seg2 (2) + Seg3 (15 or 4 bits) + Seg4 (8*8 indicating the en passant pawn position) + Seg5and6 (100 or more, counting a half-move as 1 value) — Ace Caserya, Nov 10 '15 at 03:02

Gaetan · Answer 4 · 2021-07-22T18:25:47.187

3

Quick in-house suggestion in 22 bytes (inspired from Huffman encoding). Not trivial to decode/encode, but not difficult either. Serious programs probably have better tricks.

Initially we have 32 empty squares, and for each color 8P, 2k, 2B, 2R, 1K, 1Q (here k = knight and K = king);
A Huffman encoding for those will use 1, 3, 5, 5, 5, 6, 6 bits. For instance empty=0, white P=100, k=10100, B=10101, R=10110, K=101110, Q=101111; black is the same as white but starting with 11 instead of 10;
So our 64 symbols (1 per square) will use: 32 + 2 x (8x3 + 5x2 + 5x2 + 5x2 + 6x1 + 6x1) = 164 bits.

Now for the special cases:

1 bit for the color at play;
castling: to code a rook that can still castle, let's re-use the pawn symbol (no confusion: no pawn can be on the first rank). That's 3 bits instead of 5, so we do not lose anything here!
en passant: just mention the column, if any. That's 7 choices or less. So 3 bits (000 for no en passant, anything else to specify a column);
promotion: for each one, roughly 2 pawns vanish to be replaced with a free square and a queen, so (-3 - 3 + 1 + 6) = 1 more bit in the worst case;

Hence without promotion: 164 + 1 + 3 = 168 bits = 21 bytes; With provision for the very hypothetical case of 8 queen promotions: 22 bytes;

That's the best I can do for an in-house solution.
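The 164-bit count for the starting position can be checked with a short Python sketch of the code table above. It is only an illustration: it uses FEN letters (N for knight rather than k), lists squares in FEN order, and leaves out the side-to-move, castling and en passant bits; the names `CODES`, `encode` and `decode` are hypothetical.

```python
# Code table from the text: empty=0, white codes start with 10.
CODES = {
    ".": "0",
    "P": "100",
    "N": "10100",
    "B": "10101",
    "R": "10110",
    "K": "101110",
    "Q": "101111",
}
# Black uses the same codes with the leading "10" replaced by "11".
for letter in "PNBRKQ":
    CODES[letter.lower()] = "11" + CODES[letter][2:]

DECODE = {bits: symbol for symbol, bits in CODES.items()}


def encode(squares):
    """squares is a 64-character string, one symbol per square."""
    return "".join(CODES[ch] for ch in squares)


def decode(bits):
    """Walk the bit string; the codes are prefix-free, so this is unambiguous."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in DECODE:
            out.append(DECODE[current])
            current = ""
    return "".join(out)


START = "rnbqkbnr" + "p" * 8 + "." * 32 + "P" * 8 + "RNBQKBNR"
encoded = encode(START)
print(len(encoded))  # 164
```

Decoding has to proceed bit by bit from the start, which is the price of the variable-length code: there is no way to jump straight to a given square.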

edited Jul 22 '21 at 18:25

answered Jun 26 '21 at 15:37


Thanks Gaetan, one minor comment: there may be two en-passant capturing pawns, for instance if black has two pawns on b4 and d4, and white moves c2-c4, then both black pawns may capture en passant. Otherwise, there is a tough trade-off between compact storage and ease of decoding. Such a database should search for identical (or similar) setups in potentially billions of stored positions. The less time it takes for a position to be identified, the better, but in that case the storage space might be larger. Not easy... – Adrian Jun 29 '21 at 23:26
Hello Adrian, congrats for still commenting 7 years later. The en-passant pawn to be taken is alone, so 3 bits are still enough: the rest of the position tells whether 1 or 2 enemies can take it. For backstory: I was stuck at 24 bytes some years ago. So last month when the idea about castling came up unlooked-for, I was happy about the «saved» bits, became curious about how things were in the real world of chess programming, found this old question and left the description just in case it interested anyone (I am not programming anything chess-related right now). – Gaetan Jul 12 '21 at 10:42
Oh, this keeps coming back to me from time to time, I haven't actually started to code but it's an interesting problem. The thing is, such a storage needs to satisfy two things at once: 1) easy to decode in order to minimize search time, and 2) fit into a multiple of 64 bits. Your proposal fits into three 64-bit integers (which is awesome), but Huffman encoding does just that: it encodes, and de-encoding considerably increases search time if presented with potentially billions of stored positions. It has to be both minimal and fast, and lately space seems to be cheaper than speed. – Adrian Jul 13 '21 at 07:34
I have just thought that combining KP's idea with mine gets us down to 24 bytes (8 for occupied squares + 32 half-bytes for the pieces) with no variable-length encoding whatsoever. For the normal pieces (P,R,k,B,Q,K) of each color we can use 2x6=12 values out of 16 possible half-byte values. The 4 remaining values can be the king at play or an en-passant pawn of any color. And a rook available for castling can still use the pawn value. (editing and over-editing comments: I will just let that one be the last version) – Gaetan Jul 14 '21 at 21:35
I don't quite follow, Gaetan. There are 32 pieces on the board, which means 2 * 16 half-byte values for both colors. That makes 32 * 4 = 128 bits, or two 64-bit integers. Whether storing a rook as a pawn, or as a plain rook, it still takes 4 bits per piece... so you made me curious: where do the 12 values come from, again? I understand there are 6 unique pieces for each color, but they still have to be stored in 16 different squares, isn't it? – Adrian Jul 15 '21 at 13:48
Thinking more about this, encoding using 32*4 bits or encoding using your 1, 3, 5, 5, 5, 6, 6 bits is still encoding. Your proposal allows for exact matching (identical positions). However, looking for "similar" positions assumes de-encoding each stored position and that would be really slow. For maximal speed I think we need 34 columns in the database, totaling 41 bytes: one 64-bit integer (that is, 8 bytes) to specify empty or not empty squares, 32 separate bytes for the pieces, and 1 byte for the information about turn and ply-count. That needs no de-encoding whatsoever, just fast querying. – Adrian Jul 15 '21 at 14:04
About your comment from July 15th: As in KP's idea we use first a 64-bit integer to tell where the pieces are, then at most 32 half-byte to tell what they are. A normal piece is one of 12 "values" : white P,R,k,B,Q,K or black P,R,k,B,Q,K, with repetitions for pieces of the same kind. For the special cases (turn, en-passant) I have the 16-12=4 unused values of a half-word; The kind of request you want seems to fit a database better. So the discussion moves to efficient storage of indexed databases, which is a huge (much more general) subject. – Gaetan Jul 22 '21 at 18:21
Yes, that is entirely correct. The original post does mention the interest is "database centric". That is, efficiently store and retrieve positions (from a database). – Adrian Jul 23 '21 at 18:42

KP Thomassen · Answer 5 · 2021-06-22T11:36:42.747

Compact and reasonably understandable, worst case (or always if that is convenient) 25 bytes for a complete position specification:

64 bits to identify which squares are empty and which aren't. The number of 'on' bits defines how many pieces (of the next max. 32) are actually encoded next.
Use up to 32 × 4 bits to identify each piece (including color). They are listed in board order. Castling rights and en passant information can also be encoded into this (a rook that can still castle gets a different encoding than one that can't, and a pawn that just moved up 2 squares gets its own encoding). Three of the four bits give the piece, the fourth gives its color:

000 regular pawn; 001 pawn that just moved up two squares; 010 knight; 011 bishop; 100 rook that has never moved, nor has its king; 101 rook that has moved, or its king; 110 queen; 111 king.

Additional information needs 1 byte:

whose turn it is: 1 bit,
ply count since last pawn move or capture (up to 100): 7 bits
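The layout above (occupancy bitboard, then one nibble per occupied square, then one extra byte) can be sketched in Python as follows. This is an illustration under assumptions: the color is taken to be the high bit of each nibble, squares are numbered 0 = a1 to 63 = h8, and the names `pack` and `unpack` are hypothetical.

```python
def pack(board, black_to_move, ply_count):
    """board maps square index -> (is_black, 3-bit piece code)."""
    occupancy = 0
    nibbles = 0
    count = 0
    for sq in range(64):  # board order, so no square index is stored per piece
        if sq in board:
            is_black, code = board[sq]
            occupancy |= 1 << sq
            nibbles |= ((is_black << 3) | code) << (4 * count)
            count += 1
    extra = (black_to_move << 7) | ply_count  # 1 bit turn + 7 bits ply count
    return occupancy, nibbles, extra


def unpack(occupancy, nibbles, extra):
    board = {}
    count = 0
    for sq in range(64):
        if (occupancy >> sq) & 1:
            nibble = (nibbles >> (4 * count)) & 0xF
            board[sq] = (nibble >> 3, nibble & 0b111)
            count += 1
    return board, extra >> 7, extra & 0x7F


# Starting position, using the 3-bit codes from the list above:
# 0 pawn, 2 knight, 3 bishop, 4 unmoved rook, 6 queen, 7 king.
BACK_RANK = [4, 2, 3, 6, 7, 3, 2, 4]
start = {}
for file, code in enumerate(BACK_RANK):
    start[file] = (0, code)        # white pieces on rank 1
    start[8 + file] = (0, 0)       # white pawns on rank 2
    start[48 + file] = (1, 0)      # black pawns on rank 7
    start[56 + file] = (1, code)   # black pieces on rank 8

occupancy, nibbles, extra = pack(start, 0, 0)
print(hex(occupancy))
```

The occupancy word alone already supports fast bitwise queries such as "is e4 occupied", while identifying the piece on a square requires counting the set bits below it to find the right nibble.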

Thanks KP, this seems to be close to a minimum, although I'm not sure how you've taken into account the en-passant pawns. Regardless of that, on 64 bit architectures your 200 bits (25 bytes) uses 3.125 integers. In the database however, if not possible to store using 3 * 64 = 192 bits, I would be forced to use 4 * 64 = 256 bits. This seems to be the required amount of space needed, with plenty of bits left to store whatever additional information. — Adrian, Jun 24 '21 at 19:52

score 1 · Answer 6 · answered Jan 27 '14 at 07:42

You could use a modified run length encoding where each piece is encoded as a piece number (3 bits), with 0y111 used to skip ahead spaces. As there are many situations where pieces are next to each other, you end up omitting the positional information:

         All pieces are followed by color bit
0y000c 0 Pawn
0y001c 1 Rook
0y010c 2 Knight
0y011c 3 Bishop
0y100c 4 Queen
0y101c 5 King
0y110 6 Empty space
0y111 7 Repeat next symbol (count is next 6 bits, then symbol)

The decoder starts off at a1 and proceeds to the right, moving up at the end of a row, so the encoding for a starting board would be:

12345321      Literal white encoding from a1 to h1    32 bits
7 8 0         repeat white pawn 8 times               13 bits
7 32 6        repeat 32 empty spaces                  12 bits
7 8 8         repeat black pawn 8 times               13 bits
9abcdba9      Literal encoding of black               32 bits
                                                    ---------
                                                     102 bits total

That being said, the additional complexity and uncertainty of a variable length encoding is probably not worth the space savings. Further, it may be worse than a constant width format in certain positions.
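The 102-bit total for the starting board can be reproduced with a small Python bit counter. This is a sketch under assumptions: a piece symbol costs 4 bits (3-bit type plus color bit), an empty square costs 3 bits, a repeat costs 3 + 6 bits plus the repeated symbol, and a repeat is only used when it is cheaper than writing the run out literally. The name `rle_bits` is hypothetical.

```python
def symbol_bits(symbol):
    """Empty squares use the 3-bit code; pieces use 3 bits plus a color bit."""
    return 3 if symbol == "." else 4


def rle_bits(squares):
    """Count the bits needed to encode 64 squares listed from a1 to h8."""
    total = 0
    i = 0
    while i < len(squares):
        # Find the run starting at i; the 6-bit count caps a run at 63.
        j = i
        while j < len(squares) and squares[j] == squares[i] and j - i < 63:
            j += 1
        run = j - i
        width = symbol_bits(squares[i])
        literal_cost = run * width
        repeat_cost = 3 + 6 + width  # repeat marker, count, then the symbol
        total += min(literal_cost, repeat_cost)
        i = j
    return total


# FEN letters are used for the squares here; a1 comes first.
START = "RNBQKBNR" + "P" * 8 + "." * 32 + "p" * 8 + "rnbqkbnr"
start_bits = rle_bits(START)
print(start_bits)  # 102
```

Running the same counter over a large sample of real positions would be one way to measure how often the variable-length format actually beats a fixed-width one.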

Mitch


Yes, that is the main issue here: space saving and fast processing of similar positions. I have read a lot about variable length encoding, and there is even a Huffman procedure/algorithm to do that, but I don't know how it works in real-life practice (in terms of space saved and computer post-processing) – Adrian Jan 27 '14 at 09:38
Only way to find out is to test it. You could further optimize it by assuming a starting board, and only record positions changed. – Mitch Jan 27 '14 at 14:31
Hi Mitch, I've read of course about all sorts of optimisations (some crazy ones) but any such optimisation involves a lot of post-processing work to (for example) identify similar positions. In a database with hundreds of millions of positions, this would probably be an overkill. Database speed is just as important as storage space, for my purposes... – Adrian Jan 28 '14 at 09:26
That sounds like a Huffman encoding, this is run length encoding. The only optimization it requires is that you don't record a symbol until you find one which is not a repeat. (http://en.wikipedia.org/wiki/Run-length_encoding) It is pretty much the simplest form of compression out there. – Mitch Jan 28 '14 at 15:01
I agree that RLE is a less complicated encoding, but it is still an "encoding". FEN itself is a semi-RLE for empty squares and (same as RLE) it would be perfect for identifying exactly identical positions. However, if the goal is to identify "similar" (but not identical) positions, it requires at least two steps: decoding and comparing... and that takes a lot of computing time. What I am looking for is the perfect "mix" of minimal space with maximum speed. – Adrian Jan 29 '14 at 21:26
You asked "more efficient way of storing a chess position", if you want answers to "how can I compare the difference in two chess boards to produce a distance," you should ask another question. My preliminary answer would be to match n-grams. – Mitch Jan 29 '14 at 23:45
Oh, you're absolutely right... I didn't mention this crucial aspect in my original question (now edited). Thanks, I'll look at n-grams, it sounds like a very advanced Bayesian statistics application, but it's an interesting idea. – Adrian Jan 31 '14 at 08:55
Mitch, I was just wondering: the starting position is sort of a "simple scenario" where one can repeat the empty squares 32 times. How about the "worst-case scenario" where all 32 pieces are perfectly scattered around the board, each of them separated by an empty square? How many bits would be required at most? – Adrian Jan 31 '14 at 15:31
In that case it would be no worse than a literal encoding. Although you _can_ repeat, you do not have to. You would end up with 32 piece symbols, and 32 space symbols. Both can be 1 nibble each, and you would end up with 64 nibbles at 256 bits - not much worse than the 192 bits with piece based encoding. – Mitch Jan 31 '14 at 23:21
Yes, but then 4 numbers of 64 bits each is much easier to store, added to which the big plus of having (the much faster) bitwise operations available. I have a hunch that searching millions of positions bitwise is orders of magnitude faster than comparing strings. – Adrian Feb 02 '14 at 09:20
Again, storage is different from analysis. Unless you are comparing every board to every other board (complexity `O(n*(n+1)/2)`, which for 64000 boards is about 2 billion comparisons), you will need a second index to find boards to compare, which means your problem may not be efficient comparison of at-rest board states. – Mitch Feb 02 '14 at 19:48
Of course, they're different but both very important. Nope, don't need to compare every board to every other board, just one board to find similarities with in millions of other boards. I hope this is now more clearly explained, I should have done that from the very beginning. – Adrian Feb 03 '14 at 22:54
Ok, but if you are looking for all similar boards, you are still performing the search I referenced. If you are only looking for boards similar to a single starting board, your complexity is `O(n)`, but that is still quite high for large values of `n`. Indexes can take that down to `O(log(n))`. If each comparison is 1ms with an `n` of 1E6, you can search for a single predicate board in 6 ms versus 16 minutes. See http://en.wikipedia.org/wiki/Database_index – Mitch Feb 03 '14 at 23:18
A database index is definitely needed, and I was even thinking about a bitmap index if nothing more efficient comes up. For the moment I can think of nothing else more quick than bitwise operations, but... this is the point of this thread: to find better alternatives if they exist. It all depends, of course, on the additional storage required by the index. The first reaction is to use those 4 numbers as a primary key in the database, but other indexes could prove useful for the database lookup. – Adrian Feb 05 '14 at 09:29
This will be my last post, but my whole point is that those are two vastly different requirements. That which is quick to query is not usually efficiently stored; that which is compact, is not typically quick to query. Ask another question if you want answers to "How do I find similar chess positions", which is itself a complicated problem. As your current question is written, I would have voted to close as it is too broad as it has no single factual answer. – Mitch Feb 06 '14 at 00:19
To back that up, consider the problems inherent in answering your question: storage, schema, indexes, comparison. The storage, I believe, is answered above. Schema is a best fit for DBAs, but is almost certainly not best accomplished with a 256 bit primary key if speed is a factor. Indexing is dependent upon your comparison, which is dependent upon your definition of similar. I postulate there is a great deal of dissension over what is and is not similar in chess, but there are even multiple mathematical definitions. E.g.: transpositions vs edits vs attack similarity vs move similarity – Mitch Feb 06 '14 at 00:26
Thank you Mitch, you've been very helpful. It is a complicated problem indeed, but I got a lot of insight from your posts. – Adrian Feb 07 '14 at 06:55

rka · Answer 7 · 2019-03-03T01:08:09.470

1

Consider up to 32 pieces on the board. Each piece can be on one of the 64 squares. Representing the piece positions independently in a predetermined order requires 32*6=192 bits.

In addition to that each pawn can be promoted to a rook, bishop, knight or queen, so for each pawn we need to encode its state in 3 additional bits (4 possible piecetypes and normal pawn).

32*6+16*3 = 240 bit/ 30 byte

In many cases you will need additional information about the state of the game/variant the position arises in:

En passant file: 4 bits (8 files and none)

Castlerights: 4 bits (short/long white/black)

sideToMove : 1 bit (white/black)

which adds up to 249 bit/32 bytes.

This might not be the most compact representation, but it is easy to en-/decode.

edited Mar 03 '19 at 01:08

answered Mar 03 '19 at 00:42


You don't need to consider what a pawn could promote to for the current position. And if you store the position of each of the 32 pieces, you must have some value representing not on the board. – qwr Sep 11 '22 at 02:40
@qwr You could store the king's position first, and use the king's position to mean "not on the board", since no other piece can occupy the same square as the king – AspectOfTheNoob Jan 31 '23 at 13:43

score 0 · Answer 8 · edited Jun 20 '20 at 09:12

My two cents

Short version

Choose a data format that makes it easy to compute the similarity of two positions.
Store the positional data near the searching program (possibly in memory).
Brute force the search through all the positions when searching for similar positions.
Possibly divide the search among multiple threads/processes.

Longer version

32 bytes (4*64 bits) is quite a small amount of data. 1000 million chess positions would fit into 30 gigabytes. 192 bits is 24 bytes, which would come to 23 gigabytes. The database probably uses some kind of compression, so the size on disk might be less than these figures. I don't know what kind of limits there are for storage, but because these encodings already seem quite tight, it might not be worth the effort to try to minimize further.

Because the ability to find similar positions is required, I think the encoding should make it easy to compare different positions. Preferably the similarity could be computed without decoding. For this to work, the encoding should probably be constant length (I can't think of an easy way to do this with variable length coding).

Indexing might speed up the similarity search. A naive approach would be to index all the positions by piece locations in the database. This would make 32 indexes (and maybe more for the additional information). It would make the search lightning fast, at least in theory.

Indexes are going to take quite a lot of space, probably more than the actual positional data. And still they might not help that much. For example, finding positions where the black king is on or near e4 requires 9 searches using the index and then hopping around the 30 gigabytes of positional information, which is likely to need disk access at random locations. And finding similar positions is probably done for more than one piece...

If the storage format is efficient, it might just be enough to brute force (like this) through all the positional data and check the similarity position by position. This will use the CPU caches efficiently. Also, because of the constant length records, it is easy to divide the work among multiple processors or machines.

Whether to use a piece-centric or board-based storage format depends on how you are going to calculate the similarity of two positions. Piece-centric storage gives an easy way to calculate the distance of one piece between two different positions. However, in the piece-centric approach every piece is identified separately, so it is not so easy to find a pawn on a certain square: one has to check every pawn's location. If the piece identity is not so important, then board-based storage makes it easy to just check whether a pawn is on the wanted square. On the other hand, it is then not possible to check exactly which pawn is there.
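To make "similarity without decoding" concrete, here is a minimal Python sketch for a board-based, constant-length record: 4 bits per square packed into one 256-bit integer, compared with XOR and a bit count. The code table, the choice of "number of differing squares" as the similarity measure, and the names `pack`, `differing_squares` and `closest` are all assumptions made for illustration.

```python
# Hypothetical 4-bit codes: 0 = empty, 1..6 white, 7..12 black (FEN letters).
CODES = {ch: i for i, ch in enumerate(".PNBRQKpnbrqk")}
LOW_BIT_OF_EACH_NIBBLE = int("1" * 64, 16)  # 0x1111...1, one bit per square


def pack(squares):
    """Pack 64 symbols (a1 first) into a single 256-bit integer."""
    value = 0
    for i, ch in enumerate(squares):
        value |= CODES[ch] << (4 * i)
    return value


def differing_squares(a, b):
    """Count squares whose content differs, without unpacking either record."""
    diff = a ^ b
    # OR the four bits of every nibble down into that nibble's lowest bit.
    folded = (diff | diff >> 1 | diff >> 2 | diff >> 3) & LOW_BIT_OF_EACH_NIBBLE
    return bin(folded).count("1")


def closest(records, query):
    """Brute-force scan: return the stored record nearest to the query."""
    return min(records, key=lambda record: differing_squares(record, query))


START = "RNBQKBNR" + "P" * 8 + "." * 32 + "p" * 8 + "rnbqkbnr"
after_e4 = list(START)
after_e4[12], after_e4[28] = ".", "P"  # e2 is index 12, e4 is index 28
after_e4 = "".join(after_e4)

print(differing_squares(pack(START), pack(after_e4)))  # 2
```

Because every record has the same width, the scan in `closest` splits trivially across threads or machines by slicing the record list.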

Hi Panu, thanks very much for this. The article you mention is inspiring, especially for "Brute force works if you have a brute problem (and a lot of force)". However that article refers to the performance when operating on all the data in the RAM, which is not likely to happen for a chess database (dividing the work on multiple nodes would improve performance, though). Will certainly keep in mind your longer version, thanks for the cents (they're valuable!) — Adrian, Jul 13 '14 at 21:16

score 0 · Answer 9 · answered Feb 11 '17 at 13:04

There are two simple ways to store the information of the board: either by storing the locations of each piece or by storing for each square what is in it.

As Mitch explains, there is a way to compress a little bit using RLE; however, the example given is the start position, which is particularly simple to describe. In another case, where pieces are spread over the board, you could have empty squares and pieces alternating, and RLE would not compress anything. So unless a more complex algorithm is used, you're back to no compression.

I think jlahd made a mistake in the computation by counting twice the center pawns, so that in fact the space required to store the location for each piece is not 188 bits but 168 bits. To that you need to store as well if the pawns have been promoted. So in fact, for pawns there are (32 + 4x64)^16 possibilities. That's total of 223 bits = 28 bytes.

If instead we store for each square its content, we need to count the possibilities for one square. For most squares, there are 6 possible white pieces, the same for black, and the empty square. For the top and bottom rows, one color of pawn cannot appear. So that is 13 possibilities for center squares and 12 possibilities for top and bottom squares, so 13^48 x 12^16 possibilities. The location of the en passant square adds 17 possibilities. So that's about 240 bits.

To conclude, it seems you can gain 12.5% space by storing pieces positions instead of the content of each square.
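The two bit counts can be checked with Python's arbitrary-precision integers. This sketch assumes the piece-centric total is composed as 64^12 for kings, queens, knights and rooks, 32^4 for bishops, and (32 + 4x64)^16 for the pawns, which is my reading of the figures above; the variable names are hypothetical.

```python
# Piece-centric: 12 unrestricted pieces, 4 bishops, 16 pawns that may
# each be an unpromoted pawn (32 squares) or one of 4 promoted pieces.
piece_centric = 64**12 * 32**4 * (32 + 4 * 64)**16

# Square-centric: 13 contents for 48 center squares, 12 for the 16
# squares of the top and bottom rows, times 17 en passant possibilities.
square_centric = 13**48 * 12**16 * 17

piece_bits = (piece_centric - 1).bit_length()
square_bits = (square_centric - 1).bit_length()

print(piece_bits, -(-piece_bits // 8))    # bits, then bytes rounded up
print(square_bits, -(-square_bits // 8))
```

Under these assumptions the script reports 223 bits (28 bytes) for the piece-centric count and 240 bits (30 bytes) for the square-centric one.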
