how to reverse engineer Google's entity ids

Question

Google uses entities everywhere nowadays, and their IDs are usually prefixed with /m/ or /g/ (though I have also seen some /t/ lately).

I am wondering how the numbering works. For /m/ the scheme is similar to what a URL shortener would do: define an alphabet (for /m/ this is the 32 characters "0123456789bcdfghjklmnpqrstvwxyz_") and convert a number to a "short URL".

e.g. /m/04swd <-> 156524, where "4swd" is the encoded number ("/m/0" seems to be a kind of prefix)
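
To make the /m/ scheme concrete, here is a minimal Python sketch of that conversion. It assumes the key after the "/m/0" prefix is a plain big-endian base-32 number over the alphabet above; the function names `decode` and `encode` are just illustrative, and this is not taken from any official Google or Freebase code.

```python
# Alphabet from the question: digits, consonants, underscore (32 symbols).
ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz_"
BASE = len(ALPHABET)  # 32


def decode(key: str) -> int:
    """Read key as a big-endian base-32 number over ALPHABET."""
    value = 0
    for ch in key:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * BASE + ALPHABET.index(ch)
    return value


def encode(value: int) -> str:
    """Inverse of decode for non-negative integers."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return ALPHABET[0]
    digits = []
    while value > 0:
        value, rem = divmod(value, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


print(decode("4swd"))   # 156524
print(encode(156524))   # 4swd
```

The worked arithmetic for "4swd" is 4·32³ + 24·32² + 27·32 + 12 = 156524, which matches the example.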

I am stuck with /g/ IDs though. I built what seemed a reasonable alphabet from the IDs I have seen, "0123456789bcdfghjklmnpqrstvwxyz_", but I cannot get it to work.

Since Google does some of the converting itself, I have one real example: /g/11b6377dzp <-> 576462201963131861

from this: Google Search

But I still cannot figure this out.

I am mostly interested in the process of getting a handle on this kind of reverse engineering problem (and of course in the result). Any ideas?

They're opaque identifiers, so I don't understand the value in converting them from base 36 (or whatever) to base 10. What do you think you're gaining? The /m prefixed identifiers can be looked up in Wikidata or the Freebase data dumps to convert them to a name. — Tom Morris, May 06 '19 at 19:04
For one, I am simply curious. Why can /m/ be converted between base 32 and base 10, and why wouldn't that be the case with /g/? An immediate "win" would be to be able to construct a URL from a base-10 number and follow an arbitrary topic. — Valentin, May 06 '19 at 19:09
I am also interested in the design choices behind those algorithms. Why are certain letters skipped and an underscore added? This seems so random. Why are different letters skipped in the two alphabets? — Valentin, May 06 '19 at 19:15

score 1 · Accepted Answer · answered May 06 '19 at 21:42

You provided the same alphabet for both cases, but your question implies that they are different. That aside, here's a description of the two encoding schemes.

Quoting from the Freebase developer wiki, here's the encoding for a machine ID:

The keys of machine-generated ids are short variable-length sequences of characters consisting of digits, lower-case letters excluding vowels, and underscore. ... (By avoiding vowels, we hope to avoid accidently [sic] generating offensive identifiers.) Mids are also URL-safe, i.e. they don't require any escaping or unescaping to be used in URLs.

The Google Knowledge Graph IDs are in a separate namespace with the prefix "/g/1", as you noticed, and their format, according to the relevant Wikidata property page, is

\/g\/1[0-9a-np-z][0-9a-np-z_]{6,8}

so the radix varies by position (no leading underscore allowed) and they chose to only exclude the confusable letter 'o', not all vowels, apparently preferring more encoding space despite the risk of "naughty words."
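
As a small illustration of that format, here is a Python sketch that checks a string against the pattern quoted above. The helper name `looks_like_g_id` is made up for this example, and the check only validates the shape of an ID; it assumes nothing about how the characters map to a number.

```python
import re

# Pattern as quoted from the Wikidata property page; "/" needs no escaping in Python.
G_ID = re.compile(r"/g/1[0-9a-np-z][0-9a-np-z_]{6,8}")


def looks_like_g_id(s: str) -> bool:
    """True if the whole string has the shape of a /g/ ID."""
    return G_ID.fullmatch(s) is not None


print(looks_like_g_id("/g/11b6377dzp"))  # True
print(looks_like_g_id("/g/1_b6377dzp"))  # False: underscore in the first position
print(looks_like_g_id("/g/11o6377dzp"))  # False: contains the excluded letter 'o'
```

The first character class has 35 symbols and the later positions have 36 (the same 35 plus underscore), which is what "the radix varies by position" refers to.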
