
How can one find the entropy of the English language using isolated symbol probabilities of the language?

skaffman
wcwchamara
  • Is it with `space` or without space? With space we have a total of 27 characters; without, 26. So the isolated probability should be 1/26 or 1/27. – Prasad Rajapaksha Mar 09 '12 at 09:03
  • @PrasadRajapaksha See e.g. https://cs.stanford.edu/people/eroberts/courses/soco/projects/1999-00/information-theory/entropy_of_english_9.html – jtlz2 Jan 08 '21 at 06:43

1 Answer


If we define 'isolated symbol probabilities' in the way it's done in this SO answer, we would have to do the following:

  1. Obtain a representative sample of English text (perhaps a carefully selected corpus of news articles, blog posts, some scientific articles and some personal letters), as large as possible

  2. Iterate through its characters and count the frequency of occurrence of each of them

  3. Use the frequency, divided by the total number of characters, as an estimate of each character's probability

  4. Calculate the average length in bits of each character by multiplying its probability by the negative logarithm of that same probability (using the base-2 logarithm if we want the unit of entropy to be bits)

  5. Take the sum of all average lengths of all characters. That is the result.
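
Written as a formula, with p(c) standing for the estimated probability of character c (p(c) is just shorthand for this explanation, not notation from the question), steps 4 and 5 compute

H = - Σ_c p(c) * log2(p(c))

summing over all characters c, which is the entropy of the isolated-symbol distribution in bits per character.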

Caveats:

  • This isolated-symbol entropy is not what is usually referred to as Shannon's entropy estimate for English. Shannon based his estimate on conditional n-gram probabilities, rather than isolated symbols, and his famous 1950 paper is largely about how to determine the optimal n. (A rough sketch of the conditional bigram case appears after the code example below.)

  • Most people who try to estimate the entropy of English exclude punctuation characters and normalise all text to lowercase.

  • The above assumes that a symbol is defined as a character (or letter) of English. You could do a similar thing for entire words, or other units of text.

Code example:

Here is some Python code that implements the procedure described above. It normalises the text to lowercase and replaces punctuation, digits and any other non-alphabetic character with a single space, so the space itself counts as one of the symbols. It assumes that you have put together a representative corpus of English and provide it (encoded as ASCII) on STDIN.

import re
import sys
from math import log

# Function to compute the base-2 logarithm of a floating point number.
def log2(number):
    return log(number) / log(2)

# Function to normalise the text.
cleaner = re.compile('[^a-z]+')
def clean(text):
    return cleaner.sub(' ', text)

# Dictionary for letter counts
letter_frequency = {}

# Read and normalise input text
text = clean(sys.stdin.read().lower().strip())

# Count letter frequencies
for letter in text:
    if letter in letter_frequency:
        letter_frequency[letter] += 1
    else:
        letter_frequency[letter] = 1

# Calculate entropy
length_sum = 0.0
for letter in letter_frequency:
    probability = float(letter_frequency[letter]) / len(text)
    length_sum += probability * log2(probability)

# Output
sys.stdout.write('Entropy: %f bits per character\n' % (-length_sum))
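
For comparison with the first caveat above: Shannon's estimate is based on conditional n-gram probabilities rather than isolated symbols. Below is a minimal sketch (Python 3) of the simplest conditional case, n = 2, i.e. the entropy of a character given the previous character. It is not part of the original procedure; it just reuses the same normalisation and the same read-from-STDIN setup, and the variable names are my own.

import re
import sys
from collections import Counter
from math import log2

# Normalise exactly like the script above: lowercase, and collapse
# anything that is not a-z into a single space.
cleaner = re.compile('[^a-z]+')
text = cleaner.sub(' ', sys.stdin.read().lower().strip())

# Count bigrams (pairs of adjacent characters) and the first character
# of each pair (the conditioning context).
bigram_counts = Counter(zip(text, text[1:]))
context_counts = Counter(text[:-1])
total_bigrams = sum(bigram_counts.values())

# Conditional entropy H(next char | current char)
#   = - sum over pairs (a, b) of p(a, b) * log2( p(b | a) )
entropy = 0.0
for (a, b), count in bigram_counts.items():
    p_ab = count / total_bigrams
    p_b_given_a = count / context_counts[a]
    entropy -= p_ab * log2(p_b_given_a)

sys.stdout.write('Conditional bigram entropy: %f bits per character\n' % entropy)

Run it the same way as the script above, with the corpus piped in on STDIN. The value it prints will typically be lower than the isolated-symbol entropy, since conditioning on the previous character removes some uncertainty.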
jogojapan
  • What is an approximate answer for English? – jtlz2 Jan 08 '21 at 06:40
  • 1
    @jtlz2 Applying the code above to 5.9 million words of English news text, I am getting 4.126024 bits per character. Again, that is the entropy as defined in the explanation above. If you compare to other values, make sure the definition of entropy is the same. – jogojapan Jan 09 '21 at 11:05
  • Yes, I was expecting 2.2, so obviously very much definition-dependent. Thanks for doing that :) – jtlz2 Jan 09 '21 at 12:21
  • How would you calculate an uncertainty? – jtlz2 Jan 09 '21 at 13:24
  • I wonder if that's enough to distinguish dialects of English – jtlz2 Jan 09 '21 at 13:25
  • 1
    You are asking about uncertainty, but I am not sure how to interpret that. Some people describe entropy as a measure of uncertainty. Regarding dialects of English.... the bigger challenge might be to get dialect text in written / digitized form. You could get news text in British vs. American English, but I doubt it's possible to get it in Midlands English or with an Arkansas accent. For Britsh vs. American I would guess that with Shannon's entropy (which takes bigrams into account), distinction might be possible, but with a purely single-character one that seems unlikely. – jogojapan Jan 09 '21 at 16:13