I am trying to create a polars dataframe which is a frequency table of words in a list of words. Something like this:

from collections import defaultdict
word_freq = defaultdict(int)
for word in list_of_words:
    word_freq[word] += 1

Except, instead of a dictionary I would like it to be a polars dataframe with two columns: word, count.

I would also like to know the best way to convert this dict to a DataFrame (in cases where that may be needed).

ste_kwr

1 Answer

There is collections.Counter, which simplifies this:

from collections import Counter

words = ['foo', 'foo', 'bar', 'baz', 'baz']

counts = Counter(words)
Counter({'foo': 2, 'bar': 1, 'baz': 2})

To create a DataFrame:

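import polars as pl

# orient='row' states explicitly that each (word, count) tuple is one row
pl.DataFrame(list(counts.items()), schema=['word', 'count'], orient='row')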
shape: (3, 2)
┌──────┬───────┐
│ word ┆ count │
│ ---  ┆ ---   │
│ str  ┆ i64   │
╞══════╪═══════╡
│ foo  ┆ 2     │
│ bar  ┆ 1     │
│ baz  ┆ 2     │
└──────┴───────┘

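You could also do the counting in Polars with .value_counts(). Note that newer Polars releases name the second column count rather than counts, and the row order is not guaranteed unless you pass sort=True: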

pl.Series('word', words).value_counts()
shape: (3, 2)
┌──────┬────────┐
│ word ┆ counts │
│ ---  ┆ ---    │
│ str  ┆ u32    │
╞══════╪════════╡
│ foo  ┆ 2      │
│ bar  ┆ 1      │
│ baz  ┆ 2      │
└──────┴────────┘
jqurious