It seems quite strange to me as it should only store word frequencies.
I think you are misapprehending what is stored and how it is stored. The Lucene documentation for the index file formats explains this in detail. Quoting from the Overview section:
Each segment index maintains the following:
Field names. This contains the set of field names used in the index.
Stored Field values. This contains, for each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, url, or an identifier to access a database. The set of stored fields are what is returned for each hit when searching. This is keyed by document number.
Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain the term, and pointers to the term's frequency and proximity data.
Term Frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in that document if omitTf is false.
Term Proximity data. For each term in the dictionary, the positions that the term occurs in each document. Note that this will not exist if all fields in all documents set omitTf to true.
Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.
Term Vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add Term Vectors to your index see the Field constructors
Deleted documents. An optional file indicating which documents are deleted.
Some of the above are optional and probably won't be present in your indexes. However, a minimal index will have the "field names", "stored field values", "term dictionary", and "term frequency data".
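To make the list above more concrete, here is a rough sketch of those structures as plain Python dictionaries. This is a toy model only: the function name `build_toy_index` and the dictionary layout are my own invention for illustration, the tokenizer is a naive lowercase whitespace split, and none of this reflects Lucene's actual on-disk encoding.

```python
from collections import defaultdict


def build_toy_index(docs):
    """Build a toy, in-memory model of one index segment.

    docs is a list of dicts mapping field name -> text.
    Hypothetical helper for illustration; not Lucene's API or file format.
    """
    field_names = set()
    stored_fields = {}            # doc number -> {field name: original value}
    postings = defaultdict(dict)  # (field, term) -> {doc number: [positions]}

    for doc_num, doc in enumerate(docs):
        stored_fields[doc_num] = dict(doc)
        for field, text in doc.items():
            field_names.add(field)
            # Naive analysis: lowercase and split on whitespace.
            for pos, term in enumerate(text.lower().split()):
                postings[(field, term)].setdefault(doc_num, []).append(pos)

    return {
        "field_names": field_names,
        "stored_fields": stored_fields,
        # Term dictionary: each term with the number of documents containing it.
        "term_dictionary": {k: len(v) for k, v in postings.items()},
        # Term frequency data: per term, the documents and in-document counts.
        "term_frequencies": {
            k: {d: len(p) for d, p in v.items()} for k, v in postings.items()
        },
        # Term proximity data: per term, the positions in each document.
        "term_positions": {k: dict(v) for k, v in postings.items()},
    }


segment = build_toy_index([{"title": "a b a"}, {"title": "b c"}])
print(segment["term_frequencies"][("title", "a")])
```

Even in this toy form, word frequencies are only one of several structures: the original field values, the dictionary, and the positions are all kept alongside them.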
Some of these data structures scale according to the number of distinct words in your corpus. Others scale according to the number of documents, or the number of unique words per document.
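If you populate an index with a single (relatively) small document, then some of the scaling factors will be working against you: most words in the document are distinct terms, so the per-term cost of the term dictionary is not amortized over many documents.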
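As a rough illustration of that scaling effect, the sketch below counts tokens, distinct terms, and (term, document) postings entries for a list of texts. The helper `index_shape` and the sample sentences are made up for this example, and the counts say nothing about actual bytes on disk; they only show how the ratio of dictionary entries to tokens changes.

```python
def index_shape(docs):
    """Return (total tokens, distinct terms, postings entries) for a list of texts.

    Hypothetical helper for illustration; tokenization is a naive
    lowercase whitespace split, not a real Lucene analyzer.
    """
    total_tokens = 0
    docs_per_term = {}  # term -> number of documents containing it
    for text in docs:
        tokens = text.lower().split()
        total_tokens += len(tokens)
        for term in set(tokens):
            docs_per_term[term] = docs_per_term.get(term, 0) + 1
    distinct_terms = len(docs_per_term)
    postings_entries = sum(docs_per_term.values())
    return total_tokens, distinct_terms, postings_entries


single = index_shape(["the quick brown fox"])
several = index_shape(["the quick brown fox", "the lazy dog", "the quick dog"])

for label, (tokens, terms, postings) in (("single", single), ("several", several)):
    print(label, tokens, terms, postings, "terms per token:", terms / tokens)
```

With the single document, every token needs its own dictionary entry (4 terms for 4 tokens). With three documents that share vocabulary, 10 tokens need only 6 dictionary entries, so the dictionary overhead per token drops.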
Finally, the physical representations of the index segments are designed and optimized primarily for fast searching rather than for minimal storage space. That affects the "information density" of the index, and therefore the storage space used in practice.