Openrefine: text facet by counting

Question

I have a huge file composed primarily of book metadata (author, title, date, url). I want to operate on author names, which are often repeated (an author can have hundreds of records), and specifically on the subset of authors that have more than X records.

For example, I have 200 records for "William Shakespeare" but only 1 record for "John Black", and so on. Since this is a classic power law, I have hundreds of thousands of authors, most of them with 1-2 records.

Using "Text facet" and sorting by "count" is impossible, because my computer freezes.

Is there a query to have the text facet of just some records, based on their count?

Did you try a custom text facet? First remove the blanks (Facet > Customized Facets > Facet by Blank), then apply a custom text facet (Facet > Customized Text Facet). If it's a memory problem, I recommend splitting the file in half and processing the parts in batches. — iMitwe, Nov 02 '16 at 12:01
Yes, I tried. I already allocated more memory, but I need the entire file, anyway, for other operations. — Lara M., Nov 02 '16 at 12:32

score 4 · Accepted Answer · edited Nov 03 '16 at 13:58

Create a custom text facet with the following GREL expression (replace COLUMN_NAME with your actual column name):

facetCount(value, "value", "COLUMN_NAME") > 100

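You can edit the comparison (the example matches every value whose count is greater than 100). The facet then offers true and false choices; select true to keep only the matching rows.

To match only an exact count, use a double equals sign (==) like this: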

facetCount(value, "value", "COLUMN_NAME") == 100
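
If it helps to see what these expressions compute, here is a rough Python sketch of the same idea outside OpenRefine: count how often each value occurs in the column, then flag every row whose value occurs more than a threshold. This is only an illustration, not OpenRefine's implementation; the function name `facet_count_flags` and the sample data are made up for the example.

```python
from collections import Counter


def facet_count_flags(values, threshold):
    """Return one boolean per row: True if that row's value occurs
    more than `threshold` times in the whole column.

    Hypothetical helper that mimics
    facetCount(value, "value", "COLUMN_NAME") > threshold.
    """
    counts = Counter(values)  # occurrences of each distinct value
    return [counts[v] > threshold for v in values]


# Made-up sample column: one frequent author, one rare author.
authors = ["William Shakespeare"] * 3 + ["John Black"]

flags = facet_count_flags(authors, 2)
print(flags)  # [True, True, True, False]
```

Swapping `>` for `==` in the list comprehension gives the exact-count variant.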

More details in this video and this tutorial on faceting by facet count.
