
I have a huge file primarily composed of book metadata (author, title, date, url). I want to operate on author names (which are often repeated: an author can have hundreds of records), and specifically on the subset of authors that have more than X records.

For example, I have 200 records related to "William Shakespeare", but only one record for "John Black", etc. Since this is a classic power-law distribution, I have hundreds of thousands of authors, the majority of them with 1-2 records.

Using "Text facet" and then sorting by "count" is not feasible, because my computer freezes.

Is there a query to get a text facet of only some records, based on their count?

Aubrey
Lara M.
  • Did you try using a custom text facet? First remove the blanks (Facet > Customized Facets > Facet by Blank), then apply a customized text facet (Facet > Customized Text Facet). If it's a memory problem, I recommend splitting the file in half and processing the parts in batches. – iMitwe Nov 02 '16 at 12:01
  • Yes, I tried. I already allocated more memory, but I need the entire file, anyway, for other operations. – Lara M. Nov 02 '16 at 12:32

1 Answer


Create a custom text facet with the following GREL expression (replace COLUMN_NAME with your actual column name):

facetCount(value, "value", "COLUMN_NAME") > 100

You can change the comparison as needed (this example matches every value whose count is greater than 100).

To match only an exact count, use the == comparison operator (two equals signs), like this:

facetCount(value, "value", "COLUMN_NAME") == 100

More details in this video and tutorial on faceting by facet count.
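
For readers who want to see what the facetCount comparison works out to for each row, here is a rough Python sketch of the same idea. It is only an illustration, not how OpenRefine implements it; the function name rows_with_frequent_values and the sample data are made up for this example.

```python
from collections import Counter


def rows_with_frequent_values(values, threshold):
    """Return one True/False flag per row, similar in spirit to
    facetCount(value, "value", "COLUMN_NAME") > threshold.

    Hypothetical helper for illustration only.
    """
    # Count how many rows share each author name.
    counts = Counter(values)
    # A row is flagged True when its author appears more than `threshold` times.
    return [counts[v] > threshold for v in values]


# Tiny made-up column: one frequent author and one rare author.
authors = ["William Shakespeare"] * 3 + ["John Black"]
flags = rows_with_frequent_values(authors, 2)
```

In OpenRefine the custom text facet then shows just two choices, true and false, so you can select "true" to keep only the rows of frequent authors without rendering a facet with hundreds of thousands of entries.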

Thad Guidry
magdmartin