There is a lovely chunk of code in TidyText Mining Section 3.3 that I am trying to replicate in my own dataset. However, in my data, I cannot get ggplot to 'remember' that I want the data in descending order, and that I want a certain top_n.

I can run the code from TidyText Mining and I get the same charts that the book shows. However, when I run this on my own dataset, the facet wraps do not show the top_n (they seem to show a random number of categories) and data within each facet is not sorted by descending order.

I can replicate this problem with some random text data and the full code - but I also can replicate the problem with mtcars - which really confuses me.

I expect the following chart to show me mpg in descending order for each facet, and for each facet to only give me the top 1 category. It does neither for me.

require(tidyverse)

mtcars %>%
  arrange (desc(mpg)) %>%
  mutate (gear = factor(gear, levels = rev(unique(gear)))) %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup %>%
  ggplot (aes (gear, mpg, fill = am)) +
  geom_col (show.legend = FALSE) +
  labs (x = NULL, y = "mpg") +
  facet_wrap(~am, ncol = 2, scales = "free") + 
  coord_flip()

But what I really want is to have a chart like this sorted as in the TidyText book (data for example only).

require(tidyverse)
require(tidytext)

starwars <- tibble (film = c("ANH", "ESB", "ROJ"),
                  text = c("It is a period of civil war. Rebel spaceships, striking from a hidden base, have won their first victory against the evil Galactic Empire. During the battle, Rebel spies managed to steal secret plans to the Empire's ultimate weapon, the DEATH STAR, an armored space station with enough power to destroy an entire planet. Pursued by the Empire's sinister agents, Princess Leia races home aboard her starship, custodian of the stolen plans that can save her people and restore freedom to the galaxy.....",
                           "It is a dark time for the Rebellion. Although the Death Star has been destroyed, Imperial troops have driven the Rebel forces from their hidden base and pursued them across the galaxy. Evading the dreaded Imperial Starfleet, a group of freedom fighters led by Luke Skywalker has established a new secret base on the remote ice world of Hoth. The evil lord Darth Vader, obsessed with finding young Skywalker, has dispatched thousands of remote probes into the far reaches of space....",
                           "Luke Skywalker has returned to his home planet of Tatooine in an attempt to rescue his friend Han Solo from the clutches of the vile gangster Jabba the Hutt. Little does Luke know that the GALACTIC EMPIRE has secretly begun construction on a new armored space station even more powerful than the first dreaded Death Star. When completed, this ultimate weapon will spell certain doom for the small band of rebels struggling to restore freedom to the galaxy...")) %>%
  unnest_tokens(word, text) %>%
  mutate(film = as.factor(film)) %>%
  count(film, word, sort = TRUE) %>%
  ungroup()

total_wars <- starwars %>%
  group_by(film) %>%
  summarize(total = sum(n))

starwars <- left_join(starwars, total_wars)

starwars <- starwars %>%
  bind_tf_idf(word, film, n)

starwars %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(film) %>%
  top_n(10) %>%
  ungroup %>%
  ggplot(aes(word, tf_idf, fill = film)) +
  geom_col(show.legend = FALSE) +
  labs (x = NULL, y = "tf-idf") +
  facet_wrap(~film, ncol = 2, scales = "free") +
  coord_flip()
JMacKay
  • What are you expecting in the first few lines of your `mtcars` code? If you group by `am` and take the highest `mpg`, you have a 2 row data frame because there are only 2 values of `am`. Is this your intention? – camille May 16 '18 at 17:52
  • Hi Camille - yes, that's the intention. The data frame should (and does) sort by mpg and keep however many rows you request, but this doesn't seem to get passed to ggplot in any of my datasets (though it does for the larger data example in the TidyText book) – JMacKay May 17 '18 at 08:32

1 Answer

I believe what is tripping you up here is that top_n() defaults to the last variable in the table, unless you tell it what variable to use for ordering. In the examples in our book, the last variable in the dataframe is tf_idf so that is what is used for ordering. In the mtcars example, top_n() is using the last column in the dataframe for ordering; that happens to be carb.
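
To make the default concrete, here is a minimal sketch on mtcars. It assumes only dplyr is loaded, and the object names `by_default` and `by_mpg` are made up for illustration. Without a second argument, top_n() ranks by carb (the last column) and prints a "Selecting by carb" message; passing mpg ranks by the column the question intended.

```r
library(dplyr)

# No wt argument: top_n() falls back to the last column of mtcars (carb).
by_default <- mtcars %>%
  group_by(am) %>%
  top_n(1) %>%
  ungroup()

# Explicit wt argument: rank by mpg within each am group.
by_mpg <- mtcars %>%
  group_by(am) %>%
  top_n(1, mpg) %>%
  ungroup()

# Compare how many rows each ranking keeps.
nrow(by_default)
nrow(by_mpg)
```

Ranking by carb keeps more rows than expected because several cars share the highest carb value within a group. In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which takes the ordering column as an explicit argument, so this kind of silent default is less likely to go unnoticed.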

You can always tell top_n() what variable you want to use for ordering by passing it as an argument. For example, check out this similar workflow using the diamonds dataset.

library(tidyverse)

diamonds %>%
  arrange(desc(price)) %>%
  group_by(clarity) %>%
  top_n(10, price) %>%
  ungroup %>%
  ggplot(aes(cut, price, fill = clarity)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~clarity, scales = "free") + 
  scale_x_discrete(drop=FALSE) +
  coord_flip()

Created on 2018-05-17 by the reprex package (v0.2.0).

These example datasets are not perfect parallels because they don't have one row per combination of characteristics in the way that the tidy text data frames do. I am pretty sure the issue with top_n() is the problem, though.

Julia Silge
  • Hi Julia! I hugely appreciate your input - and I'm a little bit fangirly that you noticed me ;) Specifying `top_n(n, variable)` *does* give me the number of words I'm expecting within each facet, but my final ggplot chart still doesn't 'arrange' by descending tf_idf. So in my random `starwars` example specifying `top_n(10, word)` gives us the top 10 words, but ROJ still ranks 'has' third on the list even though it has the smallest tf_idf. I expect this is a very obvious fix that I am missing, so I really appreciate your help! :) – JMacKay May 22 '18 at 08:36
  • Hi again - I think I'm beginning to get to grips with this after playing about a little more. I had assumed that `top_n` was selecting the number of words to display in each facet, which I now realise it is not. So back to the starwars example, if I specify `top_n(5, tf_idf)` it is in fact telling ggplot to select the top 5 tf_idf . . . but my resulting chart shows many words. So I think my question is actually: *how can I select how many words are shown within each facet?* I wonder if I have misunderstood how your code works in the book? – JMacKay May 22 '18 at 09:15
  • A final update! I went back to play with the code in your example and I realised something. You ask for `top_n(15)` and I assumed that within each facet I was getting 15 words. I have now carefully counted, and within the facets 'Northanger Abbey' and 'P&P' there are in fact 16 words. So the code wasn't doing what I thought it did at all. I will mark this as solved as I realise now I wasn't understanding the original example - many thanks! – JMacKay May 22 '18 at 09:34
  • Ah, I think I may understand what is going on here now. When there are **ties** (i.e. the exact same tf-idf score, which can happen with small datasets like all 6 Jane Austen novels or the Star Wars movies), then `top_n()` does not break the ties; it keeps all of the items at that rank. – Julia Silge May 24 '18 at 20:01
  • Yes that's what I concluded too - I just didn't read the code properly in your example! Thanks so much for your help :) – JMacKay May 25 '18 at 09:04
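
The tie behaviour described in the last comments, and the per-facet ordering question raised earlier in the thread, can be sketched as follows. This is an illustration under assumptions, not the book's code: the `scores` data is invented, slice_max() needs dplyr 1.0.0 or later, and reorder_within() and scale_x_reordered() come from the tidytext package. The likely reason the bars do not sort within each facet is that a factor has a single level order shared by all facets, so a word that appears in several groups cannot be ordered independently in each one unless it is made unique per group.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Invented scores: group "a" has a tie for second place.
scores <- tibble(
  group = c("a", "a", "a", "b", "b", "b"),
  word  = c("x", "y", "z", "x", "y", "z"),
  score = c(3, 2, 2, 5, 4, 1)
)

# top_n() keeps every row tied at the cutoff, so group "a" returns 3 rows.
with_ties <- scores %>%
  group_by(group) %>%
  top_n(2, score) %>%
  ungroup()

# slice_max(with_ties = FALSE) returns exactly n rows per group.
exact_n <- scores %>%
  group_by(group) %>%
  slice_max(score, n = 2, with_ties = FALSE) %>%
  ungroup()

# reorder_within() builds a per-group factor so each facet sorts on its own;
# scale_x_reordered() strips the group suffix from the axis labels.
p <- exact_n %>%
  mutate(word = reorder_within(word, score, group)) %>%
  ggplot(aes(word, score, fill = group)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~group, scales = "free") +
  coord_flip()
```

Printing `p` draws the chart. With `with_ties = FALSE` the choice among tied rows is arbitrary, so keeping the ties, as top_n() does, may be preferable when the exact bar count matters less than showing every word at that rank.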