0

I'm currently working on a project and I want to summarize a column from a joined table twice. SQL code is this:

SELECT M.date,T.team_long_name AS Home_Team, M.home_team_goal, Te.team_long_name AS Away_Team, M.away_team_goal
FROM Match AS M JOIN Team AS T
ON T.team_api_id = M.home_team_api_id
JOIN Team AS Te
ON Te.team_api_id = M.away_team_api_id
WHERE match_api_id = 539848;

...and the result is this:
...and the result is this

Database tables are as shown here:
Database tables are as shown here

I hope that I have provided all the information needed.

Question: How can I have the same result in R by only using dplyr library?

Table names and structure for the first 10 rows as below:

Match:

structure(list(id = 1:10, country_id = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), league_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), season = c("2008/2009", "2008/2009", "2008/2009", 
"2008/2009", "2008/2009", "2008/2009", "2008/2009", "2008/2009", 
"2008/2009", "2008/2009"), stage = c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 10L), date = c("2008-08-17 00:00:00", "2008-08-16 00:00:00", 
"2008-08-16 00:00:00", "2008-08-17 00:00:00", "2008-08-16 00:00:00", 
"2008-09-24 00:00:00", "2008-08-16 00:00:00", "2008-08-16 00:00:00", 
"2008-08-16 00:00:00", "2008-11-01 00:00:00"), match_api_id = c(492473L, 
492474L, 492475L, 492476L, 492477L, 492478L, 492479L, 492480L, 
492481L, 492564L), home_team_api_id = c(9987L, 10000L, 9984L, 
9991L, 7947L, 8203L, 9999L, 4049L, 10001L, 8342L), away_team_api_id = c(9993L, 
9994L, 8635L, 9998L, 9985L, 8342L, 8571L, 9996L, 9986L, 8571L
), home_team_goal = c(1L, 0L, 0L, 5L, 1L, 1L, 2L, 1L, 1L, 4L), 
    away_team_goal = c(1L, 0L, 3L, 0L, 3L, 1L, 2L, 2L, 0L, 1L
    )), row.names = c(NA, 10L), class = "data.frame")

Team:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 614L, 1034L), team_api_id = c(9987L, 
9993L, 10000L, 9994L, 9984L, 8635L, 9991L, 9998L, 7947L, 9985L, 
8203L, 8342L, 9999L, 8571L, 4049L, 9996L, 10001L, 9986L, 9997L, 
9989L), team_long_name = c("KRC Genk", "Beerschot AC", "SV Zulte-Waregem", 
"Sporting Lokeren", "KSV Cercle Brugge", "RSC Anderlecht", "KAA Gent", 
"RAEC Mons", "FCV Dender EH", "Standard de Liège", "KV Mechelen", 
"Club Brugge KV", "KSV Roeselare", "KV Kortrijk", "Tubize", "Royal Excel Mouscron", 
"KVC Westerlo", "Sporting Charleroi", "Sint-Truidense VV", "Lierse SK"
)), row.names = c(NA, 20L), class = "data.frame")

In the desired result I used match_api_id = 539848 but as it is not included in this sample data, use one of your own choice.

The main issue is to be able to have team_long_name twice in the result but for different teams, matching by their team_api_id 's.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 2
    You know, Dimitris, I was having a look at this and realized that I don't really want to type in all sorts of data. If you have sample data as `data.frame`s (perhaps 1-2 countries, 1-2 leagues, 2-3 teams, and *just enough* player info to support this), then it would be good to include that sample data here (via `dput`). It looks like you already have a database somewhere, so it should be straight-forward to query it *manually* into data. From there, I (for one) would probably create a temporary sqlite database for dev, demo, and an answer. – r2evans May 15 '20 at 00:04
  • I edited my first post but the question is still closed. Should I wait for someone responsible to open it again or should I post a new one? – Dimitris Papantzikos May 15 '20 at 15:20
  • 1
    Dimitris, I don't want to leave you hanging too long ... but cannot do this justice in a comment. Start with a `con`nection to the db, then `tbl_match <- tbl(con, "Match"); tbl_team <- tbl(con, "Team")`, then something like `filter(tbl_match, match_api_id == 492477) %>% inner_join(select(tbl_team, home_team_api_id = team_api_id, Home_Team = team_long_name), by = "home_team_api_id") %>% inner_join(select(tbl_team, away_team_api_id = team_api_id, Away_Team = team_long_name), by = "away_team_api_id") %>% select(date, Home_Team, home_team_goal, Away_Team, away_team_goal)`. – r2evans May 15 '20 at 16:43
  • Oooh I didn't know I could use a select function inside the inner join. It worked ! I really appreciate it man thank you so much ! Edit : I tried the first solution you posted, it's ok! I had already done that extra step you explained. – Dimitris Papantzikos May 15 '20 at 16:52
  • And I'm thankful to him. As I said many times I'm really new to the platform so any help is appreciated ! I didn't even know that I have to tag you and the other two to reopen the question. – Dimitris Papantzikos May 15 '20 at 17:08
  • So now I suppose that I "Answer my question" so that others will be able to see it ? – Dimitris Papantzikos May 15 '20 at 17:14
  • Oh ok I will wait for him then, that's why I asked in the first place. Thank you! – Dimitris Papantzikos May 15 '20 at 17:24
  • @IanCampbell ... does *"sweet reward of reputation"* come with an espresso shot? Chocolate? Wine? Dang it ... okay. – r2evans May 15 '20 at 17:36

1 Answers1

2

Up front, the dbplyr pipe:

tbl_match <- tbl(fakedb, "Match")
tbl_team <- tbl(fakedb, "Team")
tbl_match %>%
  filter(match_api_id == 492477) %>%
  inner_join(select(tbl_team, home_team_api_id = team_api_id, Home_Team = team_long_name),
             by = "home_team_api_id") %>%
  inner_join(select(tbl_team, away_team_api_id = team_api_id, Away_Team = team_long_name),
             by = "away_team_api_id") %>%
  select(date, Home_Team, Away_Team) %>%
  collect()

Edited to include collect(), since without it the output is not a proper frame and/or may not include all relevant data.

from the corresponding DBI call:

DBI::dbGetQuery(fakedb, some_long_query)

Backfill from your sample data. Note that your data is inconsistent and incomplete, so I had to make some assumptions/translations. For instance, your first structure, which I'm inferring is Match, does not match the schema as depicted in your picture: it includes extra columns like season and *_team_goal. Also, your queried match_api_id of 539848 is not in the sample data, so I used one that was present. (In the future, I suggest that your code and sample data should be consistent with regards to things like this.)

Code to generate a fake databsae for the purposes of this answer. Starting with your two structures as Match and Team.

library(dbplyr)
library(dplyr)
fakedb <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(fakedb, Match)
copy_to(fakedb, Team)

some_long_query <- '
  SELECT
    M.date, T.team_long_name AS Home_Team, M.home_team_goal,
    Te.team_long_name AS Away_Team, M.away_team_goal
  FROM
    Match AS M
    JOIN Team AS T ON T.team_api_id = M.home_team_api_id
    JOIN Team AS Te ON Te.team_api_id = M.away_team_api_id
  WHERE
    match_api_id = 492477;' # 539848

DBI::dbGetQuery(fakedb, some_long_query)
#                  date     Home_Team home_team_goal         Away_Team away_team_goal
# 1 2008-08-16 00:00:00 FCV Dender EH              1 Standard de Liège              3


tbl_match <- tbl(fakedb, "Match")
tbl_team <- tbl(fakedb, "Team")
tbl_match %>%
  filter(match_api_id == 492477) %>%
  inner_join(select(tbl_team, home_team_api_id = team_api_id, Home_Team = team_long_name),
             by = "home_team_api_id") %>%
  inner_join(select(tbl_team, away_team_api_id = team_api_id, Away_Team = team_long_name),
             by = "away_team_api_id") %>%
  select(date, Home_Team, home_team_goal, Away_Team, away_team_goal) %>%
  collect()
# A tibble: 1 x 5
#   date                Home_Team     home_team_goal Away_Team         away_team_goal
#   <chr>               <chr>                  <int> <chr>                      <int>
# 1 2008-08-16 00:00:00 FCV Dender EH              1 Standard de Liège              3
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    I had no idea you could make fake databases so easily. I'll keep this in mind. – Ian Campbell May 15 '20 at 17:32
  • It's in-memory so really only good for demonstration and testing. And that's precisely what we needed here :-) – r2evans May 15 '20 at 17:33
  • 1
    1) I supposed that the columns I included were the only ones needed to solve this issue and as I see it's true. 2) The schema is incomplete due to lack of space and because I wasn't instructed to include all columns except those needed as foreign keys to be able to see the table connections. Although, I'm sorry for that. 3) About the match_api_id : I said that in my edited question : "In the desired result I used match_api_id = 539848 but as it is not included in this sample data, use one of your own choice." – Dimitris Papantzikos May 15 '20 at 17:42
  • 1
    Good points, thanks for the clarification. BTW, I needed to add `%>% collect()` at the end to get a *proper* return value. If you look at the returned value *without* `collect()`, you would notice that it actually contains no data, just a lazy reference to it. *With* `collect()`, you have actual data. – r2evans May 15 '20 at 17:47
  • I used this to learn a little bit about `dbplyr`; not that this is a complex example (you kept it simple with your sample data, thank you), but just going through basic operations like this demonstrates how "simple" translating from `dplyr` to `dbplyr` can be. Since most of my real-database-tables have 1M rows or more, I've been hesitant to test it for fear of accidentally collecting (downloading) way more than I needed. Thanks for the impetus and clear example to test it more simply/safely. – r2evans May 15 '20 at 17:49
  • 1
    That's also true. Again, thanks a lot and you're welcome ! – Dimitris Papantzikos May 15 '20 at 17:51