
I would like to use dplyr and RMySQL to work with my large data. There are no issues with the dplyr code; the problem (I think) is with exporting the data out of MySQL into R. My connection is dropped every time, even though I am using n = Inf in collect(). In theory, my data should have more than 50K rows, but I can only get around 15K back. Any suggestions are appreciated.

Approach 1

library(dplyr)
library(RMySQL)

# Connect to a database and select a table 
my_db <- src_mysql(dbname='aermod_1', host = "localhost", user = "root", password = "")   
my_tbl <- tbl(my_db, "db_table") 
out_summary_station_raw <- select(my_tbl, -c(X, Y, AVERAGE_CONC))
out_station_mean_local <- collect(out_summary_station_raw)
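
As a quick sanity check on the mismatch, the row count computed on the MySQL side can be compared with the number of rows that actually arrive in R (a sketch using the objects defined above):

# Row count computed by MySQL
my_tbl %>% summarise(n = n()) %>% collect()

# Rows actually retrieved into R (only ~15K here instead of the expected 50K+)
nrow(out_station_mean_local)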

Approach 2: using Pool

library(pool)
library(RMySQL)
library(dplyr)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "aermod_1",
  host = "localhost",
  username = "root",
  password = ""
)

out_summary_station_raw <- src_pool(pool) %>% tbl("aermod_final") %>% select(-c(X, Y, AVERAGE_CONC))

out_station_mean_local <- collect(out_summary_station_raw, n = Inf)
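
Not related to the warning itself, but when using pool it is good practice to release the underlying connections by closing the pool once the data has been collected:

poolClose(pool)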

Warning message (both approaches):

Warning messages:
1: In dbFetch(res, n) : error while fetching rows
2: Only first 15,549 results retrieved. Use n = Inf to retrieve all. 

Update:

Checked the log and it looks fine from the server side. For my example, the slow-log said Query_time: 79.348351 Lock_time: 0.000000 Rows_sent: 15552 Rows_examined: 16449696, but collect just could not retrieve the full data. I am able to replicate the same behavior using MySQL Workbench.

TTT
  • Are you sure this is an R problem and not a MySQL problem? – shadowtalker Jun 12 '17 at 17:52
  • @ssdecontrol, it is possible. Or, more precisely, how do I configure a few MySQL parameters through R? For instance, I increased `DBMS connection read time out` in `Workbench`, but I am not sure what the equivalent way is through R (see the sketch after these comments). – TTT Jun 12 '17 at 19:09
  • I think the only way to configure the client-side timeout is to change the relevant option in your `my.cnf` file. – shadowtalker Jun 12 '17 at 19:20
  • @ssdecontrol, thanks for your suggestion. Added `interactive_timeout = 60000 wait_timeout = 60000 net_read_timeout = 60000 connect_timeout = 60000` to `my.ini` and verified those variables are in use. But I still could not get the full data... – TTT Jun 12 '17 at 19:38
  • OK, great. You also have to consider the _server_-side timeout, if any. Maybe you can check the debug output from `mysqld` for clues. Otherwise, this question doesn't seem answerable, since I've never had this problem and don't know of a way to reproduce it easily. – shadowtalker Jun 12 '17 at 19:47
  • @ssdecontrol, checked the log and it looks fine from the server side. For my example, the `slow-log` said `Query_time: 79.348351 Lock_time: 0.000000 Rows_sent: 15552 Rows_examined: 16449696`, but `collect` just could not retrieve the full data. I am able to replicate similar activity in `MySQL Workbench`. – TTT Jun 12 '17 at 20:17
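
A sketch of the timeout idea from the comments, using a plain DBI/RMySQL connection (the SET SESSION statements are standard MySQL; whether raising these session timeouts actually fixes the truncated fetch here is unverified):

library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), dbname = "aermod_1",
                 host = "localhost", username = "root", password = "")

# Raise session-level timeouts for this connection only
dbGetQuery(con, "SET SESSION net_read_timeout = 60000")
dbGetQuery(con, "SET SESSION wait_timeout = 60000")

# Check which timeout values are in effect for the session
dbGetQuery(con, "SHOW VARIABLES LIKE '%timeout%'")

dbDisconnect(con)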

2 Answers


As discussed in https://github.com/tidyverse/dplyr/issues/1968 (see also https://github.com/tidyverse/dplyr/blob/addb214812f2f45f189ad2061c96ea7920e4db7f/NEWS.md and https://github.com/tidyverse/dplyr/commit/addb214812f2f45f189ad2061c96ea7920e4db7f), this package issue seems to be addressed.

What version of the dplyr package are you using?
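
For reference, the installed version can be checked with:

packageVersion("dplyr")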

Linda
  • Thanks for your suggestion. Just upgraded to 0.7.0 from 0.5.0 but still got the same error. `attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] pool_0.1.0 RMySQL_0.10.11 DBI_0.6-1 dplyr_0.7.0 loaded via a namespace (and not attached): [1] magrittr_1.5 assertthat_0.1 R6_2.2.1 tools_3.3.2 glue_1.0.0 [6] tibble_1.3.3 Rcpp_0.12.9 rlang_0.1.1 dbplyr_1.0.0 ` – TTT Jun 13 '17 at 16:41

After the most recent RMySQL update, I noticed that I could not collect() data from large views, and I reported it as an issue. Your problem might be related.

One thing to try is to roll back to the previous version.

devtools::install_version("RMySQL", version = "0.10.9", 
                          repos = "http://cran.us.r-project.org")
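
After reinstalling (and restarting R), you can confirm the rollback took effect:

packageVersion("RMySQL")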
TJ Mahr