1

I have two setups: one runs on windows 10 (ntfs partition), the other on debian (ext4 partition). The R source code is the same. The main process starts 8 child-processes (P-SOCKS) -- on 8 vcores -- which all query and write to the same WAL enabled sqlite database.

On windows 10 I get 100 % CPU load distributed over all the processes. On debian, I hardly get 25 % CPU load. Monitoring the processes on debian I think writes are the bottlenecks since I see only one process hitting 100 % on its vcore at a time. (The others are probably waiting to write.)

Each connection is using PRAGMA busy_timeout = 60000; and PRAGMA journal_mode = WAL;.

I am trying to debug this. I have tried PRAGMA synchronous = OFF; thinking it could have something to do with fsync(), but I don't see any improvement. Any other suggestions to what might be causing the low performance on debian?

Edit: Write-Cache seems to be enabled on the SCSI disk (checked with sdparm) and tweaking ext4 mount options such as barrier=0 and data=writeback do not seem to have any effect.

Benchmarking

Here is some simple code for benchmarking concurrent writes:

make.con <- function() {
  con <<- DBI::dbConnect(RSQLite::SQLite(), dbname = 'db.sqlite')
  DBI::dbExecute(con, 'PRAGMA journal_mode = WAL;')
  DBI::dbExecute(con, 'PRAGMA busy_timeout = 60000;')
  DBI::dbExecute(con, '
    CREATE TABLE IF NOT EXISTS tmp (
      id INTEGER NOT NULL,
      blob BLOB NOT NULL,
      PRIMARY KEY (id)
  )')
}
make.con()


fn <- function(x) {
  set.seed(x)
  # read
  random.blob.read <- RSQLite::dbGetQuery(con, 'SELECT blob FROM tmp WHERE id = (SELECT abs(random() % max(tm.id)) FROM tmp tm);')
  # write
  blob <- serialize(list(rand = runif(1000)), connection = NULL, xdr = FALSE)
  RSQLite::dbExecute(con, 'INSERT INTO tmp (blob) VALUES (:blob);', params = list('blob' = list(blob)))
}

n <- 30000L

parallel::setDefaultCluster(parallel::makeCluster(spec = 2L))
parallel::clusterExport(varlist = 'make.con')
invisible(parallel::clusterEvalQ(expr = {make.con()}))

microbenchmark::microbenchmark(
  lapply(1:n, fn),
  parallel::parLapplyLB(X = 1:n, fun = fn, chunk.size = 50L), 
  times = 2L
)

parallel::stopCluster(cl = parallel::getDefaultCluster())

Code simply reads and writes blobs to the database. First, do some dummy runs and allow the database to increase to a few GB.

On my windows 10 laptop I get these results (6GB database):

Unit: seconds
                                                       expr      min       lq     mean   median       uq      max neval
                                            lapply(1:n, fn) 26.02392 26.02392 26.54853 26.54853 27.07314 27.07314     2
 parallel::parLapplyLB(X = 1:n, fun = fn, chunk.size = 50L) 15.73851 15.73851 16.44554 16.44554 17.15257 17.15257     2

I clearly see 1 vcore at 100 %, then 2 vcores at 100%. The performance is almost twice as fast which shows that 2 concurrent processes do not block each other.

On debian I get this:

Unit: seconds
                                                       expr      min       lq     mean   median       uq      max neval  
                                            lapply(1:n, fn) 39.96850 39.96850 40.14782 40.14782 40.32714 40.32714     2
 parallel::parLapplyLB(X = 1:n, fun = fn, chunk.size = 50L) 43.34628 43.34628 44.85910 44.85910 46.37191 46.37191     2

The two vcores never max out. Also, there is no performance improvement when using 2 processes -- it is even worse as they seem to block each other. And lastly, debian is on better (albeit virtualized) hardware.

1 Answers1

1

Confirmed on Ubuntu 18.04, haven't tested on Windows.

I have simplified your example and added instrumentation code. The first plot shows the number of blobs written for each subprocess. In the first plot, the plateaus indicate inactivity on all cores for about 0.2 seconds, and the steep rises are burst writes across all cores. The second plot shows the raw data, most useful with plotly which doesn't work in a StackOverflow answer.

Enabling gc() makes the runs longer but spreads the load more evenly, second plot below.

I have no idea what's going on. Can you replicate and experiment further with this setup? I'd appreciate feedback here or perhaps in the RSQLite issue tracker.

Basic run, without gc()

make.con <- function() {
  options(digits.secs = 6)

  con <<- DBI::dbConnect(RSQLite::SQLite(), dbname = "db.sqlite")
  DBI::dbExecute(con, "PRAGMA journal_mode = WAL;")
  DBI::dbExecute(con, "PRAGMA busy_timeout = 60000;")
  DBI::dbExecute(con, "PRAGMA synchronous = OFF;")
  DBI::dbExecute(con, "
    CREATE TABLE IF NOT EXISTS tmp (
      id INTEGER NOT NULL,
      blob BLOB NOT NULL,
      PRIMARY KEY (id)
  )")
}
make.con()
#> [1] 0

blob <- serialize(list(rand = runif(1000)), connection = NULL, xdr = FALSE)

fn <- function(x) {
  time0 <- Sys.time()
  rs <- DBI::dbSendQuery(con, "INSERT INTO tmp (blob) VALUES (:blob);")
  time1 <- Sys.time()
  DBI::dbBind(rs, params = list("blob" = list(blob)))
  time2 <- Sys.time()
  DBI::dbClearResult(rs)
  time3 <- Sys.time()
  # gc()
  time4 <- Sys.time()
  list(pid = unix::getpid(), time0 = time0, time1 = time1, time2 = time2, time3 = time3, time4 = time4)
}

n <- 1000L

parallel::setDefaultCluster(parallel::makeCluster(8L))
parallel::clusterExport(varlist = c("make.con", "blob"))
invisible(parallel::clusterEvalQ(expr = {
  make.con()
}))

data <- parallel::parLapply(X = 1:n, fun = fn, chunk.size = 50L)

parallel::stopCluster(cl = parallel::getDefaultCluster())

library(tidyverse)

tbl <-
  data %>%
  transpose() %>%
  map(unlist, recursive = FALSE) %>%
  as_tibble() %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, pid), names_to = "step", values_to = "time") %>%
  mutate(time = as.POSIXct(time, origin = "1970-01-01")) %>%
  mutate(pid = factor(pid)) %>%
  arrange(time)

tbl %>%
  group_by(pid) %>%
  mutate(cum = row_number()) %>%
  ungroup() %>%
  ggplot(aes(x = time, y = cum, color = pid)) +
  geom_line()

p <-
  tbl %>%
  ggplot(aes(x = time, y = factor(pid), group = 1)) +
  geom_path() +
  geom_point(aes(color = step))

p

plotly::ggplotly(p)

(plotly doesn't work on StackOverflow)

Created on 2020-01-30 by the reprex package (v0.3.0)

Results with gc()

krlmlr
  • 523
  • 1
  • 5
  • 17