
I'd like to import data from a large PostgreSQL table. To save space, I'd like textual values to be converted to factors automatically.

For instance, the dataset has many string variables (e.g., "Male", "Female"), and if these could be imported as factors, I'd be able to load the data set with a command like,

df <- dbGetQuery(con, "select id, gender from large.table")

Instead of receiving rows like (#, "Male"), I want rows like (#, 0), so that I can save memory.

If you run the query below against a database of your choice, where "gender" is a character column, you should see that df_large is much larger than df_small.

df <- dbGetQuery(con, "select id, gender from large.table")

df_large <- df$gender
print(object.size(df_large), units="Kb")

df_small <- factor(df$gender)
print(object.size(df_small), units="Kb")
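For reference, a self-contained version of this benchmark, using simulated data in place of the database pull (the real table is behind a VPN, so the data below is a synthetic stand-in), might look like:

```r
# Synthetic stand-in for the database column: low-cardinality strings.
set.seed(1)
gender_chr <- sample(c("Male", "Female"), 1e6, replace = TRUE)
gender_fac <- factor(gender_chr)

# A character vector stores one pointer per element (8 bytes on 64-bit
# builds, with the strings themselves shared via R's global CHARSXP
# cache), while a factor stores integer codes (4 bytes) plus the levels.
print(object.size(gender_chr), units = "Mb")
print(object.size(gender_fac), units = "Mb")
```

On a 64-bit build the factor should come out at roughly half the size of the character vector, since it holds 4-byte integer codes rather than 8-byte pointers.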
d_a_c321
  • I've never used RPostgreSQL, but basically every other R db connection I've used does return character columns as factors by default. Check `str(df)` to verify that those columns really are character columns rather than factors. – joran Sep 23 '13 at 18:59
  • 2
    At one time it was true that factors saved space but improvements in R mean that its not true any more. – G. Grothendieck Sep 23 '13 at 19:27
  • I doubt that there is an easy way to do this without modifying the function written for `fetch` which is in C. The `RS_PostgreSQL_fetch` function returns a `data.frame` with characters. Without modifying that code, you will always be loading the data as a `character`, and then modifying to a `factor`. Additionally, the memory savings may not be as large as you suspect; this might not be a fruitful avenue to save memory. – nograpes Sep 23 '13 at 19:27
  • 1
    There's always `select id, case gender when 'Male' then 0 else 1 end from large.table` I suppose. That will get you numbers out of the database but the comments above suggest that you're worrying about a non-issue. – mu is too short Sep 23 '13 at 19:33
  • @G.Grothendieck – In my data, which is low cardinality, the factored variable was 2-3 times smaller. @mu is too short – it turns out that memory is an issue, and there are too many columns to make recoding feasible (especially when the values are unknown). – d_a_c321 Sep 23 '13 at 21:44
  • @dchandler. We can't really tell what you did to make that claim, since you did not provide any reproducible code to back up the measurement, but we do know that R uses a global cache for strings, so duplicated strings should normally only take up memory for one copy: http://cran.r-project.org/doc/manuals/R-ints.html#The-CHARSXP-cache – G. Grothendieck Sep 23 '13 at 22:02
  • @G.Grothendieck -- I modified my question and added code for benchmarking. The database I'm pulling from is under VPN, but the problem will hopefully be replicable. – d_a_c321 Sep 23 '13 at 22:28
  • 1
    OK. I tried it and the theory does not seem to work. Do you have enough memory to just read it into R and then convert the character columns to factor? `is.char <- sapply(DF, is.character); DF[is.char] <- lapply(DF[is.char], factor)` – G. Grothendieck Sep 23 '13 at 23:07
  • That's puzzling; I'm unsure why the problem isn't reproducible. I'm using the latest R, 3.0.1 ("Good Sport"). In the worst case, I agree that converting the columns to factors after import is a good solution. Due to memory, I may have to import the data in pieces. – d_a_c321 Sep 24 '13 at 16:03
  • Converting strings to factors doesn't save any memory. – hadley Feb 27 '15 at 21:18

1 Answer


Two points.

The first is that, if this is really an issue, then CASE is your answer:

 SELECT id, case when gender = 'Male' then 1::int else 0 end as is_male from large_table;

The second is that, if memory is a problem, then to be honest you probably don't want to import a huge number of rows from large_table and analyse them in the client. You probably want to analyse incrementally, in the database. Without knowing what you are doing, it is hard to say more, but you may want to look into aggregates, window functions, and possibly PL/R.
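If the data does have to come over as character, the post-import sweep suggested in the comments (convert every character column to a factor) can be sketched as follows; the `df` here is a hypothetical toy stand-in for the imported table, not the real data:

```r
# Toy stand-in for a data frame returned by dbGetQuery().
df <- data.frame(id = 1:4,
                 gender = c("Male", "Female", "Female", "Male"),
                 stringsAsFactors = FALSE)

# Find the character columns and convert each one to a factor in place.
is.char <- sapply(df, is.character)
df[is.char] <- lapply(df[is.char], factor)

str(df)  # gender is now a factor with levels "Female", "Male"
```

This touches only character columns, so numeric columns such as `id` are left untouched.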

Chris Travers
  • 25,424
  • 6
  • 65
  • 182