
I am trying to import a dataset (consisting of many CSV files) into R and then write the data into a table in a PostgreSQL database.

I successfully connected to the database and created a loop to import the CSV files, but the import fails: R returns an error because my PC runs out of memory.

My question is: is there a way to create a loop that imports the files one after another, writes each one into the PostgreSQL table and then removes it from memory? That way I would not run out of memory.

Code which returns the memory error:

# load required packages
library(RPostgreSQL)   # dbDriver, dbConnect, dbExistsTable, dbWriteTable
library(readr)         # read_csv
library(dplyr)         # %>%, bind_rows

# connect to PostgreSQL database
db_tankdata <- 'tankdaten'
host_db     <- 'localhost'
db_port     <- '5432'
db_user     <- 'postgres'
db_password <- 'xxx'
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = db_tankdata, host = host_db,
                 port = db_port, user = db_user, password = db_password)

# check if the connection was successful
dbExistsTable(con, "prices")

# function to load multiple csv files
import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, read_csv) %>% bind_rows() %>% as.data.frame()
}

# import files
prices <- import_csvfiles("path...")
dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)

Thanks in advance for the feedback!

Gonny
  • All the code pieces work; you just have to throttle down to account for RAM. If the csv files are of similar size, test how many you can handle at a time and process them in batches of that size, for example `lapply(files[1:10], read_csv)`, or only pass a subset of the files to `import_csvfiles()`. Once you have established how much you can consume, wrap your import in (horrors) a for loop to achieve the throttling; a sketch follows below. Pop open a terminal and use `htop` to keep an eye on RAM use, and close unnecessary programs. Just a way to think about the problem. HTH – Chris Apr 19 '20 at 13:26
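
A minimal sketch of the batching idea from the comment above, assuming the packages and the `con` connection from the question; the batch size of 10 is a placeholder to tune against your available RAM:

# read the csv files in batches and append each batch to the table
batch_size <- 10                       # tune to what your RAM can handle
files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
batches <- split(files, ceiling(seq_along(files) / batch_size))

for (batch in batches) {
  prices <- lapply(batch, read_csv) %>% bind_rows() %>% as.data.frame()
  dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)
  rm(prices); gc()                     # free the batch before reading the next one
}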

2 Answers


If you change the lapply() call to use an anonymous function, you can read each file and write it to the database one at a time, reducing the amount of memory required. Since lapply() acts as an implied for() loop, you don't need an extra looping mechanism.

import_csvfiles <- function(path){
  files <- list.files(path, pattern = "*.csv", recursive = TRUE, full.names = TRUE)
  lapply(files, function(x){
    prices <- read.csv(x)                                                  # read one file at a time
    dbWriteTable(con, "prices", prices, append = TRUE, row.names = FALSE)  # write it before reading the next
  })
}
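
For completeness, a hedged usage sketch; the path placeholder and the final dbDisconnect() call are assumptions, not part of the answer:

# the function writes to the database as a side effect, so no assignment is needed
import_csvfiles("path...")
dbDisconnect(con)   # assumption: close the connection once you are finished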
Len Greski

I assume the csv files you are importing to your database are very large? As far as I know, with the code you have written R first stores all of the data in a single dataframe, i.e. entirely in memory. The alternative is to read each CSV file in chunks, as you would with Python's pandas.

When calling ?read.csv I saw the following in the documentation:

nrows : the maximum number of rows to read in. Negative and other invalid values are ignored.

skip : the number of lines of the data file to skip before beginning to read data.

Why don't you try to read 5000 rows at a time into the dataframe, write them to the PostgreSQL database, and repeat until the file is exhausted? Then do that for each file.

For example, for each file (where x below stands for the path of the current file) do the following:

number_of_lines <- 5000            # number of rows to read at a time
row_skip <- 0                      # number of data rows already read
col_names <- NULL                  # filled in after the first chunk
keep_reading <- TRUE               # we will change this value to stop the while

while (keep_reading) {
    if (is.null(col_names)) {
        # first chunk: read the header together with the data
        my_data <- read.csv(x, nrows = number_of_lines)
        col_names <- colnames(my_data)
    } else {
        # later chunks: the "1 +" skips the header line, and header = FALSE
        # keeps read.csv from swallowing a data row as column names
        my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                            header = FALSE, col.names = col_names)
    }
    dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE) # write to the DB

    row_skip <- row_skip + nrow(my_data)

    # Exit statement: if fewer rows came back than requested, the file is exhausted
    if (nrow(my_data) < number_of_lines) {
        keep_reading <- FALSE
    } # end-if
} # end-while

By doing this you are breaking the csv up into smaller parts. You can play around with the number_of_lines variable: larger values mean fewer loop iterations but more memory per chunk. This may seem a bit hacky with a loop involved, but I'm sure it will work.
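
A minimal sketch of running this for every file, wrapping the chunked reader above in a hypothetical helper (chunk_to_db() and the path placeholder are illustrative names, not part of the original answer):

# hypothetical helper: the while loop above, parameterised by one file path
chunk_to_db <- function(x, con, number_of_lines = 5000) {
    row_skip <- 0
    col_names <- NULL
    keep_reading <- TRUE
    while (keep_reading) {
        if (is.null(col_names)) {
            my_data <- read.csv(x, nrows = number_of_lines)
            col_names <- colnames(my_data)
        } else {
            my_data <- read.csv(x, nrows = number_of_lines, skip = 1 + row_skip,
                                header = FALSE, col.names = col_names)
        }
        dbWriteTable(con, "prices", my_data, append = TRUE, row.names = FALSE)
        row_skip <- row_skip + nrow(my_data)
        if (nrow(my_data) < number_of_lines) keep_reading <- FALSE
    }
}

# run it for every csv file under the directory
files <- list.files("path...", pattern = "*.csv", recursive = TRUE, full.names = TRUE)
for (f in files) chunk_to_db(f, con)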

cnburger
  • Do I have to run the code for each file then? Because I have over 2,000 csv files to import. I think Len Greski's answer was the simpler way in my case. Thanks anyway! – Gonny Apr 20 '20 at 11:39
  • Yes, it would then have to be done for every csv file, but I am sure it will still be fast. If you have big memory constraints this is an additional solution: it breaks the reading of each csv file into smaller parts and limits the amount of memory allocated via the `number_of_lines` variable. My pleasure, happy to contribute! – cnburger Apr 21 '20 at 12:41