
Is there any way to combine multiple CSV files into a single large file without using the read.csv/read_csv functions?

I want to combine all the tables (CSVs) in the folder into one CSV file, since each of them represents a separate month. The folder looks like this:

list.files(folder)

 [1] "2013-07 - Citi Bike trip data.csv" "2013-08 - Citi Bike trip data.csv" "2013-09 - Citi Bike trip data.csv"
 [4] "2013-10 - Citi Bike trip data.csv" "2013-11 - Citi Bike trip data.csv" "2013-12 - Citi Bike trip data.csv"
 [7] "2014-01 - Citi Bike trip data.csv" "2014-02 - Citi Bike trip data.csv" "2014-03 - Citi Bike trip data.csv"
[10] "2014-04 - Citi Bike trip data.csv" "2014-05 - Citi Bike trip data.csv" "2014-06 - Citi Bike trip data.csv"
[13] "2014-07 - Citi Bike trip data.csv" "2014-08 - Citi Bike trip data.csv" "201409-citibike-tripdata.csv"     
[16] "201410-citibike-tripdata.csv"      "201411-citibike-tripdata.csv"      "201412-citibike-tripdata.csv"     
[19] "201501-citibike-tripdata.csv"      "201502-citibike-tripdata.csv"      "201503-citibike-tripdata.csv"     
[22] "201504-citibike-tripdata.csv"      "201505-citibike-tripdata.csv"      "201506-citibike-tripdata.csv"     
[25] "201507-citibike-tripdata.csv"      "201508-citibike-tripdata.csv"      "201509-citibike-tripdata.csv"     
[28] "201510-citibike-tripdata.csv"      "201511-citibike-tripdata.csv"      "201512-citibike-tripdata.csv"     
[31] "201601-citibike-tripdata.csv"      "201602-citibike-tripdata.csv"      "201603-citibike-tripdata.csv"     

I tried the following and did get the combined data, a large list of 33 elements taking up 3.6 GB. However, the full process took a while. Since the website is updated monthly, the growing data size will make the merging process even slower over time. Could someone help me combine all the data files without loading them into the environment? The data source can be found here: https://s3.amazonaws.com/tripdata/index.html.

filenames <- list.files(folder, full.names = TRUE)
data <- lapply(filenames, read_csv)

The resulting object looks like this, which is not the form I want. I would like one big table with all the information merged together.

> head(data)
[[1]]
Source: local data frame [843,416 x 15]

   tripduration           starttime            stoptime start station id      start station name start station latitude
          (int)              (time)              (time)            (int)                   (chr)                  (dbl)
1           634 2013-07-01 00:00:00 2013-07-01 00:10:34              164         E 47 St & 2 Ave               40.75323
2          1547 2013-07-01 00:00:02 2013-07-01 00:25:49              388        W 26 St & 10 Ave               40.74972
3           178 2013-07-01 00:01:04 2013-07-01 00:04:02              293   Lafayette St & E 8 St               40.73029
4          1580 2013-07-01 00:01:06 2013-07-01 00:27:26              531  Forsyth St & Broome St               40.71894
5           757 2013-07-01 00:01:10 2013-07-01 00:13:47              382 University Pl & E 14 St               40.73493
6           861 2013-07-01 00:01:23 2013-07-01 00:15:44              511      E 14 St & Avenue B               40.72939
7           550 2013-07-01 00:01:59 2013-07-01 00:11:09              293   Lafayette St & E 8 St               40.73029
8           288 2013-07-01 00:02:16 2013-07-01 00:07:04              224   Spruce St & Nassau St               40.71146
9           766 2013-07-01 00:02:16 2013-07-01 00:15:02              432       E 7 St & Avenue A               40.72622
10          773 2013-07-01 00:02:23 2013-07-01 00:15:16              173      Broadway & W 49 St               40.76065
..          ...                 ...                 ...              ...                     ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)

[[2]]
Source: local data frame [1,001,958 x 15]

   tripduration           starttime            stoptime start station id        start station name start station latitude
          (int)              (time)              (time)            (int)                     (chr)                  (dbl)
1           664 2013-08-01 00:00:00 2013-08-01 00:11:04              449           W 52 St & 9 Ave               40.76462
2          2115 2013-08-01 00:00:01 2013-08-01 00:35:16              254           W 11 St & 6 Ave               40.73532
3           385 2013-08-01 00:00:03 2013-08-01 00:06:28              460        S 4 St & Wythe Ave               40.71286
4           653 2013-08-01 00:00:10 2013-08-01 00:11:03              398  Atlantic Ave & Furman St               40.69165
5           954 2013-08-01 00:00:11 2013-08-01 00:16:05              319       Park Pl & Church St               40.71336
6           145 2013-08-01 00:00:37 2013-08-01 00:03:02              521           8 Ave & W 31 St               40.75045
7           331 2013-08-01 00:01:25 2013-08-01 00:06:56             2000  Front St & Washington St               40.70255
8           194 2013-08-01 00:01:26 2013-08-01 00:04:40              313 Washington Ave & Park Ave               40.69610
9           598 2013-08-01 00:01:40 2013-08-01 00:11:38              528           2 Ave & E 31 St               40.74291
10          360 2013-08-01 00:01:45 2013-08-01 00:07:45              500        Broadway & W 51 St               40.76229
..          ...                 ...                 ...              ...                       ...                    ...
Variables not shown: start station longitude (dbl), end station id (int), end station name (chr), end station latitude (dbl), end
  station longitude (dbl), bikeid (int), usertype (chr), birth year (chr), gender (int)
  • When you say merge, do you mean merge or simply append? Are you wanting to modify the data at all or just stick it all together? – jamieRowen May 25 '16 at 17:47
  • Appending them together works, but I need to exclude the repeated header rows. Is there any way to stick them together and get rid of the duplicate header lines? @jamieRowen – v15 May 26 '16 at 03:50

4 Answers


You don't need to load each CSV into R. Combine the CSVs outside of R and then load the result all at once. Here's a shell one-liner that'll do the job if you have access to Unix commands (solution from here); it keeps the header from the first file and skips the header line of every later file (FNR==1 matches a file's first line, NR!=1 rules out the very first line overall):

nawk 'FNR==1 && NR!=1{next;}{print}' *.csv > master.csv

Or using the Windows command prompt (solution from here):

@echo off
setlocal
set first=1
>master.csv.tmp (
  for %%F in (*.csv) do (
    if defined first (
      type "%%F"
      set "first="
    ) else more +1 "%%F"
  )
)
move /y master.csv.tmp master.csv >nul
Edward R. Mazurek
  • Huh, hadn't heard of nawk before, my linux solution was going to be `find -maxdepth 1 -name "*.csv" -exec cat {} \; > master.csv` – jamieRowen May 25 '16 at 17:50
  • Is there any solution for Mac? Also, I prefer to run the code directly in the script because it is actually part of a function. Thanks. – v15 May 25 '16 at 18:05
  • Here's an idea: [run a bash script from r](http://stackoverflow.com/questions/11395217/run-a-bash-script-from-an-r-script) – Edward R. Mazurek May 25 '16 at 18:13
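Following up on that last comment, here is a minimal sketch of driving the shell one-liner from inside an R script, so it can sit in a function. It assumes a Unix-like system (macOS or Linux) with awk on the PATH; folder is a placeholder for the directory that holds the CSVs.

# Build the shell command in R and run it; the combined file ends up as
# master.csv inside the folder (awk ships with macOS, so nawk is not needed).
folder <- "path/to/csv/folder"   # placeholder
cmd <- sprintf(
  "cd %s && awk 'FNR==1 && NR!=1 {next} {print}' *.csv > master.csv",
  shQuote(folder)
)
system(cmd)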

You have a list of data frames. So if you want to stack those data frames into one big data frame, do:

dplyr::bind_rows(data)
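For completeness, the full round trip could look like the sketch below, assuming data is the list already built in the question with lapply(filenames, read_csv); write_csv is just one way to save the result.

big <- dplyr::bind_rows(data)         # stack the monthly tables into one data frame
readr::write_csv(big, "master.csv")   # write the combined table to disk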

On the other hand, you can concatenate the CSVs themselves outside of R using cat (as suggested above), and you can call that from within R like this:

setwd(folder)
system("cat *.csv > full.csv")

The only problem is that the column headers will be duplicated for each of the files that you concatenated, which you might not want.
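If you do take the cat route, one way to cope with those embedded header rows is sketched below. It assumes the combined file is called full.csv as above and that the first column is tripduration, as in the question's output: read everything as character, drop the stray header lines, then let readr re-guess the column types.

library(readr)
library(dplyr)

# read all columns as character first, because the repeated header rows
# would otherwise break the automatic type guessing
full <- read_csv("full.csv", col_types = cols(.default = col_character()))

# drop the embedded header rows and re-parse the column types
full <- full %>%
  filter(tripduration != "tripduration") %>%
  type_convert()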

beanumber

You can use CMD (the Windows command prompt) and simply write:

cd C:\yourdirWhereCsvfilesExist
copy *.csv combinedfile.csv

Then you will have a single file called combinedfile.csv containing all the data.
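If you want to trigger that from an R script on Windows, a minimal sketch (note that shell() is Windows-only, and the directory below is just the placeholder from above):

setwd("C:/yourdirWhereCsvfilesExist")    # placeholder directory with the CSVs
shell("copy *.csv combinedfile.csv")     # Windows only: hands the line to cmd.exe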

I hope that will be helpful for you!

Agilanbu

I would use this one:

library(data.table)

# read every CSV in the folder with fread() and stack them into one data.table
multmerge <- function(path) {
  filenames <- list.files(path = path, full.names = TRUE)
  rbindlist(lapply(filenames, fread))
}

path <- "C:/Users/kkk/Desktop/test/test1"
mergeA <- multmerge(path)
write.csv(mergeA, "mergeA.csv")

That solution was posted in a different thread as a way to merge multiple files.
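One small possible tweak, given how large these files are: data.table also provides fwrite(), which is typically much faster than write.csv() when writing out a table of this size.

# same output file as write.csv(mergeA, "mergeA.csv"), usually far quicker
data.table::fwrite(mergeA, "mergeA.csv")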