1

I have a list of CSV URLs from the net and have merged them to a vector. Now, I want to read this list with read_csv.

Example:

files <- c("csv_link1.csv",
             "csv_link2.csv",
             "csv_link3.csv",
              and so on....) 
data <- map_dfr(files, read_csv)

This is no problem. The Problem is that in the CSV files there are columns that are filled with different values. So, for example, in CSV1 there is column "V1", which is filled with double and in CSV the same column is "V1", filled with characters. Merging the CSVs does not work because they are different data types.

In my case, I think there are two possibilities to solve this.

  1. I only import certain columns, so I say read_csv only reads column (V2 and V3) but not V1

Or

  1. I merge the columns to the same data type with col_types

I have tried both, but failed because of the correct syntax.

I tried something like

data <- map_dfr(files, read_csv(cols_only(the col names)))

But, this don't work.

How can I import and merge only specific columns?

Concrete Example in my case:

library(data.table)
library(readr)
library(purrr)


files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1819/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1718/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1617/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1516/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1415/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1314/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1213/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1112/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1011/EC.csv")


data <- map_dfr(files, read_csv)

Error: Can't combine `BbAH` <character> and `BbAH` <double>.

So i the column BbAH has different data types. But i dont need this column. It would be cool if i can chose the columns which would be merged befor the merge run into an error because of this different data type problem.

pontilicious
  • 239
  • 2
  • 12
  • You can try creating a dictionary of datatypes for each column such that after a csv is read, the columns are converted to the appropriate data type based on the dictionary. Then you can merge the dataframes. – Hasan Bhagat Sep 12 '20 at 16:50
  • Hi @pontilicious, welcome to StackOveflow! To get more specific feedback, it would probably be good to also include some of your data. So maybe import the first two files as their own dataframes and then include the results of dput(head(your_datafram_name_here,10)) to show the first ten rows of your data? You can paste that into questions using the "edit" option on the bottom left horizontal menu just under your question. – Russ Thomas Sep 12 '20 at 17:10
  • @Russ Thomas, thanks for your answer. i updated my question. Hope its okay for reproduction. Greetings :) – pontilicious Sep 12 '20 at 17:33
  • Thanks @pontilicious. So you mentioned you don't need the ```BbAH``` column. Can you specifiy which columns you do need? – Russ Thomas Sep 12 '20 at 18:54
  • @Russ i only need the columns `(Date,Time,HomeTeam,AwayTeam,FTHG,FTAG,FTR)` – pontilicious Sep 12 '20 at 19:29

3 Answers3

2

We can select the columns that you need after reading the csvs and combine them using map_df.

library(tidyverse)
result <- map_df(files, ~read_csv(.x) %>% select(Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • tryed this too but the problem is that `read_csv` runs into an error because he cant combine `Season` and `Season` . After this error the `select` command will aborted – pontilicious Sep 13 '20 at 07:11
  • @pontilicious It doesn't give me any error for the files that you have shared. If you other files and there is type mismatch you can convert all of them to characters. Try `result <- map_df(files, ~read_csv(.x) %>% select(Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR) %>% mutate_all(as.character))` – Ronak Shah Sep 13 '20 at 07:31
1

How about this:

library(data.table)
library(readr)

rbindlist(lapply(files, read_csv, col_types = "character"))

This imports all columns as character, so you'd need to convert them after the merge to be whatever you initially intended.

Count Orlok
  • 997
  • 4
  • 13
1

Since you only needs those seven variables, you can read in those specific variables using fread to avoid the issue with the BbAH variable.

library(data.table)
library(dplyr)
library(purrr)

files <- c("https://www.football-data.co.uk/mmz4281/1920/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1819/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1718/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1617/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1516/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1415/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1314/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1213/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1112/EC.csv",
           "https://www.football-data.co.uk/mmz4281/1011/EC.csv")

# Identify columns you need
myColumns = c("Date","Time","HomeTeam","AwayTeam","FTHG","FTAG","FTR")

# Modified function found in https://stackoverflow.com/a/51348578/8535855
# takes a filename and a vector of columns as input
fread_allfiles <- function(file, columns){
  x <- fread(file, select = columns) %>% 
    select(everything())   # 
  return(x)
}

df_all <- files %>% 
  map_df(~ fread_allfiles(.,myColumns))

head(df_all)

which produces the following format:

         Date  Time     HomeTeam       AwayTeam FTHG FTAG FTR
1: 03/08/2019 12:30    Stockport     Maidenhead    0    1   A
2: 03/08/2019 15:00    Aldershot          Fylde    1    2   A
3: 03/08/2019 15:00       Barnet         Yeovil    1    0   H
4: 03/08/2019 15:00 Chesterfield Dover Athletic    1    2   A
5: 03/08/2019 15:00      Chorley        Bromley    0    0   D
6: 03/08/2019 15:00  Dag and Red         Woking    0    2   A

You can then reformat the Date and Time columns if needed. It looks like on the first file has any values for Time? So the rest are filled in as NA

> str(df_all)
Classes ‘data.table’ and 'data.frame':  5429 obs. of  7 variables:
 $ Date    : chr  "03/08/2019" "03/08/2019" "03/08/2019" "03/08/2019" ...
 $ Time    : chr  "12:30" "15:00" "15:00" "15:00" ...
 $ HomeTeam: chr  "Stockport" "Aldershot" "Barnet" "Chesterfield" ...
 $ AwayTeam: chr  "Maidenhead" "Fylde" "Yeovil" "Dover Athletic" ...
 $ FTHG    : int  0 1 1 1 0 0 1 1 2 1 ...
 $ FTAG    : int  1 2 0 2 0 2 0 4 2 3 ...
 $ FTR     : chr  "A" "A" "H" "A" ...
 - attr(*, ".internal.selfref")=<externalptr>
Russ Thomas
  • 938
  • 4
  • 13
  • 23