
Does anyone know how to filter a Socrata dataset on date_of_incident during the import step in R, so the read is faster?

this is what I have so far

library(RSocrata)
library(dplyr)

token <- "n15hFiXqJU6DBItiSjA4jWD2U"
PoliceIncidents <- read.socrata("https://www.dallasopendata.com/resource/qv6i-rri7.csv", app_token = token)

#filter police incident data to 2019 to present

PoliceIncidents2019to2020 <- PoliceIncidents %>% filter(servyr > 2018)

here is the source data https://www.dallasopendata.com/Public-Safety/Police-Incidents/qv6i-rri7/data

2 Answers

You can apply filters in the original query so the server only returns incidents since 2019. This speeds up the read, mostly because the server response doesn't have to transfer as much data. You'll need to use the "API field name" (here, servyr) to construct the query.

In this case:

PoliceIncidents <- read.socrata("https://www.dallasopendata.com/resource/qv6i-rri7.csv?$where=servyr > 2018", app_token = token)
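Since the question asks about filtering on the incident date (and the comment below mentions pulling only the last 3 months), a sketch of the same idea with a SoQL date comparison follows. The field name date_of_incident is taken from the question; check the dataset's "API field name" list, as the actual column may differ:

```r
library(RSocrata)

token <- "n15hFiXqJU6DBItiSjA4jWD2E"  # your own app token

# Build a $where clause for "last 3 months"; Socrata floating timestamps
# compare against ISO-8601 strings. date_of_incident is an assumption --
# verify it against the dataset's API field names.
cutoff <- format(Sys.Date() - 90, "%Y-%m-%dT00:00:00")
url <- paste0(
  "https://www.dallasopendata.com/resource/qv6i-rri7.csv",
  "?$where=date_of_incident > '", cutoff, "'"
)

RecentIncidents <- read.socrata(url, app_token = token)
```

Because the filter runs server-side, this also works well inside a Shiny app: each run re-queries the API and picks up new rows without re-downloading the full dataset.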
Tom Schenk Jr

0

For big CSVs, I like the vroom package from the tidyverse family. It's a lot faster than read_csv. With vroom, it's often easier to read the whole file and then filter.

library(vroom)
library(tidyverse)

df_raw <- vroom("Police_Incidents.csv")

occurrence_2019 <- df_raw %>%
  filter(`Year1 of Occurrence` >= 2019)

This only took like 10 seconds.
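If memory is a concern, vroom can also limit which columns get materialized via its col_select argument (vroom reads lazily, so unselected columns are never fully parsed). A minimal sketch; the column names are assumptions and should be matched to the CSV header:

```r
library(vroom)
library(dplyr)

# Read only the columns needed for the filter and downstream work.
# `Incident Number` is a hypothetical column name for illustration.
df_slim <- vroom(
  "Police_Incidents.csv",
  col_select = c(`Year1 of Occurrence`, `Incident Number`)
)

occurrence_2019 <- df_slim %>%
  filter(`Year1 of Occurrence` >= 2019)
```

This still downloads nothing from the API, though, so it doesn't address the server-side-update requirement in the comment below.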

Joe Erinjeri
  • I want it to pull directly from the API, though, so it can update server-side each time I run it in R Shiny, rather than uploading a CSV. It's too large right now, so I wanted to pull just the last 3 months in the import step instead of importing and then filtering. – Kristina Paterson Nov 30 '20 at 15:29