Questions tagged [disk.frame]

30 questions
0
votes
0 answers

Unexpected symbol when concatenating strings in R

I ran into the following problem. My dataset "Sales" is stored as a disk.frame. There are two character variables, "Item-Entity" and "SBLOC", and I want to create another variable by concatenating them: Sales <- as.disk.frame(Sales) %>% mutate("Item-Loc" =…
grislepak
  • 31
  • 3
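The "unexpected symbol" in the excerpt above typically comes from the non-syntactic column names; back-ticking them inside mutate() is one fix. A minimal sketch, assuming Sales is already a disk.frame with the columns shown:

    library(disk.frame)
    library(dplyr)
    setup_disk.frame()
    # dplyr verbs run chunk-wise on a disk.frame; back-ticks handle the
    # non-syntactic names "Item-Entity" and "Item-Loc"
    Sales <- Sales %>%
      mutate(`Item-Loc` = paste0(`Item-Entity`, "-", SBLOC))
    head(collect(Sales))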
0
votes
0 answers

How many data transformations can I perform in disk.frame in R?

I have a dataset of about 16 GB. To reduce RAM usage I converted it to a disk.frame. After a few manipulations (just mutating 10 variables) I tried to move the new table back into RAM using the collect function. The error message is the following: Error:…
grislepak
  • 31
  • 3
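There is no fixed cap on chunk-wise transformations; the failure point is usually collect(), which must fit the entire result in RAM. A hedged sketch of one workaround, writing the result back to disk instead of collecting it (my_df, the variable names, and the paths are illustrative):

    library(disk.frame)
    library(dplyr)
    setup_disk.frame()
    result <- my_df %>%
      mutate(v1 = v1 * 2, v2 = log(v2)) %>%    # chunk-wise transformations
      write_disk.frame(outdir = "result.df")   # materialise on disk, not RAM
    # pull only the columns you actually need into memory
    small <- result %>% srckeep(c("v1", "v2")) %>% collect()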
0
votes
1 answer

Remove duplicate rows in a disk.frame object

I have a disk.frame object with many duplicate rows. How can I remove them? (The original data frame is 10 GB in size.)
Irene M
  • 11
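Duplicates can straddle chunk boundaries, so a chunk-wise unique() is only safe after resharding so identical rows land in the same chunk. A minimal sketch, assuming a column "id" shared by all duplicates:

    library(disk.frame)
    library(dplyr)
    setup_disk.frame()
    sharded <- shard(df, shardby = "id", outdir = "sharded.df",
                     overwrite = TRUE)
    # full-row duplicates now sit in the same chunk, so unique() per chunk
    # removes them globally
    deduped <- cmap(sharded, ~ unique(.x)) %>%
      write_disk.frame(outdir = "deduped.df", overwrite = TRUE)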
0
votes
1 answer

How to import the data in a disk.frame folder back into the R environment

There is a folder 'C:\tmp_flights.df' that was created by the disk.frame package; how can I import the data back into the R environment? Thanks! The code below created the disk.frame folder: library(disk.frame) library(nycflights13) library(tidyverse) …
anderwyang
  • 1,801
  • 4
  • 18
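Re-attaching an existing folder is what the disk.frame() constructor is for; a minimal sketch:

    library(disk.frame)
    setup_disk.frame()
    flights.df <- disk.frame("C:/tmp_flights.df")  # re-attach the folder
    flights <- collect(flights.df)                 # load it back into RAM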
0
votes
1 answer

Using disk.frame, but still hitting the memory limit

Problem: I am trying to perform a correlation test on a large dataset: the data.table fits in memory, but operating on it with Hmisc::rcorr() or corrr::correlate() eventually runs into the memory limit. > Error: cannot allocate vector of size…
Buzz B
  • 75
  • 7
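Pearson correlations only need n, the column sums, and the cross-product matrix, all of which can be accumulated one chunk at a time. A hedged sketch, assuming a disk.frame df whose columns are all numeric with no missing values:

    library(disk.frame)
    setup_disk.frame()
    n  <- 0
    s  <- NULL   # running column sums
    cp <- NULL   # running t(X) %*% X
    for (i in seq_len(nchunks(df))) {
      x  <- as.matrix(get_chunk(df, i))
      n  <- n + nrow(x)
      s  <- if (is.null(s))  colSums(x)   else s  + colSums(x)
      cp <- if (is.null(cp)) crossprod(x) else cp + crossprod(x)
    }
    mu <- s / n
    covmat <- (cp - n * tcrossprod(mu)) / (n - 1)  # sample covariance
    R <- cov2cor(covmat)                           # Pearson correlations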
0
votes
0 answers

Still getting "cannot allocate vector of size" errors despite using disk.frame in R

I've been trying to use disk.frame to load a file that's about 45 GB. I used the code below to convert the CSV to a disk.frame: output_path = file.path(tempdir(), "tmp_cars.df") disk <- csv_to_disk.frame("full-drivers.csv", outdir =…
Shazzzam
  • 1
  • 2
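One knob worth checking is in_chunk_size, which makes csv_to_disk.frame() read the file in pieces rather than all at once; collecting the full 45 GB afterwards will still blow the limit, so keep the work chunk-wise. A minimal sketch:

    library(disk.frame)
    setup_disk.frame()
    output_path <- file.path(tempdir(), "tmp_cars.df")
    # read ~1e6 rows per pass so the whole CSV never sits in RAM
    disk <- csv_to_disk.frame("full-drivers.csv", outdir = output_path,
                              in_chunk_size = 1e6)
    head(disk)  # avoid collect(disk) on the full table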
0
votes
0 answers

Summary statistics on out-of-memory file

I have a CSV file that's 120 GB in size, which is a set of numerical values grouped by categorical variables, e.g. df <- data.frame(x = c(rep("BLO", 100), rep("LR", 100)), y = runif(200)). I would like to calculate some summary statistics using…
HCAI
  • 2,213
  • 8
  • 33
  • 65
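For mergeable statistics (sums, counts, and therefore means), a two-stage aggregation keeps only small per-group partials in RAM. A hedged sketch using the x/y names from the excerpt above (file path assumed):

    library(disk.frame)
    library(dplyr)
    setup_disk.frame()
    df <- csv_to_disk.frame("big.csv", outdir = "big.df", in_chunk_size = 2e6)
    stats <- df %>%
      srckeep(c("x", "y")) %>%           # read only the needed columns
      chunk_group_by(x) %>%              # stage 1: per-chunk partials
      chunk_summarize(s = sum(y), n = n()) %>%
      collect() %>%
      group_by(x) %>%                    # stage 2: combine in RAM
      summarize(mean_y = sum(s) / sum(n))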
0
votes
0 answers

How to store where a passenger gets on and off a train whilst minimising size of file for plotting?

I have 500 GB of .csv data which includes these three (and other) variables: 1. where a passenger gets on a train, 2. where they get off, and 3. the time it takes. I need to make box plots of the time it takes based on where they got on and where they…
HCAI
  • 2,213
  • 8
  • 33
  • 65
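A box plot only needs five numbers per group, so the 500 GB can be reduced to a tiny summary table before plotting. A hedged sketch, assuming columns named on, off, and time, and sharding so each (on, off) pair sits wholly in one chunk:

    library(disk.frame)
    library(dplyr)
    library(ggplot2)
    setup_disk.frame()
    trips <- csv_to_disk.frame("trips.csv", outdir = "trips.df",
                               in_chunk_size = 2e6, shardby = c("on", "off"))
    box_stats <- trips %>%
      cmap(~ .x %>%
             group_by(on, off) %>%
             summarise(ymin = min(time), lower = quantile(time, .25),
                       middle = median(time), upper = quantile(time, .75),
                       ymax = max(time), .groups = "drop")) %>%
      collect()
    # plot from the precomputed five-number summaries
    ggplot(box_stats, aes(x = interaction(on, off))) +
      geom_boxplot(aes(ymin = ymin, lower = lower, middle = middle,
                       upper = upper, ymax = ymax), stat = "identity")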
0
votes
1 answer

Columns jumbled after using csv_to_disk.frame

I have around 15 GB of zipped data in 30-minute packages. Unzipping and reading them with either unzip and readr or fread works just fine, but the RAM requirements don't allow me to read in as many files as I wish, so I've tried to use the disk.frame…
D.J
  • 1,180
  • 1
  • 8
  • 17
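Jumbled columns after a chunked import often mean the chunks were parsed with inconsistent schemas; pinning the header and column types is one way to rule that out. A hedged sketch (file name, column names, and types are assumptions; colClasses is forwarded to the underlying reader):

    library(disk.frame)
    setup_disk.frame()
    df <- csv_to_disk.frame("readings.csv", outdir = "readings.df",
                            header = TRUE,
                            colClasses = list(character = "station",
                                              numeric = c("temp", "wind")))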
0
votes
1 answer

How should we choose the compression rate with rbindlist.disk.frame?

It's set to 50 by default on a scale of 1 to 100. I have an especially large disk.frame and I'm considering using a higher number. What are the important trade-offs to consider?
Cauder
  • 2,157
  • 4
  • 30
  • 69
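disk.frame stores its chunks as fst files, and compress maps onto fst's 0-100 compression level, so the trade-off is essentially write-time CPU versus file size; reads stay comparatively cheap. A minimal sketch of passing a higher level (paths assumed):

    library(disk.frame)
    setup_disk.frame()
    # compress = 100: smallest chunk files, slowest writes
    big <- rbindlist.disk.frame(list(df1, df2), outdir = "combined.df",
                                compress = 100)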
0
votes
2 answers

My group by doesn't appear to be working in disk.frame

I ran a group by on a large dataset (>20 GB) and it doesn't appear to be working quite right. This is my code: mydf[, .(value = n_distinct(list_of_id, na.rm = T)), by = .(week), keep = c("list_of_id",…
Cauder
  • 2,157
  • 4
  • 30
  • 69
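A likely cause: data.table syntax on a disk.frame runs once per chunk, so the distinct count is computed within each chunk, and per-chunk counts cannot simply be added. Resharding so each week lives in exactly one chunk makes the per-chunk count globally correct; a hedged sketch:

    library(disk.frame)
    library(data.table)
    setup_disk.frame()
    # put each week entirely inside one chunk
    mydf_wk <- shard(mydf, shardby = "week", outdir = "by_week.df",
                     overwrite = TRUE)
    res <- mydf_wk[, .(value = uniqueN(list_of_id, na.rm = TRUE)),
                   by = .(week),
                   keep = c("list_of_id", "week")]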
0
votes
1 answer

How does srckeep affect the underlying disk frame?

I have a disk.frame with these columns: key_a, key_b, key_c, value. Say the disk.frame is 200M rows and I'd like to group it by key_b. Additionally, I want to keep the underlying disk.frame intact and unchanged so I can later join it to something…
Cauder
  • 2,157
  • 4
  • 30
  • 69
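srckeep() only restricts which columns are read from disk for that one query; the files themselves are never rewritten, so the original disk.frame stays intact for a later join. A minimal sketch of a two-stage sum by key_b:

    library(disk.frame)
    library(dplyr)
    setup_disk.frame()
    by_b <- mydf %>%
      srckeep(c("key_b", "value")) %>%   # read 2 of the 4 columns
      chunk_group_by(key_b) %>%
      chunk_summarize(value = sum(value)) %>%
      collect() %>%
      group_by(key_b) %>%
      summarize(value = sum(value))
    # mydf on disk still holds key_a, key_b, key_c, value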
0
votes
1 answer

How do I bind two disk frames together?

I have two disk.frames, and each is about 20 GB worth of files. They are too big to merge as data.tables because the process requires more memory than I have available. I tried using this code: output <- rbindlist(list(df1, df2)). The wrinkle is that…
Cauder
  • 2,157
  • 4
  • 30
  • 69
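rbindlist.disk.frame() binds the chunk files on disk rather than in memory, so only one chunk needs to be in RAM at a time. A minimal sketch (output path assumed):

    library(disk.frame)
    setup_disk.frame()
    output <- rbindlist.disk.frame(list(df1, df2), outdir = "bound.df")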
0
votes
1 answer

How do I find out how many workers my disk.frame is using?

I am using the disk.frame package and I want to know how many workers disk.frame is using to perform its operations. I looked through the disk.frame documentation and can't find such a function.
xiaodai
  • 14,889
  • 18
  • 76
  • 140
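disk.frame parallelises through the future package, so future's own query reports the live worker count. A minimal sketch:

    library(disk.frame)
    setup_disk.frame(workers = 4)   # request 4 workers
    future::nbrOfWorkers()          # how many are actually in use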
-1
votes
2 answers

Does disk.frame allow working with large lists in R?

I am producing very big datasets (>120 GB), which are actually lists of named (100x100x3) matrices: very large lists (millions of records). They are then fed to a CNN and classified into one of 4 categories. Processing this amount of data at once…
ramen
  • 691
  • 4
  • 20
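disk.frame chunks are tabular fst files, so a list of 100x100x3 matrices does not map onto it directly. A plain-R alternative under that assumption is to persist each matrix (or batch) as its own file and stream them into training; a hedged sketch with hypothetical names:

    # mat_list is the in-memory list of named 100x100x3 matrices
    dir.create("batches", showWarnings = FALSE)
    for (i in seq_along(mat_list)) {
      saveRDS(mat_list[[i]], file.path("batches", sprintf("m%06d.rds", i)))
    }
    # later, inside the training loop, read one file at a time
    first <- readRDS(file.path("batches", "m000001.rds"))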