Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application meant to clone hard drives, 5 TB might be large. A specific number is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
0
votes
1 answer

compare values in different chunks using pandas

Say I have a large file loaded in memory using chunksize in pandas. Now I have to compare every value with the ones adjacent to it. My problem is that I can't seem to select at the same time the extreme values (in the first and last positions) of two…
apocalypsis
  • 520
  • 8
  • 19
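For the question above, a minimal sketch of the usual workaround: carry the last row of each chunk into the next one so the values at a chunk boundary still have a neighbour to compare against. The file name and the column name ("value") are hypothetical.

```python
import pandas as pd

prev_tail = None
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    if prev_tail is not None:
        # prepend the last row of the previous chunk so the first row of this
        # chunk has its left-hand neighbour available
        chunk = pd.concat([prev_tail, chunk])
    diffs = chunk["value"].diff()   # each value compared with the one before it
    # ... use `diffs` here; the first entry belongs to the carried-over row
    prev_tail = chunk.tail(1)       # remember the boundary row for the next chunk
```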
0
votes
1 answer

database/sql rows.scan hangs after 350K rows

I have a task to pull data from an Oracle database, and I am trying to pull a huge amount of data: > 6MM records with 100 columns for processing. I need to convert the data to a Map. I was able to process 350K records in less than 35 seconds.…
Jaya
  • 1
  • 1
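The question above is about Go's database/sql, but the usual fix is language-agnostic: fetch in bounded batches instead of scanning everything in one pass. A rough Python sketch of the same idea; the connection details, table name, and use of python-oracledb are assumptions.

```python
import oracledb  # assumption: python-oracledb is installed and the DB is reachable

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/service")  # placeholders
cur = conn.cursor()
cur.arraysize = 10_000                       # pull rows from Oracle in large round trips
cur.execute("SELECT * FROM big_table")       # hypothetical 6MM-row, 100-column table

cols = [d[0] for d in cur.description]
total = 0
while True:
    rows = cur.fetchmany()                   # returns at most `arraysize` rows
    if not rows:
        break
    batch = [dict(zip(cols, r)) for r in rows]   # row -> map, one bounded batch at a time
    total += len(batch)                      # ...process the batch here, then let it go
print(total)
```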
0
votes
1 answer

Efficient way to read and write a Networkx Graph

For an open-source project, I am trying to use NetworkX to find the attractors of a graph (called a State Transition Graph). The thing is, over nearly 2**33 loop iterations, a function with a variety of inputs returns a list of tuples (nearly 5000 tuples) in…
Uday
  • 111
  • 6
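For the question above, a small sketch of one low-overhead way to persist and reload a big graph with NetworkX: a plain-text edge list. The graph contents and file name here are made up.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from((i, (i * 31) % 10_000) for i in range(10_000))   # placeholder transitions

# An edge list is written and parsed line by line, so neither call needs to
# hold much more than one edge in memory at a time.
nx.write_edgelist(G, "state_transition_graph.edgelist", data=False)

H = nx.read_edgelist("state_transition_graph.edgelist",
                     create_using=nx.DiGraph, nodetype=int)
print(H.number_of_nodes(), H.number_of_edges())
```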
0
votes
0 answers

Setting the Dask DataFrame index from a column whose size is larger than available memory

I have a large parquet file (~1TB on disk) that I would like to process with Dask, and 512GB RAM available. One of the processing steps requires a join with a smaller DataFrame. I would like to join the DataFrames on indexes, as this should be more…
Steve OB
  • 63
  • 6
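For the question above, a sketch of the common pattern: set the index with an explicit partition count so the shuffle stays out-of-core, then join on the index. The column name, file names, and partition count are assumptions, not from the question.

```python
import dask.dataframe as dd
import pandas as pd

big = dd.read_parquet("big_table.parquet")     # ~1 TB on disk in the question
small = dd.from_pandas(pd.read_csv("small_table.csv").set_index("key"),
                       npartitions=1)          # small enough for plain pandas

# set_index shuffles the data; asking for many partitions keeps each one well
# below worker memory even though the index column itself exceeds RAM.
big = big.set_index("key", npartitions=512)

joined = big.join(small)                       # index-aligned join with the small frame
joined.to_parquet("joined.parquet")
```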
0
votes
1 answer

Error using writeRaster on a large RasterStack

I have a RasterStack in R called "preds2" that is 4.1 GB and was outputted from 4 RasterStacks and 2 RasterLayers (wveg, wfps_lag, wfps, ndvi, swt, lu): cl <- makeCluster(4) registerDoSNOW(cl) preds<-foreach(j = 1:nlayers(ndvi))%dopar%{ …
rachell
  • 19
  • 1
  • 5
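The question above is R-specific (raster::writeRaster), but the general tactic for rasters that don't fit in memory is the same everywhere: process and write block by block. A rough Python/rasterio analogue of that idea; the file paths are placeholders.

```python
import rasterio

with rasterio.open("preds2_input.tif") as src:
    profile = src.profile                       # reuse driver, dtype, CRS, transform, ...
    with rasterio.open("preds2_output.tif", "w", **profile) as dst:
        for _, window in src.block_windows(1):  # iterate over the file's native blocks
            block = src.read(window=window)     # only this block is ever in memory
            dst.write(block, window=window)     # ...transform the block here if needed
```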
0
votes
3 answers

ASP.NET application out of memory exception for no reason

Here is the deal: when my web server starts up, it creates a couple of lengthy (20M-element) arrays of really small objects (like 1-2-3 ints). The cumulative size of any individual array is NOT larger than 2GB (the limitation of the CLR, see the…
Schultz9999
  • 8,717
  • 8
  • 48
  • 87
0
votes
0 answers

Parsing a large JSON file and downloading the URLs of every object in Python

In Python I'm trying to download every single URL contained in a 180 MB JSON file. Even though it is only 180 MB, when I try to open it with a text editor it uses 5.9 GB of memory. So Jupyter crashes when I try to read the JSON and…
erikci
  • 159
  • 7
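For the question above, a sketch using a streaming parser so the whole 180 MB file is never materialised at once; it assumes the file is a top-level JSON array and that each object has a "url" field (both assumptions).

```python
import ijson      # third-party streaming JSON parser (pip install ijson)
import requests

with open("objects.json", "rb") as f:
    for obj in ijson.items(f, "item"):          # yields one array element at a time
        url = obj["url"]
        resp = requests.get(url, timeout=30)
        filename = url.rsplit("/", 1)[-1] or "download"
        with open(filename, "wb") as out:
            out.write(resp.content)
```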
0
votes
2 answers

MySQL - Executing intensive queries on live server

I'm having some issues dealing with updating and inserting millions of rows in a MySQL database. I need to flag 50 million rows in Table A, insert some data from the marked 50 million rows into Table B, then update those same 50 million rows in…
Ryan
  • 17,511
  • 23
  • 63
  • 88
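For the question above, the usual way to keep a live server responsive is to split the 50 million rows into bounded primary-key ranges and commit each range separately. A sketch with made-up table and column names, using MySQL Connector/Python.

```python
import mysql.connector  # assumption: mysql-connector-python is installed

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="mydb")  # placeholders
cur = conn.cursor()

BATCH = 10_000
last_id = 0
while True:
    cur.execute(
        "SELECT MAX(id) FROM (SELECT id FROM table_a "
        "WHERE id > %s ORDER BY id LIMIT %s) t",
        (last_id, BATCH),
    )
    (upper,) = cur.fetchone()
    if upper is None:
        break
    # flag this id range, copy it into table_b, then move the window forward
    cur.execute("UPDATE table_a SET flag = 1 WHERE id > %s AND id <= %s",
                (last_id, upper))
    cur.execute("INSERT INTO table_b (a_id, payload) "
                "SELECT id, payload FROM table_a WHERE id > %s AND id <= %s",
                (last_id, upper))
    conn.commit()        # short transactions keep lock times short on the live server
    last_id = upper
```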
0
votes
1 answer

Is it possible to use MongoDB geospatial indexes with GridFS

I have a large GeoJSON feature collection which is over 16 MB. I am hoping to insert the data into MongoDB so that I can utilize the geospatial functionality that MongoDB offers ($geoIntersects, $geoWithin, etc.). Due to the large size of the file, I…
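For the question above: GridFS stores opaque binary chunks, so a geospatial index cannot look inside it. A sketch of the usual alternative, splitting the FeatureCollection into one document per feature; the database, collection, and file names are made up.

```python
import json
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
features = client["gis"]["features"]

with open("collection.geojson") as f:
    fc = json.load(f)

# each feature easily fits under the 16 MB document limit on its own
features.insert_many(fc["features"])
features.create_index([("geometry", GEOSPHERE)])   # enables $geoIntersects / $geoWithin
```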
0
votes
0 answers

Large scale linearly-constrained convex quadratic optimization - R/Python/Gurobi

I have a series of linearly-constrained convex quadratic optimization problems that have around 100,000 variables, 1 linear constraint, and 100,000 bound constraints (the same as the number of variables; the solution has to be positive). I am…
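For the question above, a scaled-down SciPy sketch of the structure described (one linear constraint plus positivity bounds). The Hessian, linear term, and constraint here are invented stand-ins; at the real size (~100,000 variables) a sparse Hessian and a dedicated QP solver such as Gurobi or OSQP would matter.

```python
import numpy as np
import scipy.sparse as sp
from scipy.optimize import minimize, LinearConstraint, Bounds

n = 1_000                                   # stand-in for ~100,000 variables
Q = sp.eye(n, format="csr")                 # hypothetical sparse Hessian
c = np.random.default_rng(0).standard_normal(n)

def f(x):
    return 0.5 * x @ (Q @ x) + c @ x        # convex quadratic objective

def grad(x):
    return Q @ x + c

lin = LinearConstraint(np.ones((1, n)), lb=1.0, ub=1.0)   # the single linear constraint
bounds = Bounds(0.0, np.inf)                               # positivity bounds

res = minimize(f, x0=np.full(n, 1.0 / n), jac=grad,
               method="trust-constr", constraints=[lin], bounds=bounds)
print(res.status, res.fun)
```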
0
votes
1 answer

How to store a TB-sized C++ array on a cluster

I want to do a huge simulation that requires ~1 TB of data to describe a bunch of interacting particles (each has different interactions). Is it possible to store this data in a C++ array? I have access to a 60-node cluster. Each node has 2 CPUs…
Thermodynamix
  • 349
  • 2
  • 12
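Not an answer to the C++/cluster part of the question above, but a compact Python illustration of the single-node fallback such questions usually come down to: keep the array on disk and let the OS page pieces in on demand. The shape and file name are made up, and the real array would be far larger.

```python
import numpy as np

shape = (1_000_000, 128)                     # scaled-down stand-in for the ~1 TB array
data = np.memmap("interactions.dat", dtype=np.float64, mode="w+", shape=shape)

data[12345, :] = 1.0       # touching a slice pages in only that part of the file
part = data[0:1000]        # slices are views; pages load when values are accessed
data.flush()               # push dirty pages back to disk
```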
0
votes
1 answer

Django Postgres migration: Fastest way to backfill a column in a table with 100 Million rows

I have a Postgres table, Thing, that has 100 million rows. I have a column, populated over time, that stores some keys. The keys were prefixed before storing. Let's call it prefixed_keys. My task is to use the values of this column to…
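For the question above, a sketch of a keyset-paginated backfill with bulk_update. The target field new_key, the prefix value, the app path, and the batch size are assumptions; a raw batched UPDATE inside a RunPython migration is often faster still.

```python
from django.db import transaction
from myapp.models import Thing          # hypothetical app/model path

PREFIX = "prefix:"                       # hypothetical prefix
BATCH = 10_000

last_pk = 0
while True:
    rows = list(
        Thing.objects.filter(pk__gt=last_pk)
        .order_by("pk")
        .only("pk", "prefixed_keys")[:BATCH]
    )
    if not rows:
        break
    for row in rows:
        row.new_key = row.prefixed_keys[len(PREFIX):]   # hypothetical target field
    with transaction.atomic():
        Thing.objects.bulk_update(rows, ["new_key"])     # one bulk UPDATE per batch
    last_pk = rows[-1].pk
```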
0
votes
0 answers

Solving Massive Latency Issues with SQL Left Join

My computer is currently approaching hour 48 of a left join statement. This left join is meant to concatenate two matrices: one is 47 million x 3, the other 45 million x 2. The computer I'm running it on is a 9th-gen i7 with 32 GB of memory and…
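For the question above, the first thing to rule out is a missing index on the join keys; without one, every row on the left can force a scan of the right-hand table. A self-contained miniature of the idea in SQLite, with invented table and column names.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (key INTEGER, c1 REAL, c2 REAL);
    CREATE TABLE b (key INTEGER, c3 REAL);
""")
con.executemany("INSERT INTO a VALUES (?, ?, ?)", ((i, 0.0, 0.0) for i in range(200_000)))
con.executemany("INSERT INTO b VALUES (?, ?)", ((i, 1.0) for i in range(0, 200_000, 2)))

# indexes on the join keys turn per-row scans into lookups
con.execute("CREATE INDEX idx_a_key ON a(key)")
con.execute("CREATE INDEX idx_b_key ON b(key)")

cur = con.execute("SELECT a.key, a.c1, a.c2, b.c3 FROM a LEFT JOIN b ON a.key = b.key")
print(sum(1 for _ in cur))
```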
0
votes
1 answer

h5py: Merge matched lines from a huge HDF5 file into smaller datasets

I have two huge hdf5 files, each with an index of ids, and each containing different information about each of those ids. I have read one into a small masked dataset (data), using only a select few ids. I now want to add to the dataset, using…
tom davison
  • 112
  • 6
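For the question above, a sketch of reading only the rows whose ids match an existing selection; it assumes each file exposes datasets named "ids" and "values", which is not stated in the question.

```python
import numpy as np
import h5py

wanted = np.load("selected_ids.npy")            # ids already present in the small dataset

with h5py.File("second_big_file.h5", "r") as f:
    ids = f["ids"][...]                         # only the id column is read in full
    idx = np.flatnonzero(np.isin(ids, wanted))  # increasing order, as h5py requires
    matched = f["values"][idx, :]               # pulls just the matching rows from disk

print(matched.shape)
```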
0
votes
0 answers

Manipulating large data sets in MATLAB: advice on cell and numeric array operations, with performance in mind

This is a cross-post from here: Link to post in the MathWorks community. Currently I'm working with large data sets; I've saved those data sets as MATLAB files, with the two biggest files being 9.5 GB and 5.9 GB. These files contain a cell array each of…