Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size typically exceeds the limits of the software used to analyze it.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while in an application meant to clone hard drives, 5 TB might be large. A specific number is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2,088 questions
7 votes • 2 answers

numpy: boolean indexing and memory usage

Consider the following numpy code: A[start:end] = B[mask] Here: A and B are 2D arrays with the same number of columns; start and end are scalars; mask is a 1D boolean array; (end - start) == sum(mask). In principle, the above operation can be…
NPE • 486,780 • 108 • 951 • 1,012
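The assignment in the question makes `B[mask]` materialize a temporary copy before it is written into `A`. A minimal sketch of that behaviour and one chunked workaround, with made-up shapes and chunk size:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((1_000_000, 8))
mask = rng.random(1_000_000) < 0.5      # 1-D boolean mask over B's rows
n = int(mask.sum())

A = np.empty((n + 10, 8))
start, end = 5, 5 + n                   # (end - start) == mask.sum()

# Fancy indexing on the right-hand side builds a temporary array of the
# selected rows, then copies it into A[start:end]:
A[start:end] = B[mask]

# Memory-conscious alternative: copy the masked rows in bounded chunks,
# so the temporary never holds more than `chunk` rows at a time.
idx = np.flatnonzero(mask)
chunk = 100_000
for i in range(0, len(idx), chunk):
    sel = idx[i:i + chunk]
    A[start + i:start + i + len(sel)] = B[sel]
```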
7 votes • 0 answers

Using a data generator for HDF5 file in Python and Keras

I am having trouble writing a data_generator for use with fit_generator in Keras. I have an HDF5 file with 4-dimensional numpy arrays (3-D data, with an extra singleton dimension from processing) stored as separate datasets. Each dataset is of the…
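Since the question is unanswered, here is only the general shape of such a generator: read batches lazily from h5py datasets and yield them forever. The file path, dataset names, and label scheme below are all hypothetical:

```python
import numpy as np
import h5py

def hdf5_generator(path, dataset_names, batch_size=32):
    """Yield (x, y) batches forever, reading from the HDF5 file lazily."""
    with h5py.File(path, "r") as f:
        while True:                            # fit_generator wants an endless generator
            for label, name in enumerate(dataset_names):
                ds = f[name]                   # handle only; nothing in RAM yet
                for i in range(0, ds.shape[0], batch_size):
                    x = ds[i:i + batch_size]   # only this slice is read from disk
                    y = np.full(len(x), label) # hypothetical per-dataset label
                    yield x, y

# Hypothetical usage:
# model.fit_generator(hdf5_generator("data.h5", ["run1", "run2"]),
#                     steps_per_epoch=200, epochs=10)
```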
7 votes • 2 answers

How to speed up the extraction of a large tgz file with lots of small files?

I have a tar archive (17 GB) which consists of many small files (all files < 1 MB). How do I use this archive? Do I extract it? Using 7-zip on my laptop says it will take 20 hrs (and I think it will take even more). Can I read/browse the contents of…
Vulcan • 307 • 1 • 7 • 16
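One way to browse without paying for a full extraction is to stream the archive. A sketch with Python's tarfile module (the file name is hypothetical); mode "r|gz" reads the 17 GB sequentially, once, and never unpacks to disk:

```python
import tarfile

with tarfile.open("logs.tgz", "r|gz") as tar:
    for member in tar:
        if member.isfile() and member.name.endswith(".xml"):
            fh = tar.extractfile(member)   # file-like object, no disk write
            head = fh.read(256)            # peek at the first bytes only
            print(member.name, member.size, head[:40])
```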
7 votes • 4 answers

Memory-constrained external sorting of strings, with duplicates combined & counted, on a critical server (billions of filenames)

Our server produces files like {c521c143-2a23-42ef-89d1-557915e2323a}-sign.xml in its log folder. The first part is a GUID; the second part is a name template. I want to count the number of files with the same name template. For instance, we…
Gqqnbig • 5,845 • 10 • 45 • 86
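The classic shape of an answer here is count-in-chunks plus an external merge: count templates until memory fills, spill each chunk as a sorted run, then merge the runs while summing equal keys. A sketch only, ignoring the crash-safety concerns the question also raises:

```python
import heapq
import tempfile
from collections import Counter

def spill(counts):
    """Write one sorted "template<TAB>count" run to a temp file."""
    f = tempfile.TemporaryFile(mode="w+")
    for template in sorted(counts):
        f.write(f"{template}\t{counts[template]}\n")
    f.seek(0)
    return f

def count_templates(names, max_in_memory=100_000):
    runs, counts = [], Counter()
    for name in names:                        # "{guid}-sign.xml" -> "sign.xml"
        counts[name.split("}-", 1)[-1]] += 1
        if len(counts) >= max_in_memory:
            runs.append(spill(counts))
            counts = Counter()
    if counts:
        runs.append(spill(counts))
    merged = Counter()                        # k-way merge of the sorted runs
    for line in heapq.merge(*runs):
        template, n = line.rsplit("\t", 1)
        merged[template] += int(n)
    return merged
```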
7 votes • 1 answer

How to read a large (~20 GB) XML file in R?

I want to read data from a large XML file (20 GB) and manipulate it. I tried to use "xmlParse()" but it gave me a memory issue before loading. Is there an efficient way to do this? My data dump looks like this, …
Karthick • 357 • 4 • 13
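The usual answer is event-driven (SAX-style) parsing instead of building a 20 GB DOM; in R that means the handler interfaces of the XML package. The same streaming idea, sketched in Python with xml.etree.iterparse (the <record> tag name is hypothetical):

```python
import xml.etree.ElementTree as ET

def process(elem):
    # placeholder: pull out whatever fields one record needs
    print(elem.findtext("id"))

for event, elem in ET.iterparse("dump.xml", events=("end",)):
    if elem.tag == "record":
        process(elem)
        elem.clear()   # free the finished subtree so memory stays flat
```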
7 votes • 3 answers

How can a REST API pass large JSON?

I am building a REST API and facing this issue: how can a REST API pass very large JSON? Basically, I want to connect to the database and return the training data. The problem is that the database has 400,000 records. If I wrap them into a JSON file and pass…
Freya Ren • 2,086 • 6 • 29 • 39
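The common answer is to page (or stream) the response rather than serialize 400,000 records at once. A hypothetical Flask endpoint, with an in-memory list standing in for the database:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
TRAINING_DATA = [{"id": i, "x": i * 0.5} for i in range(400_000)]  # DB stand-in

@app.route("/training-data")
def training_data():
    page = int(request.args.get("page", 1))
    size = min(int(request.args.get("size", 1000)), 10_000)   # cap page size
    rows = TRAINING_DATA[(page - 1) * size : page * size]
    return jsonify(items=rows, page=page, size=size,
                   total=len(TRAINING_DATA))

# Clients walk ?page=1, ?page=2, ... instead of fetching one giant payload;
# chunked/streamed responses or a file export are the other usual options.
```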
7 votes • 4 answers

Adding successive four / n numbers in a large matrix in R

I have a very large dataset with dimensions of 60K x 4K. I am trying to add every four successive values in every row, column-wise. The following is a smaller example dataset: set.seed(123) mat <- matrix(sample(0:1, 48, replace = TRUE), 4) …
SHRram • 4,127 • 7 • 35 • 53
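The question is in R, but the underlying trick (reshape so each group of four columns becomes its own axis, then sum along that axis) is easy to show in numpy with a small stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(123)
mat = rng.integers(0, 2, size=(4, 12))   # 4 rows, 12 columns of 0/1

n = 4                                     # group size
sums = mat.reshape(mat.shape[0], -1, n).sum(axis=2)
print(sums.shape)                         # (4, 3): one sum per group of 4 columns
```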
7 votes • 4 answers

What is the best approach to export large CSV data using PHP/MySQL?

I'm working on a project where I need to pull data from a database which contains almost 10k rows and then export it to CSV. I tried the normal method to download the CSV, but I keep getting a memory limit issue even though we already set the memory_limit to…
eugene a. • 93 • 1 • 1 • 4
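The usual fix is to stream rows to the output as they come off the cursor instead of building the whole CSV in memory (in PHP, typically an unbuffered query plus fputcsv to php://output). The shape of the idea, sketched in Python with sqlite3 standing in for MySQL and hypothetical table/column names:

```python
import csv
import sqlite3

conn = sqlite3.connect(":memory:")        # stand-in database
conn.execute("CREATE TABLE orders (id INTEGER, name TEXT, created_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"order-{i}", "2015-01-01") for i in range(10_000)])

cur = conn.execute("SELECT id, name, created_at FROM orders")
with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "created_at"])
    for row in cur:          # fetch incrementally; never fetchall()
        writer.writerow(row)
```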
7 votes • 3 answers

Replacing punctuation in a data frame based on punctuation list

Using Canopy and Pandas, I have a data frame a which is defined by: a = pd.read_csv('text.txt') df = pd.DataFrame(a) df.columns = ["test"] test.txt is a single-column file that contains a list of strings containing text, numbers, and punctuation.…
BernardL • 5,162 • 7 • 28 • 47
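One standard approach is str.translate with a deletion table built from string.punctuation, which stays fast on large frames. A sketch with inline sample data instead of the question's file:

```python
import string
import pandas as pd

df = pd.DataFrame({"test": ["hello, world!", "a.b.c 123?", "no-punct"]})

table = str.maketrans("", "", string.punctuation)  # delete all punctuation
df["test"] = df["test"].str.translate(table)
print(df)
```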
7 votes • 2 answers

How to improve performance of populating a massive tree view?

First of all, I am answering my own question Q&A style, so I don't necessarily need anyone to answer this; it's something I've learned that many can make use of. I have a tree view which consists of many different nodes. Each node has an object…
Jerry Dodge • 26,858 • 31 • 155 • 327
7 votes • 2 answers

sp_executesql or exec(@var) is too long. Maximum length is 8000

I have large queries, so I can't use a linked server in production by the rules. I pass a varchar(max) which has more than 8000 characters, but sp_executesql does not support more than 8000 characters, so how can I execute my string?
angel • 4,474 • 12 • 57 • 89
7 votes • 1 answer

Running regression tree on large dataset in R

I am working with a dataset of roughly 1.5 million observations. I am finding that running a regression tree (I am using the mob()* function from the party package) on more than a small subset of my data is taking an extremely long time (I can't run on a…
Rob Donnelly • 2,256 • 2 • 20 • 29
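A common workaround (not specific to party::mob) is to fit on growing random subsamples and watch how the fit stabilizes before committing to the full 1.5 M rows. A hedged sketch with scikit-learn and synthetic data in place of the question's model and dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((1_500_000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 1_500_000)

for n in (10_000, 50_000, 250_000):        # growing subsamples
    idx = rng.choice(len(X), size=n, replace=False)
    tree = DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx])
    print(n, round(tree.score(X[idx], y[idx]), 4))
```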
7 votes • 2 answers

Add a column to a large MySql table while online

I need to add a new column to a table in a MySQL DB (a MyISAM table) that contains more than 20 million rows. The column must be added at run-time; I mean that the app will still be running and rows will still be inserted and selected…
SimonW • 6,175 • 4 • 33 • 39
7 votes • 3 answers

Can JavaScript sort, filter, and render a very large table?

First of all, I have no idea of JavaScript's capabilities here, but I would like to know if it is possible: to read from a text file and display a very large table (a couple dozen columns and a few hundred thousand rows) in sections; not…
xyliu00 • 726 • 1 • 9 • 24
6 votes • 3 answers

Store large amount of images on multiple servers

I would like to know the best solution for storing a large amount of images on multiple servers, like Google or Facebook do. It seems that storing in the filesystem is better than inside a database, but what about using a NoSQL DB like Cassandra? Do…
Adam Paquette • 1,243 • 1 • 14 • 28