Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large"; what counts as large depends on the situation: on the web, 1 MB or 2 MB might be large, while in an application meant to clone hard drives, 5 TB might be large. A specific number is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
12
votes
3 answers

SELECT COUNT() vs mysql_num_rows();

I have a large table (60+ million records). I'm using a PHP script to navigate through this table. The PHP script (with pagination) loads very fast because: The table engine is InnoDB thus SELECT COUNT() is very slow and mysql_num_rows() is not an…
rinchik
  • 2,642
  • 8
  • 29
  • 46
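A common way around slow InnoDB counts is to show an approximate total for the pager and never derive the count from fetched rows. A minimal sketch, assuming pymysql and a hypothetical `posts` table (the asker's real table and columns are not given in the excerpt):

```python
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="mydb")

def approximate_count(table):
    # TABLE_ROWS is only an estimate for InnoDB, but it is instant and is
    # usually good enough for a "page X of N" display.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT TABLE_ROWS FROM information_schema.TABLES "
            "WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = %s",
            (table,))
        return cur.fetchone()[0]

def fetch_page(page_no, per_page=50):
    # Fetch only the rows for the requested page; never derive the total
    # from mysql_num_rows(), which would pull the whole result set.
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM posts ORDER BY id LIMIT %s OFFSET %s",
                    (per_page, (page_no - 1) * per_page))
        return cur.fetchall()

print(approximate_count("posts"), "rows (approximate)")
print(len(fetch_page(1)), "rows on page 1")
```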
11
votes
5 answers

Extremely large weighted average

I am using 64-bit MATLAB with 32 GB of RAM (just so you know). I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by…
Micah Manary
  • 217
  • 2
  • 11
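The excerpt is cut off before the weighting scheme, so as an assumption: if the weight depends only on the distance between indices, the per-point weighted average over the whole vector is a convolution and can be computed in O(n log n) with an FFT instead of n² pairwise operations. A minimal NumPy/SciPy sketch of that idea (MATLAB's own conv/fft would do the same job):

```python
# Sketch, assuming a distance-based weight 1/(|i - j| + 1); the real weights
# from the question are unknown, only the structure matters here.
import numpy as np
from scipy.signal import fftconvolve

n = 1_300_000
x = np.random.rand(n)                      # stand-in for the 1.3M-element vector
offsets = np.arange(-(n - 1), n)           # every possible index distance i - j
w = 1.0 / (np.abs(offsets) + 1.0)          # hypothetical distance-based weights

num = fftconvolve(x, w, mode="valid")      # num[i] = sum_j w(i - j) * x[j]
den = fftconvolve(np.ones(n), w, mode="valid")  # den[i] = sum_j w(i - j)
weighted_avg = num / den                   # normalise so each point is an average
```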
11
votes
7 answers

Scalable, fast, text file backed database engine?

I am dealing with large amounts of scientific data that are stored in tab-separated .tsv files. The typical operations to be performed are reading several large files, filtering out only certain columns/rows, joining with other sources of data,…
Roman Zenka
  • 3,514
  • 3
  • 31
  • 36
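For workloads like the one above, one lightweight route is to bulk-load the .tsv files into SQLite and let SQL handle the column/row filtering and the joins. A minimal sketch, assuming pandas and two hypothetical files a.tsv and b.tsv that share a sample_id column:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("science.db")

# Stream each large TSV into SQLite in chunks so memory stays bounded.
for path, table in [("a.tsv", "a"), ("b.tsv", "b")]:
    for chunk in pd.read_csv(path, sep="\t", chunksize=100_000):
        chunk.to_sql(table, con, if_exists="append", index=False)

con.execute("CREATE INDEX IF NOT EXISTS idx_a_sample ON a(sample_id)")
con.execute("CREATE INDEX IF NOT EXISTS idx_b_sample ON b(sample_id)")

# Filtering and joining then becomes plain SQL, selecting only the needed columns.
result = pd.read_sql_query(
    "SELECT a.sample_id, a.value, b.label FROM a JOIN b USING (sample_id) "
    "WHERE a.value > 0.5", con)
print(result.head())
```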
10
votes
6 answers

High-performance multi-tier tag filtering

I have a large database of artists, albums, and tracks. Each of these items may have one or more tags assigned via glue tables (track_attributes, album_attributes, artist_attributes). There are several thousand (or even hundreds of thousands of) tags…
Chris Baker
  • 49,926
  • 12
  • 96
  • 115
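The standard relational pattern for "items carrying all of the selected tags" is an IN filter plus GROUP BY ... HAVING COUNT, which stays fast when the glue table is indexed on (attribute_id, item_id). A minimal sketch against SQLite as a stand-in; the column names are assumptions, since the excerpt only names the glue tables:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track_attributes (track_id INTEGER, attribute_id INTEGER)")
con.executemany("INSERT INTO track_attributes VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (3, 12)])
con.execute("CREATE INDEX idx_attr_track ON track_attributes(attribute_id, track_id)")

def tracks_with_all_tags(tag_ids):
    # Keep only rows for the selected tags, then require that every selected
    # tag is present for the track.
    placeholders = ",".join("?" for _ in tag_ids)
    sql = (f"SELECT track_id FROM track_attributes "
           f"WHERE attribute_id IN ({placeholders}) "
           f"GROUP BY track_id "
           f"HAVING COUNT(DISTINCT attribute_id) = ?")
    return [row[0] for row in con.execute(sql, (*tag_ids, len(tag_ids)))]

print(tracks_with_all_tags([10, 11]))   # -> [1, 3]
```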
10
votes
1 answer

Python fork(): passing data from child to parent

I have a main Python process, and a bunch of workers created by the main process using os.fork(). I need to pass large and fairly involved data structures from the workers back to the main process. What existing libraries would you recommend for…
NPE
  • 486,780
  • 108
  • 951
  • 1,012
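A minimal sketch of the plain-os.fork() route, assuming the structures are picklable: the child serialises its result into a pipe and the parent deserialises it (multiprocessing's Pipe/Queue wrap the same mechanism with less boilerplate). Unix-only, like os.fork() itself:

```python
import os
import pickle

def worker():
    # Stand-in for a large, involved structure built in the child.
    return {"phrases": {f"key{i}": {"a", "b"} for i in range(1000)}}

r, w = os.pipe()
pid = os.fork()
if pid == 0:                       # child process
    os.close(r)
    with os.fdopen(w, "wb") as out:
        pickle.dump(worker(), out, protocol=pickle.HIGHEST_PROTOCOL)
    os._exit(0)
else:                              # parent process
    os.close(w)
    with os.fdopen(r, "rb") as inp:
        result = pickle.load(inp)  # reads while the child is still writing
    os.waitpid(pid, 0)
    print(len(result["phrases"]), "entries received from child")
```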
10
votes
3 answers

How can I efficiently open 30gb of file and process pieces of it without slowing down?

I have some large files (more than 30 GB) with pieces of information on which I need to do some calculations, like averaging. The pieces I mention are slices of the file, and I know the beginning line numbers and the count of following lines for each…
E.Ergin
  • 101
  • 3
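Since the beginning line numbers and lengths are known up front, the slices can be pulled out in a single forward pass instead of re-reading 30 GB once per slice. A minimal sketch, assuming one numeric value per line and 0-based, sorted, non-overlapping (start_line, n_lines) pairs:

```python
from itertools import islice

slices = [(100, 50), (5_000, 200), (1_000_000, 1_000)]   # hypothetical slice list

def slice_averages(path, slices):
    averages = []
    with open(path) as fh:
        pos = 0                                   # current line position in the file
        for start, count in slices:
            # Skip forward to the slice without keeping the intervening lines.
            for _ in islice(fh, start - pos):
                pass
            values = [float(line) for line in islice(fh, count)]
            pos = start + count
            averages.append(sum(values) / len(values))
    return averages
```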
10
votes
3 answers

SQL Server - Merging large tables without locking the data

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with…
Josh
  • 16,286
  • 25
  • 113
  • 158
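One way to keep a daily merge from locking the target table for long stretches is to commit after every small key-range batch, so each MERGE holds its locks only briefly. A minimal sketch with pyodbc, assuming a hypothetical staging table stage_records and target records keyed by an integer id (the column names are placeholders, not the asker's schema):

```python
import pyodbc

conn = pyodbc.connect("DSN=mydb", autocommit=False)
cur = conn.cursor()

BATCH = 1000
start, stop = cur.execute("SELECT MIN(id), MAX(id) FROM stage_records").fetchone()

while start is not None and start <= stop:
    # Each MERGE only touches one small id range, so its transaction is short.
    cur.execute("""
        MERGE records AS t
        USING (SELECT * FROM stage_records WHERE id BETWEEN ? AND ?) AS s
          ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET t.value = s.value
        WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);
        """, start, start + BATCH - 1)
    conn.commit()                      # release locks after every small batch
    start += BATCH
```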
10
votes
3 answers

Symfony2 / Doctrine make $statement->execute() not "buffer" all values

I've got a basic codeset like this (inside a controller): $sql = 'select * from someLargeTable limit 1000'; $em = $this->getDoctrine()->getManager(); $conn = $em->getConnection(); $statement = $conn->prepare($sql); $statement->execute(); My…
Sarel
  • 1,210
  • 2
  • 15
  • 23
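The fix on the PHP side is Doctrine/PDO-specific, but the underlying idea, an unbuffered (server-side) cursor that streams rows instead of materialising the whole result set in client memory, can be sketched in Python with pymysql as a stand-in:

```python
import pymysql
import pymysql.cursors

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="mydb",
                       cursorclass=pymysql.cursors.SSCursor)  # unbuffered cursor

with conn.cursor() as cur:
    cur.execute("select * from someLargeTable limit 1000")
    n = 0
    for row in cur:      # rows are pulled from the server one at a time,
        n += 1           # so client memory stays flat regardless of result size
print(n, "rows streamed")
```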
10
votes
2 answers

Python defaultdict for large data sets

I am using defaultdict to store millions of phrases, so my data structure looks like mydict['string'] = set(['other', 'strings']). It seems to work ok for smaller sets but when I hit anything over 10 million keys, my program just crashes with the…
Lezan
  • 667
  • 2
  • 7
  • 20
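If the crash is memory exhaustion (tens of millions of keys, each holding a Python set, carry a lot of per-object overhead), one common move is a disk-backed store. A minimal sketch with SQLite, where each phrase/string pair becomes one indexed row; the table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect("phrases.db")
con.execute("CREATE TABLE IF NOT EXISTS phrase_map ("
            "phrase TEXT, other TEXT, PRIMARY KEY (phrase, other)) WITHOUT ROWID")

def add(phrase, other):
    # INSERT OR IGNORE gives the same "set" semantics: duplicates are dropped.
    con.execute("INSERT OR IGNORE INTO phrase_map VALUES (?, ?)", (phrase, other))

def lookup(phrase):
    return {row[0] for row in
            con.execute("SELECT other FROM phrase_map WHERE phrase = ?", (phrase,))}

add("string", "other")
add("string", "strings")
print(lookup("string"))     # -> {'other', 'strings'}
con.commit()
```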
10
votes
2 answers

Using dplyr for frequency counts of interactions, must include zero counts

My question involves writing code using the dplyr package in R. I have a relatively large dataframe (approx. 5 million rows) with 2 columns: the first with an individual identifier (id), and a second with a date (date). At present, each row indicates…
Mark T Patterson
  • 397
  • 1
  • 2
  • 10
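The question is about dplyr, so as a stand-in, here is the same "count, then reindex against the complete id/date grid so absent combinations show up as zero" idea in pandas (tidyr's complete or a join against the full grid is the analogous move in R); the example values are made up:

```python
import pandas as pd

df = pd.DataFrame({"id":   ["a", "a", "b"],
                   "date": ["2014-01-01", "2014-01-01", "2014-01-02"]})

counts = df.groupby(["id", "date"]).size()

# Reindex over every (id, date) combination so absent pairs appear with count 0.
full_grid = pd.MultiIndex.from_product(
    [df["id"].unique(), df["date"].unique()], names=["id", "date"])
counts = counts.reindex(full_grid, fill_value=0).reset_index(name="n")
print(counts)
```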
10
votes
3 answers

Split large excel file by number of rows

I have a large Excel file with about 3000 rows. I would like to split this data into groups of 100 rows. Is there a command in Excel that can help me split this data into different sheets or files for every 100th row?
Victor Njoroge
  • 353
  • 2
  • 9
  • 22
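Excel itself has no built-in "split every N rows" command, so a short script is the usual route. A minimal sketch with pandas, assuming a hypothetical single-sheet workbook big.xlsx and the openpyxl engine installed:

```python
import pandas as pd

df = pd.read_excel("big.xlsx")          # ~3000 rows fits comfortably in memory
chunk_size = 100

# Write each block of 100 rows to its own numbered workbook.
for i, start in enumerate(range(0, len(df), chunk_size)):
    chunk = df.iloc[start:start + chunk_size]
    chunk.to_excel(f"part_{i + 1:02d}.xlsx", index=False)
```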
10
votes
3 answers

Plotting a large number of points using matplotlib and running out of memory

I have a large (~6GB) text file in a simple format x1 y1 z1 x2 y2 z2 ... Since I may load this data more than once, I've created a np.memmap file for efficiency reasons: X,Y,Z = np.memmap(f_np_mmap,dtype='float32',mode='r',shape=shape).T What I'm…
Hooked
  • 84,485
  • 43
  • 192
  • 261
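If the end goal is an overview plot rather than one artist per point, the memory problem can be sidestepped by aggregating into a 2-D histogram in chunks of the memmap, so only one slice is ever resident. A minimal sketch, assuming the same (N, 3) float32 memmap layout as in the excerpt; the file name and shape here are placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

shape = (10_000_000, 3)                                   # stand-in for the real shape
X, Y, Z = np.memmap("points.dat", dtype="float32", mode="r", shape=shape).T

# Accumulate a 512x512 histogram in chunks so only one slice is in RAM at a time.
edges = [np.linspace(X.min(), X.max(), 513), np.linspace(Y.min(), Y.max(), 513)]
H = None
for start in range(0, X.shape[0], 1_000_000):
    h, _, _ = np.histogram2d(X[start:start + 1_000_000],
                             Y[start:start + 1_000_000], bins=edges)
    H = h if H is None else H + h

plt.imshow(np.log1p(H.T), origin="lower", cmap="viridis")
plt.savefig("overview.png", dpi=150)
```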
9
votes
3 answers

how to deal with large data sets with jquery isotope

I am planning on using the great Isotope plugin for displaying a list of contacts and then allowing them to be filtered. The issue I have is that it works great for a small data set, but I'm not sure of the best way of scaling it up for 1000+ pieces of…
Josh
  • 6,256
  • 2
  • 37
  • 56
9
votes
0 answers

Building a 1,000M row MySQL table

Reposted on Server Fault. Question 1: as the size of the database table gets larger, how can I tune MySQL to increase the speed of the LOAD DATA INFILE call? Question 2: would using a cluster of computers to load different csv files improve…
Ben
  • 1,030
  • 10
  • 23
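For the second question, a common baseline is to split the CSV into fixed-size parts and issue one LOAD DATA LOCAL INFILE per part, so progress is visible and the parts can be handed to several loaders if needed. A minimal sketch, assuming pymysql with local_infile enabled on both client and server, a headerless CSV, and a hypothetical table big:

```python
import itertools
import pymysql

def split_csv(path, lines_per_part=1_000_000):
    # Write the source file out as numbered parts of at most lines_per_part lines.
    parts = []
    with open(path) as src:
        for i in itertools.count():
            chunk = list(itertools.islice(src, lines_per_part))
            if not chunk:
                break
            part = f"{path}.part{i:04d}"
            with open(part, "w") as out:
                out.writelines(chunk)
            parts.append(part)
    return parts

conn = pymysql.connect(host="localhost", user="app", password="secret",
                       database="mydb", local_infile=True)
with conn.cursor() as cur:
    for part in split_csv("huge.csv"):
        cur.execute("LOAD DATA LOCAL INFILE %s INTO TABLE big "
                    "FIELDS TERMINATED BY ','", (part,))
        conn.commit()          # one commit per part keeps progress measurable
```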
9
votes
1 answer

PHP Connection Reset on Large File Upload Regardless Correct Setting

I am having a very common problem for which, it seems, none of the available solutions are working. We have a LAMP server which is receiving a high amount of traffic. Using this server, we perform a regular file submission upload. On small file…
Heru S
  • 1,283
  • 2
  • 17
  • 28