Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because it depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application that clones hard drives, 5 TB might be. A specific number is unnecessary anyway: this tag is for questions about problems caused by too much data, regardless of how much that is.

2088 questions
9
votes
5 answers

Dealing with very large datasets & just in time loading

I have a .NET application written in C# (.NET 4.0). In this application, we have to read a large dataset from a file and display the contents in a grid-like structure. So, to accomplish this, I placed a DataGridView on the form. It has 3 columns,…
SomethingBetter
  • 1,294
  • 3
  • 16
  • 32
9
votes
5 answers

How to optimize operations on large (75,000 items) sets of booleans in Python?

There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy. The current problem seems to be related to a class called RevisionSet in the script. In essence what it does…
Vilx-
  • 104,512
  • 87
  • 279
  • 422
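Questions like this often come down to data representation. A minimal sketch (function names are ours, not from svnmerge.py): model the whole revision set as a single Python integer used as a bitmask, so union and intersection become one bitwise operation each instead of 75,000 dictionary lookups.

```python
# Hypothetical sketch: a set of revision numbers as one Python int bitmask.
# Bit i set means revision i is present.

def make_bitset(revisions):
    """Build an integer bitmask from an iterable of revision numbers."""
    bits = 0
    for r in revisions:
        bits |= 1 << r
    return bits

def to_revisions(bits):
    """Recover the sorted list of revision numbers from a bitmask."""
    revs = []
    r = 0
    while bits:
        if bits & 1:
            revs.append(r)
        bits >>= 1
        r += 1
    return revs

a = make_bitset([1, 5, 70000])
b = make_bitset([5, 9])
union = a | b          # set union
intersection = a & b   # set intersection
difference = a & ~b    # set difference (a minus b)
```

Python integers are arbitrary-precision, so a 75,000-bit mask is perfectly legal and the bitwise operators work word-at-a-time under the hood.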
9
votes
1 answer

Insert large amount of data to BigQuery via bigquery-python library

I have large CSV and Excel files. I read them and dynamically generate the needed CREATE TABLE script depending on the fields and types they contain, then insert the data into the created table. I have read this and understood that I should send…
Marlon Abeykoon
  • 11,927
  • 4
  • 54
  • 75
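A recurring pattern in questions like this is batching rows client-side so that no single request exceeds the API's payload limits. A minimal, library-agnostic sketch (the `chunked` helper is ours, not part of the bigquery-python library):

```python
# Hypothetical sketch: split an iterable of rows into fixed-size batches
# before handing each batch to an insert/load call.
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# e.g. send 10 rows in batches of 4
batches = list(chunked(range(10), 4))
```

Each yielded batch can then be passed to whatever bulk-insert call the client library provides.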
9
votes
2 answers

Methods in R for large complex survey data sets?

I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency…
charlie
  • 602
  • 4
  • 12
9
votes
0 answers

Large sparse matrix to matrix error

I want to apply mice package, but I cannot convert large sparse matrix to…
chiao-ling Chen
  • 131
  • 1
  • 4
9
votes
1 answer

Python plot Large matrix using matplotlib

I am trying to plot a matrix with 2000 columns and 200000 rows. I can plot and export the figure fine when the matrix is small, using matshow(my_matrix); show(). However, when more rows are added to my_matrix, the figure becomes very…
emily
  • 198
  • 2
  • 10
9
votes
1 answer

reading large (450000+ chars) strings from file

So, I'm dealing with integrating a legacy system. It produces a large text file that prints instructions in one large string. A really large string: we're talking 450,000 characters or more. I need to break this up into lines, one per instruction.…
Eric Olsvik
  • 103
  • 4
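For strings that size, a common approach is to stream the file in chunks and split on the instruction delimiter as you go, so the whole string never has to sit in memory at once. A minimal sketch (the delimiter and function name are made up for illustration):

```python
# Hypothetical sketch: split one huge delimiter-separated string into
# instructions while holding only one chunk in memory at a time.
import io

def iter_instructions(stream, delimiter=";", chunk_size=64 * 1024):
    """Yield delimiter-separated instructions from a text stream."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(delimiter)
        buffer = parts.pop()          # last piece may be incomplete
        for part in parts:
            yield part
    if buffer:
        yield buffer                  # trailing instruction, if any

# tiny chunk size just to exercise the buffering logic
data = io.StringIO("MOVE A;MOVE B;STOP")
result = list(iter_instructions(data, delimiter=";", chunk_size=4))
```

The same generator works on a real file object opened in text mode.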
9
votes
1 answer

fread protection stack overflow error

I'm using fread in data.table (1.8.8, R 3.0.1) in an attempt to read very large files. The file in question has 313 rows and ~6.6 million columns of numeric data, and is around 12 GB. This is a CentOS 6.4 machine with 512 GB of RAM. When I…
mpmorley
  • 93
  • 1
  • 4
9
votes
2 answers

Find Top 10 Most Frequent visited URl, data is stored across network

Source: Google Interview Question. Given a large network of computers, each keeping log files of visited URLs, find the top ten most visited URLs. Have many large &lt;string (url) -> int (visits)&gt; maps. Calculate &lt;string (url) -> int (sum of visits…
Spandan
  • 2,128
  • 5
  • 25
  • 37
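The merge step the question describes can be sketched in a few lines: each machine reduces its logs to a url → count map, and a coordinator sums the partial maps and keeps the global top ten. A minimal single-process sketch (the data is illustrative):

```python
# Hypothetical sketch: merge per-machine url->visits maps and take the
# k most visited URLs overall.
from collections import Counter
import heapq

def top_urls(partial_counts, k=10):
    """Merge url->visits dicts and return the k most visited (url, count) pairs."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)          # Counter.update sums counts key by key
    return heapq.nlargest(k, total.items(), key=lambda kv: kv[1])

machine_a = {"example.com/a": 5, "example.com/b": 2}
machine_b = {"example.com/a": 3, "example.com/c": 7}
result = top_urls([machine_a, machine_b], k=2)
```

In a real distributed setting the partial maps would arrive over the network (or be pre-aggregated map-reduce style), but the merge logic is the same.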
9
votes
4 answers

Replacing values in large table using conversion table

I am trying to replace values in a large space-delimited text-file and could not find a suitable answer for this specific problem: Say I have a file "OLD_FILE", containing a header and approximately 2 million rows: COL1 COL2 COL3 COL4 COL5 rs10 7…
KJ_
  • 336
  • 1
  • 3
  • 11
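For a 2-million-row file, the usual answer is to stream it line by line and replace tokens through a dict lookup, so the whole file never has to fit in memory. A minimal sketch (the column values and conversion entries are made up for illustration):

```python
# Hypothetical sketch: replace values in a space-delimited file using a
# conversion table, one line at a time.
import io

conversion = {"rs10": "rs10_new", "rs11": "rs11_new"}

def convert_lines(infile, outfile, table):
    header = infile.readline()
    outfile.write(header)                     # keep the header untouched
    for line in infile:
        fields = line.split()
        # swap any field found in the conversion table, keep the rest as-is
        fields = [table.get(f, f) for f in fields]
        outfile.write(" ".join(fields) + "\n")

src = io.StringIO("COL1 COL2\nrs10 7\nrs99 3\n")
dst = io.StringIO()
convert_lines(src, dst, conversion)
```

Because `dict.get` is O(1), the run time is linear in the file size regardless of how large the conversion table grows.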
9
votes
1 answer

numpy save/load corrupting an array

I am trying to save a large numpy array and reload it. Using numpy.save and numpy.load, the array values are corrupted/changed. The shape and data type of the array pre-saving and post-loading are the same, but the post-loading array has the vast…
wdwvt1
  • 91
  • 1
  • 4
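When save/load appears to corrupt values, a useful first diagnostic is an isolated round-trip check: write the array to a fresh file, read it back, and compare bit-for-bit. If this passes, the corruption is happening elsewhere (e.g. the array was mutated between save and load, or the file was overwritten). A small sketch:

```python
# Hypothetical sanity check for a numpy save/load round trip.
import numpy as np
import tempfile, os

arr = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)

with tempfile.NamedTemporaryFile(suffix=".npy", delete=False) as f:
    path = f.name
np.save(path, arr)            # .npy suffix already present, so not re-appended
loaded = np.load(path)
ok = np.array_equal(arr, loaded)
os.remove(path)
```

If `ok` is True for a freshly created array but not for the real one, compare dtypes and flags (`arr.dtype`, `arr.flags`) between the two cases.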
9
votes
6 answers

Fastest way to transfer Excel table data to SQL 2008R2

Does anyone know the fastest way to get data from an Excel table (VBA array) to a table on SQL Server 2008 R2 without using an external utility (i.e. bcp)? Keep in mind my datasets are usually 6500-15000 rows, and about 150-250 columns; and I end up…
cshenderson
  • 103
  • 1
  • 1
  • 9
8
votes
4 answers

Java library for storing and processing large (up to 600k vertices) graphs

I'm working on a project which will involve running algorithms on large graphs. The largest two have around 300k and 600k vertices (fairly sparse I think). I'm hoping to find a java library that can handle graphs that large, and also trees of a…
Maltiriel
  • 793
  • 2
  • 11
  • 28
8
votes
1 answer

Can Apache Solr Handle TeraByte Large Data

I have been an Apache Solr user for about a year. I used Solr for simple search tools, but now I want to use it with 5 TB of data. I assume that the 5 TB of data will become 7 TB when Solr indexes it, according to the filters I use. I will then add nearly 50 MB of data per…
Mustafa
  • 146
  • 2
  • 7
8
votes
1 answer

Java implementation of singular value decomposition for large sparse matrices

I'm just wondering if anyone out there knows of a java implementation of singular value decomposition (SVD) for large sparse matrices? I need this implementation for latent semantic analysis (LSA). I tried the packages from UJMP and JAMA but they…
jake
  • 1,405
  • 3
  • 19
  • 33