Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application meant to clone hard drives, 5 TB might be. A specific number is unnecessary, though, since this tag is meant for questions about problems caused by too much data, whatever that amount happens to be.

2088 questions
23
votes
3 answers

Best of breed indexing data structures for Extremely Large time-series

I'd like to ask fellow SO'ers for their opinions regarding best-of-breed data structures for indexing time-series (aka column-wise data, aka flat linear). Two basic types of time-series exist based on the sampling/discretisation…
Xander Tulip
  • 1,438
  • 2
  • 17
  • 32
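
A common baseline for this kind of question is to keep samples sorted by timestamp and binary-search the time axis for range lookups. A minimal Python sketch of that idea (the class and data are illustrative, not from the question):

    import bisect

    class TimeSeriesIndex:
        # sorted-timestamp index: O(log n) range lookups via binary search
        def __init__(self, timestamps, values):
            # assumes timestamps are already sorted ascending
            self.timestamps = timestamps
            self.values = values

        def range_query(self, t_start, t_end):
            # two binary searches bound the slice [t_start, t_end]
            lo = bisect.bisect_left(self.timestamps, t_start)
            hi = bisect.bisect_right(self.timestamps, t_end)
            return self.values[lo:hi]

    idx = TimeSeriesIndex([1, 2, 5, 9, 12], ["a", "b", "c", "d", "e"])
    print(idx.range_query(2, 9))  # ['b', 'c', 'd']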
23
votes
3 answers

Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning on big text corpora. My input data naturally do not fit in memory (this is why I want to use an online algorithm), so I am…
register
  • 801
  • 1
  • 8
  • 15
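
sklearn's MiniBatchDictionaryLearning does expose a partial_fit method for exactly this streaming use case. A minimal sketch, assuming the corpus is loaded in mini-batches from disk (the generator below is hypothetical):

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    dico = MiniBatchDictionaryLearning(n_components=100, batch_size=256)

    def iter_minibatches():
        # hypothetical loader: yields (256, 1000) blocks one at a time,
        # so the full corpus never has to fit in RAM
        for _ in range(10):
            yield np.random.rand(256, 1000)

    for batch in iter_minibatches():
        dico.partial_fit(batch)  # update the dictionary incrementally

    codes = dico.transform(np.random.rand(5, 1000))  # sparse-code new samples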
22
votes
4 answers

How to store extremely large numbers?

For example, I have a factorial program that needs to save really huge integers that can be 50+ digits long. The largest primitive data type in C++ is unsigned long long int, with a maximum value of 18446744073709551615, which is only 20 digits…
Oleksiy
  • 37,477
  • 22
  • 74
  • 122
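
In C++ the usual answer is an arbitrary-precision library such as GMP or Boost.Multiprecision. For comparison, a Python sketch where integers are arbitrary precision out of the box:

    import math

    # Python ints grow as needed, so a 52-digit factorial just works
    n = math.factorial(42)
    print(n)            # 1405006117752879898543142606244511569936384000000000
    print(len(str(n)))  # 52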
20
votes
2 answers

How do you encrypt large files / byte streams in Go?

I have some large files I would like to AES-encrypt before sending over the wire or saving to disk. While it seems possible to encrypt streams, there seem to be warnings against doing this, and instead people recommend splitting the files into…
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
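
The warnings are mostly about authenticating a stream as a whole; the commonly recommended pattern is to encrypt fixed-size chunks, each with its own counter-derived nonce and authentication tag, so chunks cannot be reordered or truncated unnoticed. A Python sketch of that pattern with the cryptography package (the question is about Go, and the 64 KiB chunk size and framing here are illustrative):

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    CHUNK = 64 * 1024  # 64 KiB plaintext per chunk

    def encrypt_stream(src, dst, key):
        aead = AESGCM(key)
        prefix = os.urandom(8)  # random per-file nonce prefix
        dst.write(prefix)
        counter = 0
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            # 12-byte nonce = 8-byte prefix || 4-byte big-endian counter,
            # so each chunk gets a unique nonce and a fixed position
            nonce = prefix + counter.to_bytes(4, "big")
            dst.write(aead.encrypt(nonce, chunk, None))  # ciphertext + 16-byte tag
            counter += 1

    key = AESGCM.generate_key(bit_length=256)
    with open("big.bin", "rb") as src, open("big.enc", "wb") as dst:
        encrypt_stream(src, dst, key)

The decryptor reads the 8-byte prefix, then fixed CHUNK + 16 byte records (the last one shorter), verifying each tag before trusting the plaintext.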
20
votes
5 answers

C Programming File Reading/Writing Technique

It is my first time creating a program that involves file reading and writing. I'm wondering what the best technique for doing this is, because when I compared my work with my classmate's, our logic was very different. You…
newbie
  • 14,582
  • 31
  • 104
  • 146
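
Whatever the surrounding logic, the usual baseline is buffered block I/O rather than reading one byte or character at a time. A small Python sketch of the copy loop (in C the analogue is fread/fwrite with a fixed buffer; the file names are placeholders):

    BLOCK = 1 << 16  # 64 KiB per read

    with open("input.dat", "rb") as src, open("output.dat", "wb") as dst:
        while True:
            block = src.read(BLOCK)  # one buffered read per block
            if not block:
                break                # EOF
            dst.write(block)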
20
votes
1 answer

Split large file according to value in single column (AWK)

I would like to split a large file (10^6 rows) according to the value in the 6th column (about 10*10^3 unique values). However, I can't get it working because of the number of records. It should be easy but it's taking hours already and I'm not…
Elmer
  • 255
  • 1
  • 2
  • 10
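
With ~10,000 distinct keys, the usual AWK pitfall is exceeding the open-file limit, or reopening an output file for every single row. One workable pattern is to cache open handles and recycle them when a cap is reached; a Python sketch of that idea (the cap of 500 and the file naming are assumptions):

    handles = {}

    def out_for(key):
        # keep at most ~500 files open; close them all and reopen on demand
        if key not in handles:
            if len(handles) >= 500:
                for h in handles.values():
                    h.close()
                handles.clear()
            handles[key] = open(f"split_{key}.txt", "a")  # append survives reopening
        return handles[key]

    with open("big_file.txt") as f:
        for line in f:
            key = line.split()[5]  # 6th whitespace-separated column
            out_for(key).write(line)

    for h in handles.values():
        h.close()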
19
votes
3 answers

Removing duplicates on very large datasets

I'm working on a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know there are potentially a few hundred thousand rows that are duplicates. I ran this code to remove them: import…
Vlad
  • 395
  • 3
  • 9
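
When the frame itself will not fit in memory, one workable approach is to stream the CSV in chunks and remember only a compact hash of each row seen so far. A pandas sketch under that assumption (file name and chunk size are illustrative; ~16M 16-byte digests still cost a GB or two of RAM):

    import hashlib
    import pandas as pd

    seen = set()
    first = True

    for chunk in pd.read_csv("data.csv", chunksize=500_000, dtype=str):
        # hash every full row into a 16-byte digest
        digests = chunk.apply(
            lambda row: hashlib.md5("\x1f".join(row.fillna("")).encode()).digest(),
            axis=1,
        )
        mask = [d not in seen for d in digests]
        seen.update(digests[mask])
        chunk[mask].to_csv("deduped.csv", mode="w" if first else "a",
                           header=first, index=False)
        first = False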
19
votes
4 answers

Large fixed effects binomial regression in R

I need to run a logistic regression on a relatively large data frame with 480,000 entries and 3 fixed-effect variables. Fixed-effect var A has 3233 levels, var B has 2326 levels, and var C has 811 levels. So all in all I have 6370 fixed effects. The…
Phil
  • 954
  • 1
  • 8
  • 22
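
One route (sketched here in Python; the question itself is in R) is to one-hot encode the fixed effects into a sparse design matrix, so the 6,370 dummy columns cost only 3 nonzeros per row, and fit the logistic regression with a sparse-aware solver:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    # hypothetical data matching the question's shape
    rng = np.random.default_rng(0)
    n = 480_000
    X_cat = np.column_stack([
        rng.integers(0, 3233, n),  # var A levels
        rng.integers(0, 2326, n),  # var B levels
        rng.integers(0, 811, n),   # var C levels
    ])
    y = rng.integers(0, 2, n)

    X = OneHotEncoder().fit_transform(X_cat)  # sparse: 3 nonzeros per row

    model = LogisticRegression(solver="saga", max_iter=200)  # accepts sparse input
    model.fit(X, y)

Note this is plain (lightly regularised) logistic regression on dummies, not a specialised fixed-effects estimator.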
17
votes
6 answers

With Haskell, how do I process large volumes of XML?

I've been exploring the Stack Overflow data dumps and thus far taking advantage of the friendly XML and “parsing” with regular expressions. My attempts with various Haskell XML libraries to find the first post in document-order by a particular user…
Greg Bacon
  • 134,834
  • 32
  • 188
  • 245
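
Language aside, the streaming recipe is the same: parse incrementally and discard each element once it has been inspected. A Python sketch with ElementTree's iterparse over a posts dump (the dump's row elements carry an OwnerUserId attribute; the file name and user id below are placeholders):

    import xml.etree.ElementTree as ET

    def first_post_by(path, user_id):
        context = ET.iterparse(path, events=("start", "end"))
        _, root = next(context)  # grab the root element
        for event, elem in context:
            if event == "end" and elem.tag == "row":
                if elem.get("OwnerUserId") == user_id:
                    return elem.attrib
                root.clear()  # drop processed rows so memory stays bounded
        return None

    print(first_post_by("posts.xml", "12345"))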
17
votes
3 answers

Python tools for out-of-core computation/data mining

I am interested in using Python to mine data sets that are too big to fit in RAM but fit on a single hard drive. I understand that I can export the data as HDF5 files using pytables. The numexpr package also allows for some basic out-of-core computation. What would come…
user17375
  • 529
  • 4
  • 14
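
Underneath pytables/numexpr, the basic out-of-core loop is just slicing an on-disk HDF5 dataset block by block. A minimal h5py sketch (the file, dataset name, and block size are illustrative):

    import h5py
    import numpy as np

    total = 0.0
    with h5py.File("data.h5", "r") as f:
        dset = f["measurements"]  # on-disk array, never fully loaded
        for start in range(0, dset.shape[0], 1_000_000):
            block = dset[start:start + 1_000_000]  # one slice in RAM at a time
            total += np.square(block).sum()        # any per-block reduction

    print(total)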
17
votes
4 answers

How can I analyse ~13GB of data?

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this: tracker.txt time torrent time peer time peer ... time torrent ... I have several files per tracker and much of the information…
WilliamMayor
  • 745
  • 6
  • 15
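
At ~13 GB, one streaming pass per question you want answered is usually enough; only counters need to live in memory. A sketch that tallies peers per torrent line by line (the two-token line layout is a simplification of the question's description):

    import glob
    from collections import Counter

    peers_per_torrent = Counter()
    current = None

    for path in glob.glob("trackers/*.txt"):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) < 2:
                    continue
                time, kind = parts[:2]  # simplified: "time torrent" / "time peer"
                if kind == "torrent":
                    current = (path, time)
                elif kind == "peer" and current is not None:
                    peers_per_torrent[current] += 1

    print(peers_per_torrent.most_common(10))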
16
votes
6 answers

check 1 billion cell-phone numbers for duplicates

It's an interview question: there are 1 billion cell-phone numbers, each 11 digits long, stored randomly in a file, for example 12345678910; the first digit must be 1. Go through these numbers to see whether there is one with…
Alcott
  • 17,905
  • 32
  • 116
  • 173
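
Since every number starts with 1, only the 10 remaining digits vary, so one bit per possible number costs 10^10 bits ≈ 1.25 GB, and a single pass over the file flags duplicates. A Python sketch of that bitmap approach (runnable, though the 1.25 GB allocation is real):

    def find_duplicates(numbers):
        # one bit per possible 10-digit suffix: 10**10 bits ≈ 1.25 GB
        bits = bytearray(10**10 // 8)
        for num in numbers:
            suffix = int(num[1:])  # drop the leading '1'
            byte, bit = divmod(suffix, 8)
            if bits[byte] & (1 << bit):
                yield num          # seen before: a duplicate
            else:
                bits[byte] |= 1 << bit

    dups = find_duplicates(["12345678910", "12345678910", "19999999999"])
    print(list(dups))  # ['12345678910']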
16
votes
3 answers

Dealing with huge data in select boxes

Hi, I am using jQuery and retrieving "items" from one of my MySQL tables. I have around 20,000 "items" in that table, and it is going to be used as a search parameter in my form. So basically users can search for "purchases" which contain that…
Girish Dusane
  • 1,120
  • 4
  • 12
  • 19
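
The usual fix is not to ship all 20,000 options to the browser: the autocomplete widget sends the typed prefix, and the server returns only the top matches. A hypothetical Flask endpoint sketching the server side (jQuery UI's autocomplete sends the text as a term parameter; the table and database here are made up):

    import sqlite3
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/items")
    def items():
        term = request.args.get("term", "")
        con = sqlite3.connect("shop.db")  # hypothetical database
        rows = con.execute(
            "SELECT name FROM items WHERE name LIKE ? LIMIT 20",
            (term + "%",),  # prefix match, capped at 20 rows
        ).fetchall()
        con.close()
        return jsonify([name for (name,) in rows])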
16
votes
1 answer

Moore-Penrose generalized inverse of a large sparse matrix

I have a square matrix with a few tens of thousands of rows and columns, containing only a few 1s among tons of 0s, so I use the Matrix package to store it in R efficiently. The base::matrix object cannot handle that number of cells, as I run out of…
daroczig
  • 28,004
  • 7
  • 90
  • 124
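
A dense pseudoinverse at that scale is hopeless, but a truncated sparse SVD yields a rank-k approximation of it that you never have to materialise. A Python/scipy sketch (the question is in R; the matrix, rank k, and threshold are illustrative):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # stand-in for the question's huge, very sparse 0/1 matrix
    A = sparse_random(5000, 5000, density=1e-3, format="csr")

    u, s, vt = svds(A, k=20)  # k largest singular triplets, sparse-friendly
    keep = s > 1e-10          # invert only the nonzero singular values

    # apply pinv(A) to a vector without forming the 5000x5000 inverse:
    # pinv(A) @ b = V @ diag(1/s) @ U^T @ b
    b = np.ones(5000)
    x = vt[keep].T @ ((u[:, keep].T @ b) / s[keep])
    print(x.shape)  # (5000,)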
16
votes
2 answers

Why does MongoDB take up so much space?

I am trying to store records with a set of doubles and ints (around 15-20) in MongoDB. The records mostly (99.99%) have the same structure. When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5GB for…
xcorat
  • 1,434
  • 2
  • 17
  • 34
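
Part of the answer is that BSON stores every field name inside every document (on top of per-document overhead and preallocated data files), which for 15-20 numeric fields can outweigh the numbers themselves. A small sketch with pymongo's bson module making that visible (the field names are made up):

    import bson

    doc_long = {"temperature_celsius": 21.5, "pressure_hpa": 1013, "run_id": 7}
    doc_short = {"t": 21.5, "p": 1013, "r": 7}

    # BSON embeds the key strings in each document, so shorter keys
    # shrink every single record on disk
    print(len(bson.BSON.encode(doc_long)))   # larger
    print(len(bson.BSON.encode(doc_short)))  # smaller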