Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because what counts as large depends on the situation: on the web, 1 MB or 2 MB might be large, while in an application meant to clone hard drives, 5 TB might be. A specific threshold is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
6
votes
4 answers

Sorting gigantic binary files with C#

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format: byte[8]byte[4]byte[n], where n is equal to the int32 value of byte[4]. This file has no delimiters and to…
Jeffrey Kevin Pry
  • 3,266
  • 3
  • 35
  • 67
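The standard answer to this class of problem is an external merge sort: cut the file into runs that fit in memory, sort each run by the 8-byte key, spill the runs to disk, then stream-merge them. The question is about C#, but the idea is language-agnostic; below is a minimal Python sketch, assuming the 8-byte field is the sort key and the length prefix is a little-endian int32 (both assumptions, not stated in the question).

```python
import heapq
import os
import struct
import tempfile
from contextlib import ExitStack

def read_records(f):
    """Yield (key, payload) records: 8-byte key, int32 length, then payload."""
    while True:
        header = f.read(12)
        if len(header) < 12:
            return
        (n,) = struct.unpack("<i", header[8:12])  # assumed little-endian
        yield header[:8], f.read(n)

def write_run(chunk):
    """Sort one in-memory chunk by key and spill it to a temporary run file."""
    chunk.sort(key=lambda r: r[0])
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for key, payload in chunk:
            f.write(key + struct.pack("<i", len(payload)) + payload)
    return path

def external_sort(src_path, dst_path, run_bytes=512 * 2**20):
    """Sort a huge record file in bounded memory: sorted runs + k-way merge."""
    runs, chunk, size = [], [], 0
    with open(src_path, "rb") as src:
        for key, payload in read_records(src):
            chunk.append((key, payload))
            size += 12 + len(payload)
            if size >= run_bytes:
                runs.append(write_run(chunk))
                chunk, size = [], 0
    if chunk:
        runs.append(write_run(chunk))
    with open(dst_path, "wb") as dst, ExitStack() as stack:
        streams = [read_records(stack.enter_context(open(p, "rb"))) for p in runs]
        # heapq.merge holds only one record per run, so memory stays bounded.
        for key, payload in heapq.merge(*streams, key=lambda r: r[0]):
            dst.write(key + struct.pack("<i", len(payload)) + payload)
    for p in runs:
        os.remove(p)
```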
6
votes
1 answer

pivot_longer with a very big data.frame, memory efficient approaches

I have a data.frame of hospital data with 11 million rows. Columns: ID (chr), outcome (1|0), 20x ICD-10 codes (chr). Rows: 10.6 million. I wish to make the data tidy to allow modelling of diagnostic codes to a binary outcome. I would normally use…
JisL
  • 161
  • 8
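The question asks about R, where the usual memory-frugal alternative to tidyr::pivot_longer is data.table::melt. The underlying tactic, reshaping wide to long in pieces rather than all at once, is language-agnostic; here is a hedged Python/pandas sketch with hypothetical column names (ID, outcome, icd_1 … icd_20):

```python
import pandas as pd

# Hypothetical layout: one ID column, one outcome column, 20 ICD-10 columns.
icd_cols = [f"icd_{i}" for i in range(1, 21)]

def wide_to_long_chunked(src_csv, dst_csv, chunksize=500_000):
    """Melt the ICD-10 columns into (ID, outcome, code) rows one chunk at a
    time, so the full long frame never has to exist in memory at once."""
    first = True
    for chunk in pd.read_csv(src_csv, chunksize=chunksize):
        long = (chunk.melt(id_vars=["ID", "outcome"], value_vars=icd_cols,
                           value_name="code")
                     .dropna(subset=["code"]))
        long.to_csv(dst_csv, mode="w" if first else "a",
                    header=first, index=False)
        first = False
```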
6
votes
1 answer

Using dplyr in r with large dataset (4 million rows)

I'm doing some data manipulation with dplyr on my huge data frame (b). I have been able to work successfully on smaller subsets of my data, so I guess my problem is with the size of my data frame. I have a data frame that has 4 million rows and 34…
Ozgur Alptekın
  • 505
  • 6
  • 19
6
votes
3 answers

How to handle large yet not big-data datasets?

I have a ~200 GB dataset of approximately 1.5 billion observations, on which I need to run some conditional analysis and data aggregation*. The thing is that I'm not used to (nor trained to handle) large datasets. I usually work in R or Python (with some Julia…
SomePhDStudentGuy
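Before reaching for a cluster, data of this shape often yields to a chunked split-apply-combine: stream the file, filter and partially aggregate each chunk, then combine the partials. A minimal Python/pandas sketch, assuming a CSV source and hypothetical column names:

```python
import pandas as pd

def conditional_mean_by_group(path, group_col, value_col, chunksize=1_000_000):
    """Grouped mean over a file too big for RAM, built from per-chunk
    (sum, count) partials that are combined at the end."""
    partials = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk = chunk[chunk[value_col] > 0]        # example conditional filter
        partials.append(chunk.groupby(group_col)[value_col].agg(["sum", "count"]))
    total = pd.concat(partials).groupby(level=0).sum()
    return total["sum"] / total["count"]
```

Only the per-group partials stay in memory; each chunk is discarded once its contribution has been recorded.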
6
votes
2 answers

numpy.memmap for an array of strings?

Is it possible to use numpy.memmap to map a large disk-based array of strings into memory? I know it can be done for floats and suchlike, but this question is specifically about strings. I am interested in solutions for both fixed-length and…
NPE
  • 486,780
  • 108
  • 951
  • 1,012
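For the fixed-length case this does work: a dtype such as 'S16' gives every element a constant byte width, so offsets are computable and numpy.memmap behaves just as it does for floats. A small sketch (file name and sizes are placeholders):

```python
import numpy as np

# 'S16' = fixed-width 16-byte bytestrings; each element has a known offset.
arr = np.memmap("strings.dat", dtype="S16", mode="w+", shape=(1_000_000,))
arr[0] = b"hello"        # shorter values are null-padded on disk
arr.flush()

# Reopen read-only; nothing is loaded until elements are touched.
view = np.memmap("strings.dat", dtype="S16", mode="r", shape=(1_000_000,))
print(view[0])           # b'hello'
```

Variable-length strings have no computable offsets, so they cannot be memmapped directly; a common workaround is one flat bytes memmap holding the concatenated strings plus a separate integer offset array indexing into it.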
6
votes
0 answers

How do I load a large dataset into Python from MS SQL Server?

Setup: I have a pre-processed dataset on an MS SQL Server that is about 500,000,000 rows and 20 columns, where one is a rather long text column (varchar(1300)), which amounts to about 35 GB of data space on the SQL database. I'm working on the physical…
iraserd
  • 669
  • 1
  • 8
  • 26
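A common pattern here is to stream the result set instead of materialising all 500,000,000 rows at once: pandas.read_sql returns an iterator of DataFrames when given a chunksize. A sketch with a hypothetical connection string, table, and column names (requires SQLAlchemy plus an ODBC driver for SQL Server):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical DSN; adjust server, database, and driver to your setup.
engine = create_engine(
    "mssql+pyodbc://user:password@my_server/my_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

total_rows = 0
for chunk in pd.read_sql("SELECT id, long_text FROM my_table",
                         engine, chunksize=100_000):
    total_rows += len(chunk)   # replace with real per-chunk processing
print(total_rows)
```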
6
votes
3 answers

How to identify all sequential numbers not covered by 'to' and 'from' positions?

I have a data table that defines the start and end coordinates for a set of sequences. For example: df1 <- data.frame(from = c(7, 22, 35, 21, 50), to = c(13, 29, 43, 31, 60)) Given start and end coordinates (i.e. 1 and 100), I am trying…
Powege
  • 685
  • 5
  • 12
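At genomic scale the trick is to avoid enumerating every position: sort the intervals by start and sweep once, emitting uncovered ranges directly. The question is in R; here is the same sweep as a Python sketch, using the question's example data:

```python
def gap_ranges(intervals, start, end):
    """Return (from, to) ranges within [start, end] covered by no interval.
    One sort plus one pass; overlapping intervals are handled by the max()."""
    gaps, cursor = [], start
    for lo, hi in sorted(intervals):
        if lo > cursor:
            gaps.append((cursor, lo - 1))
        cursor = max(cursor, hi + 1)
    if cursor <= end:
        gaps.append((cursor, end))
    return gaps

df1 = [(7, 13), (22, 29), (35, 43), (21, 31), (50, 60)]
print(gap_ranges(df1, 1, 100))
# [(1, 6), (14, 20), (32, 34), (44, 49), (61, 100)]
```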
6
votes
1 answer

NodeJS socket.IO disconnects when sending large Json

I'm writing a multiplayer card game (like Hearthstone) with a Node.js back-end and an Angular front-end. I tried to connect the two with Socket.IO, but it turned out that if I send a JSON object over about 8,000 characters (the gameState object), then the…
Ez Az
  • 73
  • 4
6
votes
1 answer

Optimal CLion VM memory settings for very large projects

I'm currently working on a fork of a very large project with about 7-8 million LoC and 100,000+ classes. The problem is, of course, that the indexer, or CLion in general, runs out of memory or becomes very slow and unresponsive. I already saw the blog entry…
p0w3r
  • 133
  • 2
  • 13
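For what it's worth, the usual first step in any JetBrains IDE is raising the IDE's own JVM heap, either via Help | Change Memory Settings or a custom .vmoptions file. The values below are a guess for a project this size, not a JetBrains recommendation:

```
# clion64.vmoptions — opened via Help | Edit Custom VM Options
-Xmx8g
-XX:ReservedCodeCacheSize=512m
```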
6
votes
2 answers

All k nearest neighbors in 2D, C++

I need to find, for each point of the data set, all its nearest neighbors. The data set contains approx. 10 million 2D points. The data are close to a grid, but do not form a precise grid... This option excludes (in my opinion) the use of KD-trees,…
Ian
  • 169
  • 1
  • 3
  • 6
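For comparison with whatever grid-based scheme is chosen, it helps to know the baseline being ruled out: a batch all-nearest-neighbour query over ~10 million 2D points is a routine k-d tree workload. A Python/SciPy sketch of that baseline (random stand-in data):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((1_000_000, 2))     # stand-in for the ~10M real points

tree = cKDTree(pts)
# k=2: the first neighbour of each point is the point itself (distance 0).
dist, idx = tree.query(pts, k=2, workers=-1)  # workers=-1: all cores (SciPy >= 1.6)
nearest = idx[:, 1]                  # each point's nearest other point
```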
6
votes
2 answers

Maintaining a large table of unique values in MySQL

This is probably a common situation, but I couldn't find a specific answer on SO or Google. I have a large table (>10 million rows) of friend relationships on a MySQL database that is very important and needs to be maintained such that there are no…
eric
  • 1,453
  • 2
  • 20
  • 32
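The usual schema trick for undirected relationships is to store each pair in canonical order and put a UNIQUE index on the ordered pair of columns, so (A, B) and (B, A) map to the same row. A tiny Python sketch of the canonicalisation (the UNIQUE constraint itself lives in the MySQL schema):

```python
def canonical_pair(user_a: int, user_b: int) -> tuple[int, int]:
    """Order the pair so both directions of a friendship collide on one row."""
    return (user_a, user_b) if user_a < user_b else (user_b, user_a)

assert canonical_pair(42, 7) == canonical_pair(7, 42) == (7, 42)
```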
6
votes
1 answer

How to efficiently store a large Java map?

I am brute-forcing one game and I need to store data for all positions and outcomes. The data will likely be hundreds of GB in size. I considered SQL, but I am afraid that lookups in a tight loop might kill performance. The program will iterate over…
Stepan
  • 1,391
  • 18
  • 40
6
votes
1 answer

MongoDB server freeze - large amount of collections

We have a large MongoDB database (about 1.4 million collections), MongoDB 3.0 with the RocksDB engine, running on Ubuntu 14.04. This DB is located on a virtual machine (VMware vCloud) with 16 cores and 108 GB RAM (currently MongoDB uses 70 GB of memory without…
Kenny6
  • 116
  • 6
6
votes
1 answer

Memory mapped file for numpy arrays

I need to read in parts of a huge numpy array stored in a memory mapped file, process the data and repeat for another part of the array. The whole numpy array takes up around 50 GB and my machine has 8 GB of RAM. I initially created the memory…
KartMan
  • 369
  • 3
  • 19
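One pattern that fits these constraints: open the memmap read-only and copy one bounded window at a time into a genuine in-memory array, so the OS pages the file in lazily and is free to evict pages again. A minimal sketch with placeholder file name and sizes:

```python
import numpy as np

N = 6_250_000_000            # ~50 GB of float64; placeholder for the real shape
arr = np.memmap("big_array.dat", dtype=np.float64, mode="r", shape=(N,))

block = 50_000_000           # ~400 MB per window
total = 0.0
for start in range(0, N, block):
    window = np.array(arr[start:start + block])  # real copy, not a memmap view
    total += window.sum()    # replace with the actual processing step
```

Copying each window with np.array keeps the working set explicit; slices of the memmap are views that keep the mapping alive.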
6
votes
1 answer

SQL query on H2 database table throws ArrayIndexOutOfBoundsException

I have an H2 database on which some queries work, while others throw an ArrayIndexOutOfBoundsException. For example: SELECT COLUMN_1 FROM MY_TABLE; // works fine SELECT COUNT(COLUMN_1) FROM MY_TABLE; // gives the following error message: [Error…
Kaadzia
  • 1,393
  • 1
  • 14
  • 34