Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size usually exceeds the limits of the software used to analyze it.

A large amount of data. There is no exact number that defines "large", because what counts as large depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application that clones hard drives, 5 TB might be. A specific threshold is also unnecessary, since this tag is for questions about problems caused by too much data, however much that turns out to be.

2088 questions
0
votes
1 answer

How to add wait-time to complete a request and get the response in Rest-Assured

I am trying to do an upload via API. When the data is small, e.g. 100 rows, the upload works fine and I get the response as expected, but when the upload is large, e.g. 1M rows, the test fails with a timeout exception. How can I handle this? Is…
Mano Kugan
  • 207
  • 6
  • 23
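A common way to address this (an assumption about the setup, not taken from the question) is to raise the timeouts of the Apache HttpClient that Rest-Assured uses underneath. A minimal sketch, with illustrative timeout values:

```java
import io.restassured.RestAssured;
import io.restassured.config.HttpClientConfig;
import io.restassured.config.RestAssuredConfig;

public class UploadTimeoutConfig {
    public static void apply() {
        // Raise the underlying Apache HttpClient timeouts so a large
        // upload has time to finish; the values below are illustrative.
        RestAssured.config = RestAssuredConfig.config().httpClient(
                HttpClientConfig.httpClientConfig()
                        .setParam("http.connection.timeout", 60_000)   // connect: 60 s
                        .setParam("http.socket.timeout", 600_000));    // read: 10 min
    }
}
```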
0
votes
1 answer

How to optimize very large array construction with numpy or scipy

I am processing very large data sets in 64-bit Python and need some help optimizing my interpolation code. I am used to using numpy to avoid loops, but here there are 2 loops that I can't find a way to avoid. The main problem is also that the size…
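The question's code is cut off, so as a generic illustration only: nested Python loops that fill an array pairwise can usually be replaced by a single broadcast expression, which is where most of the speedup in problems like this comes from.

```python
import numpy as np

x = np.random.rand(5000)
y = np.random.rand(5000)

# Loop version (slow): two nested Python loops filling d element by element.
# d = np.empty((x.size, y.size))
# for i in range(x.size):
#     for j in range(y.size):
#         d[i, j] = abs(x[i] - y[j])

# Broadcast version: one vectorised expression, no Python-level loops.
d = np.abs(x[:, None] - y[None, :])
```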
0
votes
3 answers

Which delete statement is better for deleting millions of rows

I have a table which contains millions of rows. I want to delete all the data which is over a week old, based on the value of column last_updated. So here are my two queries, Approach 1: Delete from A where to_date(last_updated,'yyyy-mm-dd')<…
Shweta
  • 219
  • 5
  • 18
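Beyond choosing between the two (truncated) queries, a pattern often recommended for deletes of this size is to work in batches so undo and locking stay bounded, and to compare against the raw column rather than wrapping it in TO_DATE, which defeats any index. A sketch, assuming Oracle (matching the question's to_date syntax) and a text column in 'yyyy-mm-dd' form:

```sql
-- Delete rows older than a week, 50,000 at a time (batch size illustrative).
-- Comparing the raw column to a precomputed string keeps an index usable,
-- since 'yyyy-mm-dd' strings sort the same way as the dates they encode.
DELETE FROM A
 WHERE last_updated < TO_CHAR(SYSDATE - 7, 'yyyy-mm-dd')
   AND ROWNUM <= 50000;
COMMIT;
-- Re-run until the statement deletes zero rows.
```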
0
votes
1 answer

Scikit Learn implementation of DBSCAN for 0.7 million data points with 2 columns (Lat and Long) consumes 128GB+ RAM. How to fix this memory issue?

We are facing memory issues while running scikit-learn's DBSCAN on 0.7 million data points with 2 columns (latitude and longitude). We also tried changing epsilon values to small numbers and reducing the minimum number of required points for…
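One commonly suggested mitigation (an assumption about the setup, not from the question) is to feed DBSCAN radian coordinates with the haversine metric and a ball tree, which avoids materialising a dense distance matrix. A sketch on synthetic points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the real 0.7M-point table.
rng = np.random.default_rng(0)
latlon = rng.uniform([-90, -180], [90, 180], size=(100_000, 2))

coords = np.radians(latlon)           # haversine expects radians
km_per_radian = 6371.0                # Earth's mean radius
db = DBSCAN(eps=1.5 / km_per_radian,  # ~1.5 km neighbourhood (illustrative)
            min_samples=10,
            algorithm='ball_tree',
            metric='haversine').fit(coords)
labels = db.labels_                   # -1 marks noise points
```

If memory still blows up, the usual next steps are clustering a sample, or deduplicating near-identical coordinates and passing the counts as sample_weight.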
0
votes
0 answers

Large dataset: None of the configured nodes are available for large dataset

There are a lot of questions about this error, but none for this condition. I am running Elasticsearch 5.4.1 with a Java client (1.8) which uses the API to make Elasticsearch calls. It is on a Mac. I have a large number of documents which I have to…
user2689782
  • 747
  • 14
  • 31
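The question is truncated, but with a 5.x transport client this error most often means a cluster.name mismatch or connecting to the HTTP port (9200) instead of the transport port (9300). A minimal sketch of the usual client setup, with illustrative values:

```java
import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class EsClientFactory {
    public static TransportClient create() throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")    // must match the server's cluster.name
                .put("client.transport.sniff", false) // sniffing can drop nodes it cannot reach
                .build();
        return new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 9300)); // transport port, not 9200
    }
}
```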
0
votes
1 answer

Selecting Distinct Items within Array using PowerShell and Linq

I have been banging my head on this problem for a few hours. I have a multi-dimensional array and I need to select the unique items based on two "columns". Is there an efficient way, in .NET or otherwise, to do this and achieve the desired output? The…
BDubs
  • 73
  • 1
  • 14
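The excerpt stops before the data layout, but for objects with named properties PowerShell can do this without dropping into Linq: Sort-Object -Unique treats the listed properties as the distinctness key. A sketch on hypothetical rows:

```powershell
# Hypothetical rows with duplicate (Name, Mac) combinations.
$rows = @(
    [pscustomobject]@{ Name = 'PC1'; Mac = 'AA:BB'; Seen = '2019-01-01' }
    [pscustomobject]@{ Name = 'PC1'; Mac = 'AA:BB'; Seen = '2019-02-01' }
    [pscustomobject]@{ Name = 'PC2'; Mac = 'CC:DD'; Seen = '2019-01-15' }
)

# Distinct by two "columns" at once; only the first row per key survives.
$unique = $rows | Sort-Object -Property Name, Mac -Unique
$unique | Format-Table
```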
0
votes
2 answers

Read large CSV in PowerShell, parse multiple columns for unique values, save results based on oldest value in column

I have a large 10-million-row file (currently CSV). I need to read through the file and remove duplicate items based on multiple columns. An example line of data would look something like: ComputerName, IPAddress, MacAddress, CurrentDate,…
BDubs
  • 73
  • 1
  • 14
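A grouping approach works here, using the column names from the question's example line; keeping the oldest row per key is handled by the sort inside each group. For 10 million rows, note that a hashtable keyed on the duplicate columns streams with far less memory than Group-Object. A sketch with illustrative file names:

```powershell
# Keep, for each (ComputerName, MacAddress) pair, the row whose
# CurrentDate is oldest.
Import-Csv .\input.csv |
    Group-Object -Property ComputerName, MacAddress |
    ForEach-Object {
        $_.Group | Sort-Object { [datetime]$_.CurrentDate } | Select-Object -First 1
    } |
    Export-Csv .\deduped.csv -NoTypeInformation
```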
0
votes
1 answer

Cluster analysis of large dataset containing only categorical variables

I have been given the task of clustering our customers based on products they bought together. My data contains 500,000 rows, one per customer, and 8,000 variables (product ids). Each variable is a one-hot encoded indicator that shows whether a customer…
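At 500,000 × 8,000 a dense matrix is around 4 billion cells, so whatever clusterer is used, the one-hot data has to stay sparse. k-modes is the textbook choice for purely categorical data; a cheaper sketch that often works in practice is MiniBatchKMeans on a sparse matrix (the synthetic data below stands in for the real table):

```python
import scipy.sparse as sparse
from sklearn.cluster import MiniBatchKMeans

# Sparse stand-in for the 500k x 8k one-hot purchase matrix.
X = sparse.random(500_000, 8_000, density=0.001, format='csr', random_state=0)

km = MiniBatchKMeans(n_clusters=20,      # number of segments, illustrative
                     batch_size=10_000,  # rows per mini-batch
                     random_state=0)
labels = km.fit_predict(X)               # cluster id per customer
```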
0
votes
0 answers

Cannot import CSV file into R due to size

I'm trying to import a large (14+ GB) file into RStudio for use in a project, but I've run into some roadblocks. I installed the 'ff' package to make this easier, but I keep hitting bugs that I don't know how to fix. This is the code that…
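The asker's code is cut off, but a route that often sidesteps ff entirely is data.table::fread, which reads multi-gigabyte CSVs quickly and can load only the needed columns. A sketch with illustrative names:

```r
library(data.table)

# Read only the columns the project needs; 'select' keeps memory
# proportional to those columns rather than the full 14+ GB file.
dt <- fread("big_file.csv", select = c("id", "value", "timestamp"))
```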
0
votes
1 answer

Move/copy millions of images from macOS to external drive to Ubuntu server

I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my MacBook Pro but want to get them to our DGX1 (GPU cluster) somehow. I thought it would be faster to…
SciGuy
  • 115
  • 4
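The excerpt ends mid-thought, but for tens of millions of small files the usual advice is rsync (resumable, skips already-copied files) or a single tar stream (avoids per-file overhead). Hosts and paths below are illustrative:

```sh
# Resumable copy straight to the server; -a preserves metadata.
rsync -a --progress /Volumes/dataset/ user@dgx1:/data/dataset/

# Alternative: one tar stream over ssh, faster for millions of tiny files.
tar -cf - -C /Volumes dataset | ssh user@dgx1 'tar -xf - -C /data'
```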
0
votes
2 answers

How to apply MINUS efficiently in a MySQL query for tables with large data

I have 2 tables as follows: CREATE TABLE IF NOT EXISTS `nl_members` ( `member_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `member_email` varchar(100) COLLATE utf8_unicode_ci NOT NULL, `member_confirmation_code` varchar(35) COLLATE…
anjan
  • 3,147
  • 6
  • 26
  • 31
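MySQL (before 8.0) has no MINUS/EXCEPT, so the stock replacement is an anti-join. The second table is cut off in the excerpt, so its name below is hypothetical:

```sql
-- Rows of nl_members with no match in the (hypothetical) excluded table.
-- An index on both member_email columns keeps this fast on large tables.
SELECT m.member_id, m.member_email
FROM   nl_members AS m
LEFT JOIN excluded_members AS x
       ON x.member_email = m.member_email
WHERE  x.member_email IS NULL;
```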
0
votes
1 answer

How to couple datapoints in the same table based on timestamps?

I have a big MySQL table that contains values for all kinds of data (each with a different data_id) and a timestamp (unix timestamp in ms). I am trying to build a (real-time) plotter for all this data and I want to be able to plot any data on the…
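The excerpt is truncated, but coupling rows with different data_ids usually comes down to a self-join constrained to a tolerance window, since millisecond timestamps rarely match exactly. A sketch with illustrative names and a 500 ms window:

```sql
-- Pair each value of series 1 with values of series 2 recorded
-- within 500 ms of it; table and column names are illustrative.
SELECT a.timestamp, a.value AS value_1, b.value AS value_2
FROM   measurements AS a
JOIN   measurements AS b
       ON  b.data_id = 2
       AND b.timestamp BETWEEN a.timestamp - 500 AND a.timestamp + 500
WHERE  a.data_id = 1;
```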
0
votes
0 answers

How to filter stocks in a large dataset - python

I have a large dataset of over 20,000 stocks from 1964-2018. (It's CRSP data I got from my university). I now want to apply the following filter technique according to Amihud (2002): 1. include all stocks that have a price greater than $5 at end of…
Sebastian
  • 13
  • 4
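The filter's first rule (price above $5 at the end of the prior year) maps to a groupby-then-mask in pandas. The frame below is a tiny hypothetical stand-in for the CRSP panel, with guessed column names:

```python
import pandas as pd

# Tiny stand-in for the CRSP panel; permno identifies a stock.
df = pd.DataFrame({
    'permno': [1, 1, 2, 2],
    'date':   pd.to_datetime(['1999-12-31', '2000-06-30',
                              '1999-12-31', '2000-06-30']),
    'prc':    [4.50, 6.00, 10.00, 11.00],
})

# Stocks priced above $5 at the end of 1999 ...
eoy = df[df['date'] == '1999-12-31'].set_index('permno')['prc']
keep = eoy[eoy > 5].index

# ... are the only ones eligible for the year 2000.
filtered = df[(df['date'].dt.year == 2000) & df['permno'].isin(keep)]
```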
0
votes
0 answers

Subset of features on external memory

I have a large file that I'm not able to load into memory, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument on slice is "currently not used", and there is no…
user5029763
  • 1,903
  • 1
  • 15
  • 23
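Since slice cannot drop columns, the workaround is usually to reduce the source file itself before building the DMatrix, for example by streaming it through pandas in chunks. File and column names below are illustrative:

```python
import pandas as pd
import xgboost as xgb

wanted = ['f1', 'f7', 'f42', 'label']     # the feature subset + target

# Stream the big file and keep only the wanted columns on disk.
first = True
for chunk in pd.read_csv('big.csv', usecols=wanted, chunksize=1_000_000):
    chunk.to_csv('subset.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False

# The reduced file may now fit in memory as an ordinary DMatrix.
df = pd.read_csv('subset.csv')
dtrain = xgb.DMatrix(df.drop(columns='label'), label=df['label'])
```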
0
votes
3 answers

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort(), but four million × 8 bytes is one huge chunk of contiguous memory. What's the easiest way to do this? I rate…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
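For context on the scale: four million long longs is only 32 MB, which malloc() handles comfortably on modern systems, so plain qsort() is usually the whole answer. A sketch (the data is synthetic; the comparator avoids the overflow of returning x - y):

```c
#include <stdio.h>
#include <stdlib.h>

/* Compare two long longs without the overflow risk of returning x - y. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    size_t n = 4000000;                       /* 4M * 8 bytes = 32 MB */
    long long *buf = malloc(n * sizeof *buf);
    if (!buf) { perror("malloc"); return 1; }

    for (size_t i = 0; i < n; i++)            /* synthetic data */
        buf[i] = (long long)rand() * rand();

    qsort(buf, n, sizeof *buf, cmp_ll);
    printf("min=%lld max=%lld\n", buf[0], buf[n - 1]);
    free(buf);
    return 0;
}
```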