Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size usually exceeds the limits of the software used to analyze it.

A large amount of data. There is no exact number that defines "large", because what counts as large depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application that clones hard drives, 5 TB might be. A specific threshold is also unnecessary, since this tag is for questions about problems caused by too much data, however much that turns out to be.

2088 questions
0
votes
1 answer

How to add wait-time to complete a request and get the response in Rest-Assured

I am trying to do an upload via API. When the data is small, e.g. 100 rows, the upload works fine and I get the response as expected, but when the upload is large, e.g. 1M rows, the test fails with a timeout exception. How can I handle this? Is…
Mano Kugan
  • 207
  • 6
  • 23
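A common way to address this (an assumption about the setup, not taken from the question) is to raise the timeouts of the Apache HttpClient that Rest-Assured uses underneath. A minimal sketch, with illustrative timeout values:

```java
import io.restassured.RestAssured;
import io.restassured.config.HttpClientConfig;
import io.restassured.config.RestAssuredConfig;

public class UploadTimeoutConfig {
    public static void apply() {
        // Raise the underlying Apache HttpClient timeouts so a large
        // upload has time to finish; the values below are illustrative.
        RestAssured.config = RestAssuredConfig.config().httpClient(
                HttpClientConfig.httpClientConfig()
                        .setParam("http.connection.timeout", 60_000)   // connect: 60 s
                        .setParam("http.socket.timeout", 600_000));    // read: 10 min
    }
}
```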
0
votes
1 answer

How to optimize very large array construction with numpy or scipy

I am processing very large data sets in 64-bit Python and need some help optimizing my interpolation code. I am used to using numpy to avoid loops, but here there are 2 loops that I can't find a way to avoid. The main problem is also that the size…
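The question's code is cut off, so as a generic illustration only: nested Python loops that fill an array pairwise can usually be replaced by a single broadcast expression, which is where most of the speedup in problems like this comes from.

```python
import numpy as np

x = np.random.rand(5000)
y = np.random.rand(5000)

# Loop version (slow): two nested Python loops filling d element by element.
# d = np.empty((x.size, y.size))
# for i in range(x.size):
#     for j in range(y.size):
#         d[i, j] = abs(x[i] - y[j])

# Broadcast version: one vectorised expression, no Python-level loops.
d = np.abs(x[:, None] - y[None, :])
```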
0
votes
3 answers

Which delete statement is better for deleting millions of rows

I have a table which contains millions of rows. I want to delete all the data which is over a week old, based on the value of column last_updated. So here are my two queries, Approach 1: Delete from A where to_date(last_updated,'yyyy-mm-dd')<…
Shweta
  • 219
  • 5
  • 18
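Beyond choosing between the two (truncated) queries, a pattern often recommended for deletes of this size is to work in batches so undo and locking stay bounded, and to compare against the raw column rather than wrapping it in TO_DATE, which defeats any index. A sketch, assuming Oracle (matching the question's to_date syntax) and a text column in 'yyyy-mm-dd' form:

```sql
-- Delete rows older than a week, 50,000 at a time (batch size illustrative).
-- Comparing the raw column to a precomputed string keeps an index usable,
-- since 'yyyy-mm-dd' strings sort the same way as the dates they encode.
DELETE FROM A
 WHERE last_updated < TO_CHAR(SYSDATE - 7, 'yyyy-mm-dd')
   AND ROWNUM <= 50000;
COMMIT;
-- Re-run until the statement deletes zero rows.
```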
0
votes
1 answer

Scikit Learn implementation of DBSCAN for 0.7 million data points with 2 columns (Lat and Long) consumes 128GB+ RAM. How to fix this memory issue?

We are facing memory issues while running scikit-learn's DBSCAN on 0.7 million data points with 2 columns (latitude and longitude). We also tried changing epsilon values to small numbers and reducing the minimum number of required points for…
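One commonly suggested mitigation (an assumption about the setup, not from the question) is to feed DBSCAN radian coordinates with the haversine metric and a ball tree, which avoids materialising a dense distance matrix. A sketch on synthetic points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the real 0.7M-point table.
rng = np.random.default_rng(0)
latlon = rng.uniform([-90, -180], [90, 180], size=(100_000, 2))

coords = np.radians(latlon)           # haversine expects radians
km_per_radian = 6371.0                # Earth's mean radius
db = DBSCAN(eps=1.5 / km_per_radian,  # ~1.5 km neighbourhood (illustrative)
            min_samples=10,
            algorithm='ball_tree',
            metric='haversine').fit(coords)
labels = db.labels_                   # -1 marks noise points
```

If memory still blows up, the usual next steps are clustering a sample, or deduplicating near-identical coordinates and passing the counts as sample_weight.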
0
votes
0 answers

Large dataset: None of the configured nodes are available for large dataset

There are a lot of questions about this error, but none for this condition. I am running Elasticsearch 5.4.1 with a Java client (1.8) which uses the API to make Elasticsearch calls. It is on a Mac. I have a large number of documents which I have to…
user2689782
  • 747
  • 14
  • 31
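The question is truncated, but with a 5.x transport client this error most often means a cluster.name mismatch or connecting to the HTTP port (9200) instead of the transport port (9300). A minimal sketch of the usual client setup, with illustrative values:

```java
import java.net.InetAddress;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class EsClientFactory {
    public static TransportClient create() throws Exception {
        Settings settings = Settings.builder()
                .put("cluster.name", "my-cluster")    // must match the server's cluster.name
                .put("client.transport.sniff", false) // sniffing can drop nodes it cannot reach
                .build();
        return new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("127.0.0.1"), 9300)); // transport port, not 9200
    }
}
```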
0
votes
1 answer

Selecting Distinct Items within Array using PowerShell and Linq

I have been banging my head on this problem for a few hours. I have a multi-dimensional array and I need to select the unique items based on two "columns". Is there an efficient way, in .NET or otherwise, to do this and achieve the desired output? The…
BDubs
  • 73
  • 1
  • 14
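The excerpt stops before the data layout, but for objects with named properties PowerShell can do this without dropping into Linq: Sort-Object -Unique treats the listed properties as the distinctness key. A sketch on hypothetical rows:

```powershell
# Hypothetical rows with duplicate (Name, Mac) combinations.
$rows = @(
    [pscustomobject]@{ Name = 'PC1'; Mac = 'AA:BB'; Seen = '2019-01-01' }
    [pscustomobject]@{ Name = 'PC1'; Mac = 'AA:BB'; Seen = '2019-02-01' }
    [pscustomobject]@{ Name = 'PC2'; Mac = 'CC:DD'; Seen = '2019-01-15' }
)

# Distinct by two "columns" at once; only the first row per key survives.
$unique = $rows | Sort-Object -Property Name, Mac -Unique
$unique | Format-Table
```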
0
votes
2 answers

Read large CSV in PowerShell, parse multiple columns for unique values, save results based on oldest value in column

I have a large 10-million-row file (currently CSV). I need to read through the file and remove duplicate items based on multiple columns. An example line of data would look something like: ComputerName, IPAddress, MacAddress, CurrentDate,…
BDubs
  • 73
  • 1
  • 14
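A grouping approach works here, using the column names from the question's example line; keeping the oldest row per key is handled by the sort inside each group. For 10 million rows, note that a hashtable keyed on the duplicate columns streams with far less memory than Group-Object. A sketch with illustrative file names:

```powershell
# Keep, for each (ComputerName, MacAddress) pair, the row whose
# CurrentDate is oldest.
Import-Csv .\input.csv |
    Group-Object -Property ComputerName, MacAddress |
    ForEach-Object {
        $_.Group | Sort-Object { [datetime]$_.CurrentDate } | Select-Object -First 1
    } |
    Export-Csv .\deduped.csv -NoTypeInformation
```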
0
votes
1 answer

Cluster analysis of large dataset containing only categorical variables

I have been given the task of clustering our customers based on products they bought together. My data contains 500,000 rows, one per customer, and 8,000 variables (product ids). Each variable is a one-hot encoded indicator that shows whether a customer…
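At 500,000 × 8,000 a dense matrix is around 4 billion cells, so whatever clusterer is used, the one-hot data has to stay sparse. k-modes is the textbook choice for purely categorical data; a cheaper sketch that often works in practice is MiniBatchKMeans on a sparse matrix (the synthetic data below stands in for the real table):

```python
import scipy.sparse as sparse
from sklearn.cluster import MiniBatchKMeans

# Sparse stand-in for the 500k x 8k one-hot purchase matrix.
X = sparse.random(500_000, 8_000, density=0.001, format='csr', random_state=0)

km = MiniBatchKMeans(n_clusters=20,      # number of segments, illustrative
                     batch_size=10_000,  # rows per mini-batch
                     random_state=0)
labels = km.fit_predict(X)               # cluster id per customer
```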
0
votes
0 answers

Cannot import CSV file into R due to size

I'm trying to import a large (14+ GB) file into RStudio for use in a project, but I've run into some roadblocks. I installed the 'ff' package to make this easier, but I keep hitting bugs that I don't know how to fix. This is the code that…
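The asker's code is cut off, but a route that often sidesteps ff entirely is data.table::fread, which reads multi-gigabyte CSVs quickly and can load only the needed columns. A sketch with illustrative names:

```r
library(data.table)

# Read only the columns the project needs; 'select' keeps memory
# proportional to those columns rather than the full 14+ GB file.
dt <- fread("big_file.csv", select = c("id", "value", "timestamp"))
```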
0
votes
1 answer

Move/copy millions of images from macOS to external drive to Ubuntu server

I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500GB of storage. I created them on my MacBook Pro but want to get them to our DGX1 (GPU cluster) somehow. I thought it would be faster to…
SciGuy
  • 115
  • 4
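The excerpt ends mid-thought, but for tens of millions of small files the usual advice is rsync (resumable, skips already-copied files) or a single tar stream (avoids per-file overhead). Hosts and paths below are illustrative:

```sh
# Resumable copy straight to the server; -a preserves metadata.
rsync -a --progress /Volumes/dataset/ user@dgx1:/data/dataset/

# Alternative: one tar stream over ssh, faster for millions of tiny files.
tar -cf - -C /Volumes dataset | ssh user@dgx1 'tar -xf - -C /data'
```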
0
votes
2 answers

How to apply MINUS efficiently in a MySQL query for tables with large data

I have 2 tables as follows: CREATE TABLE IF NOT EXISTS `nl_members` ( `member_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `member_email` varchar(100) COLLATE utf8_unicode_ci NOT NULL, `member_confirmation_code` varchar(35) COLLATE…
anjan
  • 3,147
  • 6
  • 26
  • 31
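MySQL (before 8.0) has no MINUS/EXCEPT, so the stock replacement is an anti-join. The second table is cut off in the excerpt, so its name below is hypothetical:

```sql
-- Rows of nl_members with no match in the (hypothetical) excluded table.
-- An index on both member_email columns keeps this fast on large tables.
SELECT m.member_id, m.member_email
FROM   nl_members AS m
LEFT JOIN excluded_members AS x
       ON x.member_email = m.member_email
WHERE  x.member_email IS NULL;
```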
0
votes
1 answer

How to couple datapoints in the same table based on timestamps?

I have a big MySQL table that contains values for all kinds of data (each with a different data_id) and a timestamp (unix timestamp in ms). I am trying to build a (real-time) plotter for all this data and I want to be able to plot any data on the…
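The excerpt is truncated, but coupling rows with different data_ids usually comes down to a self-join constrained to a tolerance window, since millisecond timestamps rarely match exactly. A sketch with illustrative names and a 500 ms window:

```sql
-- Pair each value of series 1 with values of series 2 recorded
-- within 500 ms of it; table and column names are illustrative.
SELECT a.timestamp, a.value AS value_1, b.value AS value_2
FROM   measurements AS a
JOIN   measurements AS b
       ON  b.data_id = 2
       AND b.timestamp BETWEEN a.timestamp - 500 AND a.timestamp + 500
WHERE  a.data_id = 1;
```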
0
votes
0 answers

How to filter stocks in a large dataset - python

I have a large dataset of over 20,000 stocks from 1964-2018. (It's CRSP data I got from my university). I now want to apply the following filter technique according to Amihud (2002): 1. include all stocks that have a price greater than $5 at end of…
Sebastian
  • 13
  • 4
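The filter's first rule (price above $5 at the end of the prior year) maps to a groupby-then-mask in pandas. The frame below is a tiny hypothetical stand-in for the CRSP panel, with guessed column names:

```python
import pandas as pd

# Tiny stand-in for the CRSP panel; permno identifies a stock.
df = pd.DataFrame({
    'permno': [1, 1, 2, 2],
    'date':   pd.to_datetime(['1999-12-31', '2000-06-30',
                              '1999-12-31', '2000-06-30']),
    'prc':    [4.50, 6.00, 10.00, 11.00],
})

# Stocks priced above $5 at the end of 1999 ...
eoy = df[df['date'] == '1999-12-31'].set_index('permno')['prc']
keep = eoy[eoy > 5].index

# ... are the only ones eligible for the year 2000.
filtered = df[(df['date'].dt.year == 2000) & df['permno'].isin(keep)]
```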
0
votes
0 answers

Subset of features on external memory

I have a large file that I'm not able to load into memory, so I'm using a local file with xgb.DMatrix. But I'd like to use only a subset of the features. The xgboost documentation says that the colset argument on slice is "currently not used", and there is no…
user5029763
  • 1,903
  • 1
  • 15
  • 23
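Since slice cannot drop columns, the workaround is usually to reduce the source file itself before building the DMatrix, for example by streaming it through pandas in chunks. File and column names below are illustrative:

```python
import pandas as pd
import xgboost as xgb

wanted = ['f1', 'f7', 'f42', 'label']     # the feature subset + target

# Stream the big file and keep only the wanted columns on disk.
first = True
for chunk in pd.read_csv('big.csv', usecols=wanted, chunksize=1_000_000):
    chunk.to_csv('subset.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False

# The reduced file may now fit in memory as an ordinary DMatrix.
df = pd.read_csv('subset.csv')
dtrain = xgb.DMatrix(df.drop(columns='label'), label=df['label'])
```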
0
votes
3 answers

How to sort a very large array in C

I want to sort on the order of four million long longs in C. Normally I would just malloc() a buffer to use as an array and call qsort(), but four million × 8 bytes is one huge chunk of contiguous memory. What's the easiest way to do this? I rate…
hippietrail
  • 15,848
  • 18
  • 99
  • 158
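For context on the scale: four million long longs is only 32 MB, which malloc() handles comfortably on modern systems, so plain qsort() is usually the whole answer. A sketch (the data is synthetic; the comparator avoids the overflow of returning x - y):

```c
#include <stdio.h>
#include <stdlib.h>

/* Compare two long longs without the overflow risk of returning x - y. */
static int cmp_ll(const void *a, const void *b)
{
    long long x = *(const long long *)a;
    long long y = *(const long long *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    size_t n = 4000000;                       /* 4M * 8 bytes = 32 MB */
    long long *buf = malloc(n * sizeof *buf);
    if (!buf) { perror("malloc"); return 1; }

    for (size_t i = 0; i < n; i++)            /* synthetic data */
        buf[i] = (long long)rand() * rand();

    qsort(buf, n, sizeof *buf, cmp_ll);
    printf("min=%lld max=%lld\n", buf[0], buf[n - 1]);
    free(buf);
    return 0;
}
```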