Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application meant to clone hard drives, 5 TB might be. A specific number is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
0 votes, 0 answers

Hive join optimization and resource allocation

My table (MyTable, ~365 GB) contains two years of customer-behavior data. It is partitioned by day and clustered by customer_id into 64 buckets. On average, one day contains 8 million entries. My task is to retrieve customers per day (~512 MB),…
Alex • 607 • 5 • 10
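The excerpt above is truncated, but for a day-partitioned Hive table the usual first step is to filter on the partition column so only that day's partition is scanned. A minimal PySpark sketch of that idea, assuming a table named my_table partitioned by day; the table name, date literal, and output path are placeholders, not taken from the question:

    from pyspark.sql import SparkSession

    # Sketch only: pull one day's customers from a day-partitioned Hive table.
    # "my_table", the date value, and the output path are placeholders.
    spark = (SparkSession.builder
             .appName("daily-customer-extract")
             .enableHiveSupport()
             .getOrCreate())

    one_day = spark.table("my_table").where("day = '2019-06-01'")   # partition pruning
    one_day.select("customer_id").distinct().write.parquet("/tmp/customers_2019-06-01")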
0 votes, 2 answers

What are the different ways to access really large csv files?

I had been working on a project where I had to read and process very large csv files with millions of rows as fast as possible. I came across the link: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author has…
Phoenix • 373 • 1 • 4 • 20
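For reference, a hedged sketch of the simplest baseline: reading the CSV one row at a time with the standard csv module, so memory use stays flat regardless of file size (the file path and the per-row callback are placeholders):

    import csv

    def process_row(row):
        pass                                   # placeholder for the real per-row work

    def process_file(path):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)              # read the header once
            for row in reader:                 # stream: one row in memory at a time
                process_row(row)

    process_file("big.csv")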
0 votes, 0 answers

What is the efficient way to calculate Euclidean distance between elements in an n-dimensional large ArrayList in Java?

I want to calculate Euclidean distances between each pair of elements in a two-dimensional array list in Java. This two-dimensional array list consists of 40000 records in 40 dimensions. I encountered a memory problem: Exception in…
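The excerpt is cut off, but a full 40000 x 40000 distance matrix in double precision is roughly 12.8 GB, which by itself explains the memory error. A hedged sketch, in Python/NumPy rather than Java, of the block-wise alternative that keeps only one slab of the matrix in memory at a time (the block size and dtype are arbitrary choices):

    import numpy as np

    def pairwise_distances_blocked(points, block=1000):
        # Yields the distance matrix one block of rows at a time, using
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so no (n, n, d) array is built.
        points = np.asarray(points, dtype=np.float32)
        sq_norms = (points ** 2).sum(axis=1)
        for start in range(0, len(points), block):
            chunk = points[start:start + block]
            d2 = (sq_norms[start:start + block, None]
                  + sq_norms[None, :]
                  - 2.0 * chunk @ points.T)
            np.maximum(d2, 0.0, out=d2)          # clamp tiny negative round-off
            yield np.sqrt(d2)                    # shape (block, n)

    data = np.random.rand(40000, 40).astype(np.float32)
    for dist_block in pairwise_distances_blocked(data):
        pass                                     # e.g. write each block to disk here

The same blocking idea carries over to Java: compute and flush one strip of the distance matrix at a time instead of allocating the full n x n structure.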
0 votes, 0 answers

Laravel/PHP - query on large data and problem with error 500

I filter job offers on data from a database. As long as the table contained up to 10,000 records, everything worked great. $searchQuery = \App\JobOffers::searchOffer($search_text, $search_location, $job_function, $job_type, $job_experience,…
0 votes, 4 answers

How can I find the largest number in a very large text file (~150 GB)?

I have a text file that has around 100000000 lines, each of the following type: string num1 num2 num3 ... num500 string num1 num2 num3 ... num40 I want to find the largest number present in this file. My current code reads each line, splits it by…
user6873419
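For reference, the usual answer here is a single streaming pass that keeps only the running maximum, so the 150 GB file never has to fit in memory. A minimal sketch, assuming each line starts with a string token followed by the numbers (the file name is a placeholder):

    def largest_number(path):
        largest = float("-inf")
        with open(path) as f:
            for line in f:                       # one line in memory at a time
                tokens = line.split()
                if len(tokens) < 2:
                    continue                     # skip blank or malformed lines
                line_max = max(float(t) for t in tokens[1:])   # skip the leading string
                if line_max > largest:
                    largest = line_max
        return largest

    print(largest_number("numbers.txt"))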
0 votes, 1 answer

How to optimize SQL Server table with 320 million + rows with only varchar(max) data types

I have a table with 320 million+ rows and 34 columns, all of varchar(max) datatype, and with no indexing. I am finding it extremely time consuming to summarize the whole table. Can anyone suggest best way to optimize this considering the following…
0 votes, 1 answer

AutocompleteTextView with 1000 entries in ArrayList not working

Working on an Android app that has a form with an AutocompleteTextView and a custom adapter, backed by an ArrayList of around 1K entries. But with 1K entries it is not working. With around 400 entries it works, but filtering is slow. What can…
alka aswal • 511 • 1 • 7 • 22
0 votes, 0 answers

Selecting only the most recent row from duplicates (efficiency important, large table)

SELECT * FROM epc e INNER JOIN epc max_ ON e.ADDRESS1 = max_.ADDRESS1 AND e.POSTCODE = max_.POSTCODE AND e.INSPECTION_DATE < max_.INSPECTION_DATE; I have a table with 13 million rows. Without the AND e.INSPECTION_DATE < max_.INSPECTION_DATE…
Scoop • 67 • 6
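The self-join in the excerpt pairs every row with every later inspection of the same address, which is what gets expensive at 13 million rows; common alternatives are ROW_NUMBER() OVER (PARTITION BY ADDRESS1, POSTCODE ORDER BY INSPECTION_DATE DESC) or joining against a grouped MAX(INSPECTION_DATE). The same keep-the-latest-row-per-key idea, as a hedged pandas sketch that assumes the table has been exported to a CSV (the export and file name are assumptions; the column names come from the excerpt):

    import pandas as pd

    # Sketch only: keep the newest inspection per (ADDRESS1, POSTCODE).
    epc = pd.read_csv("epc.csv", parse_dates=["INSPECTION_DATE"])
    latest = (epc.sort_values("INSPECTION_DATE")
                 .drop_duplicates(subset=["ADDRESS1", "POSTCODE"], keep="last"))
    print(len(latest))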
0 votes, 1 answer

Multiprocessing on a large dataset in python (Finding Duplicates)

I have a JSON file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way. My problem is that it runs in 8 minutes for a 12 GB dataset. But the…
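The question does not show the current code, but a common out-of-core pattern for this is to keep only a fixed-size hash of each record instead of the record itself. A minimal sketch, assuming the input is line-delimited with one JSON record per line (that assumption and the file names are mine, not from the question):

    import hashlib

    def dedupe(src="records.jsonl", dst="deduped.jsonl"):
        seen = set()                               # ~20 bytes of digest per distinct line
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            for line in fin:
                digest = hashlib.sha1(line).digest()
                if digest not in seen:
                    seen.add(digest)
                    fout.write(line)

    dedupe()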
0 votes, 3 answers

Normalizing large amount of data for database

I have a large amount of data I need to store in a database. The data is: for every day of the month, there are 5 events. The 5 events are further split into 2 different sub-events which need to be kept separate, meaning for every day of the month,…
Al. • 2,285 • 2 • 22 • 30
0 votes, 2 answers

Using Spark to process dataset larger than the cluster can fit

I'm on a Spark 2.3 cluster of 5 nodes, each with 12 GB of available memory, and am trying to work with a Parquet dataset of approximately 130 GB, on top of which I created a partitioned external Hive table. Let's say I would like to know the number of records…
Roman • 238 • 1 • 14
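A simple count over a dataset larger than cluster memory is normally fine, because Spark streams partitions through the executors rather than materializing the whole table; trouble usually starts when the full dataset is cached or shuffled. A minimal PySpark sketch under that assumption (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("count-partitioned-table")
             .enableHiveSupport()
             .getOrCreate())

    # The 130 GB is read partition by partition; nothing here asks Spark to hold
    # the whole table in memory (no .cache()/.persist() on the full dataset).
    total_rows = spark.table("my_external_table").count()
    print(total_rows)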
0 votes, 0 answers

Convert 10^9 x 2 uint32 H5py dataset to list of tuples

I have data in a long HDF5 file, and the class I would like to use (igraph.Graph) seems to insist on a list of tuples in its init function. I have tried for loops, list(dataset), read_direct(dataset).tolist(), and [mylist.append(tuple(x) for x in…
Zach Boyd • 419 • 1 • 5 • 23
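A hedged sketch of one way to do this with h5py: slice the dataset in blocks so only one NumPy slab is converted at a time (the file name, dataset name, and block size are placeholders). Note that for 10^9 rows the resulting list of Python tuples may itself exhaust memory, so feeding the consumer in chunks may be the more realistic option:

    import h5py

    edges = []
    with h5py.File("graph.h5", "r") as f:
        dset = f["edges"]                             # assumed shape (N, 2), dtype uint32
        for start in range(0, dset.shape[0], 10_000_000):
            block = dset[start:start + 10_000_000]    # one NumPy slab in memory
            edges.extend(map(tuple, block))           # convert the slab to (u, v) tuples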
0 votes, 3 answers

How to speed up a LINQ query on a CSV-imported list

I have been tasked with matching 1.7 million records with some results which have been passed to me in a CSV file. A little bit of background to the code below: I have two lists... Certs, which contains 5 properties, with ID being the equivalent of a…
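The excerpt is truncated, but slow matching of 1.7 million records usually comes from scanning one list for every element of the other (O(n*m)); the standard fix is to index one side by its key first, which in C# would be a Dictionary or LINQ's ToLookup/Join. A hedged Python sketch of the same idea, with placeholder field names:

    # Toy data standing in for the 1.7M-record lists; "id" and "cert_id" are placeholders.
    certs = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    results = [{"cert_id": 2, "score": 0.9}, {"cert_id": 3, "score": 0.4}]

    certs_by_id = {c["id"]: c for c in certs}        # one pass to build the index
    matched = [(certs_by_id[r["cert_id"]], r)        # O(1) lookup per result row
               for r in results
               if r["cert_id"] in certs_by_id]
    print(matched)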
0 votes, 3 answers

How do I split a combo list in a large text file?

My problem is that I have a very large database of emails and passwords and I need to send it to a MySQL database. The .txt file format is something like…
Lectro • 25 • 5
0 votes, 1 answer

Warnings of "NAs introduced by coercion" in fread function

I am trying to use fread() to read in a table of 2 columns (x, y) and ~300 million rows (62 GB) and plot the x and y in a scatter plot. I am using "fread" and it works fine if I only use a small portion of the data, like 30000 rows. But if I run it…
Phoenix Mu • 648 • 7 • 12