Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large", because "large" depends on the situation: on the web, 1 MB or 2 MB might be large, while for an application meant to clone hard drives, 5 TB might be. A specific number is also unnecessary, since this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
0 votes, 0 answers

Hive join optimization and resource allocation

My table (MyTable, ~365 GB) contains two years of customer-behavior data. It is partitioned by day and clustered by customer_id into 64 buckets. On average, one day contains 8 million entries. My task is to retrieve customers per day (~512 MB),…
Alex • 607 • 5 • 10
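The excerpt above is truncated, but for a day-partitioned Hive table the usual first step is to filter on the partition column so only that day's partition is scanned. A minimal PySpark sketch of that idea, assuming a table named my_table partitioned by day; the table name, date literal, and output path are placeholders, not taken from the question:

    from pyspark.sql import SparkSession

    # Sketch only: pull one day's customers from a day-partitioned Hive table.
    # "my_table", the date value, and the output path are placeholders.
    spark = (SparkSession.builder
             .appName("daily-customer-extract")
             .enableHiveSupport()
             .getOrCreate())

    one_day = spark.table("my_table").where("day = '2019-06-01'")   # partition pruning
    one_day.select("customer_id").distinct().write.parquet("/tmp/customers_2019-06-01")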
0 votes, 2 answers

What are the different ways to access really large csv files?

I had been working on a project where I had to read and process very large csv files with millions of rows as fast as possible. I came across the link: https://nelsonslog.wordpress.com/2015/02/26/python-csv-benchmarks/ where the author has…
Phoenix • 373 • 1 • 4 • 20
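For reference, a hedged sketch of the simplest baseline: reading the CSV one row at a time with the standard csv module, so memory use stays flat regardless of file size (the file path and the per-row callback are placeholders):

    import csv

    def process_row(row):
        pass                                   # placeholder for the real per-row work

    def process_file(path):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)              # read the header once
            for row in reader:                 # stream: one row in memory at a time
                process_row(row)

    process_file("big.csv")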
0 votes, 0 answers

What is the efficient way to calculate Euclidean distance between elements in an n-dimensional large ArrayList in Java?

I want to calculate Euclidean distances between each pair of elements in a two-dimensional array list in Java. This two-dimensional array list consists of 40000 records in 40 dimensions. I encountered a memory problem: Exception in…
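The excerpt is cut off, but a full 40000 x 40000 distance matrix in double precision is roughly 12.8 GB, which by itself explains the memory error. A hedged sketch, in Python/NumPy rather than Java, of the block-wise alternative that keeps only one slab of the matrix in memory at a time (the block size and dtype are arbitrary choices):

    import numpy as np

    def pairwise_distances_blocked(points, block=1000):
        # Yields the distance matrix one block of rows at a time, using
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b so no (n, n, d) array is built.
        points = np.asarray(points, dtype=np.float32)
        sq_norms = (points ** 2).sum(axis=1)
        for start in range(0, len(points), block):
            chunk = points[start:start + block]
            d2 = (sq_norms[start:start + block, None]
                  + sq_norms[None, :]
                  - 2.0 * chunk @ points.T)
            np.maximum(d2, 0.0, out=d2)          # clamp tiny negative round-off
            yield np.sqrt(d2)                    # shape (block, n)

    data = np.random.rand(40000, 40).astype(np.float32)
    for dist_block in pairwise_distances_blocked(data):
        pass                                     # e.g. write each block to disk here

The same blocking idea carries over to Java: compute and flush one strip of the distance matrix at a time instead of allocating the full n x n structure.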
0 votes, 0 answers

Laravel/PHP - query on large data and problem with error 500

I filter job offers on data from a database. As long as the table contained up to 10,000 records, everything worked great. $searchQuery = \App\JobOffers::searchOffer($search_text, $search_location, $job_function, $job_type, $job_experience,…
0 votes, 4 answers

How can I find the largest number in a very large text file (~150 GB)?

I have a text file that has around 100000000 lines, each of the following type: string num1 num2 num3 ... num500 string num1 num2 num3 ... num40 I want to find the largest number present in this file. My current code reads each line, splits it by…
user6873419
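For reference, the usual answer here is a single streaming pass that keeps only the running maximum, so the 150 GB file never has to fit in memory. A minimal sketch, assuming each line starts with a string token followed by the numbers (the file name is a placeholder):

    def largest_number(path):
        largest = float("-inf")
        with open(path) as f:
            for line in f:                       # one line in memory at a time
                tokens = line.split()
                if len(tokens) < 2:
                    continue                     # skip blank or malformed lines
                line_max = max(float(t) for t in tokens[1:])   # skip the leading string
                if line_max > largest:
                    largest = line_max
        return largest

    print(largest_number("numbers.txt"))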
0 votes, 1 answer

How to optimize SQL Server table with 320 million + rows with only varchar(max) data types

I have a table with 320 million+ rows and 34 columns, all of varchar(max) datatype, and with no indexing. I am finding it extremely time consuming to summarize the whole table. Can anyone suggest best way to optimize this considering the following…
0 votes, 1 answer

AutocompleteTextView with 1000 entries in ArrayList not working

Working on an Android app that has a form with an AutocompleteTextView and a custom adapter, backed by an ArrayList of around 1K entries. But with 1K entries it is not working. With around 400 entries it works, but filtering is slow. What can…
alka aswal • 511 • 1 • 7 • 22
0 votes, 0 answers

Selecting only the most recent row from duplicates (efficiency important, large table)

SELECT * FROM epc e INNER JOIN epc max_ ON e.ADDRESS1 = max_.ADDRESS1 AND e.POSTCODE = max_.POSTCODE AND e.INSPECTION_DATE < max_.INSPECTION_DATE; I have a table with 13 million rows. Without the AND e.INSPECTION_DATE < max_.INSPECTION_DATE…
Scoop • 67 • 6
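The self-join in the excerpt pairs every row with every later inspection of the same address, which is what gets expensive at 13 million rows; common alternatives are ROW_NUMBER() OVER (PARTITION BY ADDRESS1, POSTCODE ORDER BY INSPECTION_DATE DESC) or joining against a grouped MAX(INSPECTION_DATE). The same keep-the-latest-row-per-key idea, as a hedged pandas sketch that assumes the table has been exported to a CSV (the export and file name are assumptions; the column names come from the excerpt):

    import pandas as pd

    # Sketch only: keep the newest inspection per (ADDRESS1, POSTCODE).
    epc = pd.read_csv("epc.csv", parse_dates=["INSPECTION_DATE"])
    latest = (epc.sort_values("INSPECTION_DATE")
                 .drop_duplicates(subset=["ADDRESS1", "POSTCODE"], keep="last"))
    print(len(latest))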
0 votes, 1 answer

Multiprocessing on a large dataset in python (Finding Duplicates)

I have a JSON file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way. My problem is that it runs in 8 minutes for a 12 GB dataset. But the…
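The question does not show the current code, but a common out-of-core pattern for this is to keep only a fixed-size hash of each record instead of the record itself. A minimal sketch, assuming the input is line-delimited with one JSON record per line (that assumption and the file names are mine, not from the question):

    import hashlib

    def dedupe(src="records.jsonl", dst="deduped.jsonl"):
        seen = set()                               # ~20 bytes of digest per distinct line
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            for line in fin:
                digest = hashlib.sha1(line).digest()
                if digest not in seen:
                    seen.add(digest)
                    fout.write(line)

    dedupe()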
0 votes, 3 answers

Normalizing large amount of data for database

I have a large amount of data I need to store in a database. The data is: for every day of the month, there are 5 events. The 5 events are further split into 2 different sub-events which need to be kept separate, meaning for every day of the month,…
Al. • 2,285 • 2 • 22 • 30
0 votes, 2 answers

Using Spark to process dataset larger than the cluster can fit

I'm on a Spark 2.3 cluster of 5 nodes, each with 12 GB of available memory, and am trying to work with a Parquet dataset of approximately 130 GB, on top of which I created a partitioned external Hive table. Let's say I would like to know the number of records…
Roman • 238 • 1 • 14
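A simple count over a dataset larger than cluster memory is normally fine, because Spark streams partitions through the executors rather than materializing the whole table; trouble usually starts when the full dataset is cached or shuffled. A minimal PySpark sketch under that assumption (the table name is a placeholder):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("count-partitioned-table")
             .enableHiveSupport()
             .getOrCreate())

    # The 130 GB is read partition by partition; nothing here asks Spark to hold
    # the whole table in memory (no .cache()/.persist() on the full dataset).
    total_rows = spark.table("my_external_table").count()
    print(total_rows)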
0 votes, 0 answers

Convert 10^9 x 2 uint32 H5py dataset to list of tuples

I have data in a long HDF5 file, and the class I would like to use (igraph.Graph) seems to insist on a list of tuples in its init function. I have tried for loops, list(dataset), read_direct(dataset).tolist(), and [mylist.append(tuple(x) for x in…
Zach Boyd • 419 • 1 • 5 • 23
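A hedged sketch of one way to do this with h5py: slice the dataset in blocks so only one NumPy slab is converted at a time (the file name, dataset name, and block size are placeholders). Note that for 10^9 rows the resulting list of Python tuples may itself exhaust memory, so feeding the consumer in chunks may be the more realistic option:

    import h5py

    edges = []
    with h5py.File("graph.h5", "r") as f:
        dset = f["edges"]                             # assumed shape (N, 2), dtype uint32
        for start in range(0, dset.shape[0], 10_000_000):
            block = dset[start:start + 10_000_000]    # one NumPy slab in memory
            edges.extend(map(tuple, block))           # convert the slab to (u, v) tuples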
0 votes, 3 answers

How to speed up a LINQ query on a CSV-imported list

I have been tasked with matching 1.7 million records with some results which have been passed to me in a CSV file. A little bit of background to the code below: I have two lists... Certs, which contains 5 properties, with ID being the equivalent of a…
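The excerpt is truncated, but slow matching of 1.7 million records usually comes from scanning one list for every element of the other (O(n*m)); the standard fix is to index one side by its key first, which in C# would be a Dictionary or LINQ's ToLookup/Join. A hedged Python sketch of the same idea, with placeholder field names:

    # Toy data standing in for the 1.7M-record lists; "id" and "cert_id" are placeholders.
    certs = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    results = [{"cert_id": 2, "score": 0.9}, {"cert_id": 3, "score": 0.4}]

    certs_by_id = {c["id"]: c for c in certs}        # one pass to build the index
    matched = [(certs_by_id[r["cert_id"]], r)        # O(1) lookup per result row
               for r in results
               if r["cert_id"] in certs_by_id]
    print(matched)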
0 votes, 3 answers

How do I split a combo list in a large text file?

My problem is that I have a very large database of emails and passwords and I need to send it to a MySQL database. The .txt file format is something like…
Lectro • 25 • 5
0 votes, 1 answer

Warnings of "NAs introduced by coercion" in fread function

I am trying to use fread() to read in a table of 2 columns (x, y) and ~300 million rows (62 GB) and plot the x and y in a scatter plot. I am using "fread" and it works fine if I only use a small portion of the data, like 30000 rows. But if I run it…
Phoenix Mu • 648 • 7 • 12