Questions tagged [large-data]

Large data is data that is difficult to process and manage because its size is usually beyond the limits of the software being used to perform the analysis.

A large amount of data. There is no exact number that defines "large"; what counts as large depends on the situation (on the web, 1 MB or 2 MB might be large, while for an application that clones hard drives, 5 TB might be large). A specific number is also unnecessary, because this tag is meant for questions about problems caused by too much data, regardless of how much that is.

2088 questions
0
votes
1 answer

Excel VBA: How do I compare values in two large datasets with a mass defect/error?

I am doing an internship in analytical chemistry and wish to compare large datasets (two columns of up to 15,000 rows). The main idea is that I have two columns of mass data (with 4 decimals) in which a macro should look for…
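One way to sketch the matching step, shown here in Python/pandas rather than VBA, with made-up mass values and an assumed tolerance of 0.005:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two mass columns (4 decimals), up to 15,000 rows each.
rng = np.random.default_rng(0)
a = pd.DataFrame({"mass_a": np.round(rng.uniform(100, 1000, 15000), 4)})
b = pd.DataFrame({"mass_b": np.round(rng.uniform(100, 1000, 15000), 4)})

tol = 0.005  # assumed allowed mass error

# Both sides must be sorted on their join keys for merge_asof.
a = a.sort_values("mass_a").reset_index(drop=True)
b = b.sort_values("mass_b").reset_index(drop=True)

# Pair each mass in `a` with the nearest mass in `b`; pairs farther apart
# than `tol` stay unmatched (NaN in mass_b).
matched = pd.merge_asof(a, b, left_on="mass_a", right_on="mass_b",
                        direction="nearest", tolerance=tol)
hits = matched.dropna(subset=["mass_b"])
print(len(hits), "masses matched within", tol)
```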
0
votes
1 answer

X axis invisible for large dataset

I am new to Python and I am trying to plot data where date and time are on the X axis. The data is the number of tweets per hour over the span of a few days. Since the data is huge, the X-axis scale becomes invisible. Below is the snippet…
Malathy
  • 63
  • 7
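A minimal matplotlib sketch of thinning the date ticks so they stay readable, using made-up hourly tweet counts:

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical data: tweet counts per hour over one week.
idx = pd.date_range("2023-01-01", periods=24 * 7, freq="H")
counts = pd.Series(range(len(idx)), index=idx)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(counts.index, counts.values)

# Let matplotlib pick a sensible number of date ticks and a compact format,
# instead of drawing one unreadable tick per data point.
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))
fig.autofmt_xdate()  # rotate the labels so they do not overlap
plt.show()
```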
0
votes
1 answer

Looking for more efficient pandas code using the BLS data set

Looking for a more effective way to prep data for K-means analysis. Using BLS (Bureau of Labor Statistics) data and trying to learn K-means, I am doing a first pass over the data and want to add two columns, percentage of change over time per in…
Joe Larson
  • 53
  • 1
  • 7
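A hedged pandas sketch of adding percent-change columns per series, using a tiny made-up BLS-style frame with hypothetical column names:

```python
import pandas as pd

# Hypothetical BLS-style data: one row per series per year.
df = pd.DataFrame({
    "series_id": ["A", "A", "A", "B", "B", "B"],
    "year":      [2019, 2020, 2021, 2019, 2020, 2021],
    "value":     [100.0, 110.0, 121.0, 50.0, 45.0, 54.0],
})

df = df.sort_values(["series_id", "year"])

# Vectorised percent change within each series, instead of looping row by row.
df["pct_change"] = df.groupby("series_id")["value"].pct_change() * 100

# Total percent change over the whole span, per series.
first = df.groupby("series_id")["value"].transform("first")
df["pct_change_total"] = (df["value"] - first) / first * 100
print(df)
```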
0
votes
0 answers

Pandas for comparing data from database (MySQL) tables and CSVs

I am looking for an efficient way of comparing data from database (MySQL) tables and CSVs. My preference is to use dataframes to compare the data and find any "missing updates" in the database. CSVs range from 40MB to 3.5GB and tables can be up to…
smoshah
  • 55
  • 1
  • 9
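One possible chunked approach, assuming a single key column named id and placeholder connection details; the CSV is streamed in chunks so only the key column of the table has to sit in memory:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string, table name and key column.
engine = create_engine("mysql+pymysql://user:password@host/dbname")

# Pull only the key column from the table; even for a large table this is
# far smaller than the full rows.
db_keys = set(pd.read_sql("SELECT id FROM target_table", engine)["id"])

missing_chunks = []
for chunk in pd.read_csv("updates.csv", chunksize=100_000):
    # Keep CSV rows whose key never made it into the database.
    missing_chunks.append(chunk[~chunk["id"].isin(db_keys)])

missing_updates = pd.concat(missing_chunks, ignore_index=True)
print(len(missing_updates), "rows present in the CSV but not in the table")
```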
0
votes
1 answer

Best way of parsing huge (10GB+) JSON files

I would like to know the best tool, IDE, or programming language for parsing data stored as a JSON file. I tried pandas in Python and ff in R, and both of them either crash with memory issues or take too long to process. Do you have experience…
BlueMountain
  • 197
  • 2
  • 17
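A streaming sketch with the ijson library, assuming the file is one large top-level JSON array of record objects; the file is never loaded into memory all at once:

```python
import ijson  # streaming JSON parser; pip install ijson

count = 0
with open("big_file.json", "rb") as f:          # placeholder file name
    for record in ijson.items(f, "item"):       # yields one object at a time
        # process `record` here instead of loading the whole file into RAM
        count += 1

print(count, "records processed")
```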
0
votes
1 answer

Is there a simpler way to merge results of describe() from multiple chunks of a DataFrame?

I am working on a large CSV file. Since I cannot import the whole CSV file into a dataframe at once due to memory limitations, I am using chunks to process the data. df = pd.read_csv(filepath, chunksize=chunksize) for chunk in df: …
newBie
  • 55
  • 1
  • 10
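A sketch of combining statistics across chunks by hand; exact medians and quartiles cannot be merged this way, so only count, mean, std, min and max are rebuilt (file path and column name are placeholders):

```python
import numpy as np
import pandas as pd

filepath = "large_file.csv"   # hypothetical path
column = "value"              # hypothetical numeric column

count, total, total_sq = 0, 0.0, 0.0
col_min, col_max = np.inf, -np.inf

for chunk in pd.read_csv(filepath, chunksize=100_000):
    col = chunk[column].dropna()
    count += len(col)
    total += col.sum()
    total_sq += (col ** 2).sum()
    col_min = min(col_min, col.min())
    col_max = max(col_max, col.max())

mean = total / count
std = np.sqrt((total_sq - count * mean ** 2) / (count - 1))  # sample std
print(pd.Series({"count": count, "mean": mean, "std": std,
                 "min": col_min, "max": col_max}))
```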
0
votes
0 answers

For large datasets in Python, how do I find the nearest location using longitude and latitude?

I have a pandas dataframe containing 500,000(!) rows (locations) and two columns: Longitude and Latitude. Now I want a third column, Nearest location, which should tell me which row/location is nearest to the 'current' row/location. I know you…
LVDW
  • 11
  • 4
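One common approach is a BallTree with the haversine metric; a sketch with randomly generated coordinates standing in for the real 500,000 rows:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Hypothetical frame with 500,000 coordinate pairs in degrees.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Latitude": rng.uniform(-90, 90, 500_000),
    "Longitude": rng.uniform(-180, 180, 500_000),
})

# The haversine metric expects radians.
coords = np.radians(df[["Latitude", "Longitude"]].to_numpy())
tree = BallTree(coords, metric="haversine")

# k=2 because the nearest neighbour of each point is the point itself.
dist, idx = tree.query(coords, k=2)
df["nearest_index"] = idx[:, 1]
df["nearest_km"] = dist[:, 1] * 6371.0  # convert radians to kilometres
```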
0
votes
2 answers

How to copy table from a database on SERVER 1 to another database on SERVER 2 using phpMyAdmin?

I have a database named db_x on server X (running WHM) and another database named db_y on server Y. I connected to server X via SSH and made some changes to the phpMyAdmin configuration to allow it to connect to db_y via phpMyAdmin on server X via…
GAURAV KUMAR JHA
  • 190
  • 1
  • 11
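Outside phpMyAdmin, the usual route is mysqldump piped into mysql; a Python wrapper sketch run from server X, where every host, user, password and table name is a placeholder:

```python
import subprocess

# Dump one table from the local db_x and stream it straight into db_y on
# the remote server, without writing an intermediate file.
dump = subprocess.Popen(
    ["mysqldump", "-h", "localhost", "-u", "user_x", "-psecret",
     "db_x", "my_table"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["mysql", "-h", "server-y.example.com", "-u", "user_y", "-psecret", "db_y"],
    stdin=dump.stdout,
    check=True,
)
dump.stdout.close()
dump.wait()
```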
0
votes
0 answers

I want to accept an input of size around 10^5, but my Python code only takes 4096 characters

Example, let's say, in Python: if __name__ == '__main__': q = int(input()) for _ in range(q): string = input() # assume the string's length is more than 5000 print(len(string)) # this displays 4096 no matter how…
Kaustubh.P.N
  • 229
  • 1
  • 8
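The 4096-character cap usually comes from the terminal's canonical input mode rather than from Python itself; a sketch that reads everything from stdin (e.g. redirected from a file with `python solve.py < input.txt`) avoids the limit:

```python
import sys

def main():
    # Read the whole input at once instead of line-by-line input() calls,
    # which are limited by the terminal's line buffer when typed interactively.
    data = sys.stdin.read().split("\n")
    q = int(data[0])
    for i in range(1, q + 1):
        s = data[i]
        print(len(s))  # full length, even for strings longer than 4096

if __name__ == "__main__":
    main()
```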
0
votes
1 answer

Automatically insert a large number of records into a PostgreSQL table

I need to populate my PostgreSQL table with a large number of random records, around 200k. CREATE TABLE qr_code.tbl_transaction ( transaction_id varchar NOT NULL, importo numeric NOT NULL, alias varchar NOT NULL, order_id varchar NOT…
resla95
  • 1,017
  • 2
  • 11
  • 18
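A bulk-insert sketch with psycopg2's execute_values, using placeholder connection details and only the columns visible in the excerpt; any further NOT NULL columns in the real table would need values as well:

```python
import random
import uuid
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=postgres password=secret host=localhost")
cur = conn.cursor()

# 200k rows of random data for the columns shown in the question.
rows = [
    (str(uuid.uuid4()),                  # transaction_id
     round(random.uniform(1, 500), 2),   # importo
     f"alias_{i}",                       # alias
     f"order_{i}")                       # order_id
    for i in range(200_000)
]

# execute_values batches the inserts, which is far faster than issuing
# 200k single-row INSERT statements.
execute_values(
    cur,
    "INSERT INTO qr_code.tbl_transaction "
    "(transaction_id, importo, alias, order_id) VALUES %s",
    rows,
    page_size=10_000,
)
conn.commit()
cur.close()
conn.close()
```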
0
votes
1 answer

Purging numpy.memmap

Given a numpy.memmap object created with mode='r' (i.e. read-only), is there a way to force it to purge all loaded pages out of physical RAM, without deleting the object itself? In other words, I'd like the reference to the memmap instance to remain…
NPE
  • 486,780
  • 108
  • 951
  • 1,012
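One option sometimes suggested on Linux (Python 3.8+) is madvise(MADV_DONTNEED) on the underlying mmap; note that ._mmap is a private NumPy attribute, so this is a sketch rather than a stable interface:

```python
import mmap
import numpy as np

# Placeholder file; the memmap is opened read-only as in the question.
arr = np.memmap("big_array.dat", dtype=np.float64, mode="r")

# Ask the kernel to drop the resident pages behind this mapping; the data
# is simply re-read from disk on the next access. Linux-only, Python 3.8+.
arr._mmap.madvise(mmap.MADV_DONTNEED)
```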
0
votes
1 answer

How do I efficiently process a large dataframe row-by-row?

I have a large dataframe (10,000,000+ rows) that I would like to process. I'm also fairly new to R, and want to better understand how to work with large datasets like this. I have a formula that I want to apply to each row in the dataframe. But I've…
Andrew
  • 85
  • 2
  • 6
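The usual advice, in R as in pandas, is to apply the formula to whole columns rather than looping over rows; a pandas illustration with a made-up formula, since the question's own formula is not shown:

```python
import numpy as np
import pandas as pd

# Hypothetical data and formula.
n = 1_000_000
df = pd.DataFrame({"a": np.random.rand(n), "b": np.random.rand(n)})

# Row-by-row (slow):  df.apply(lambda r: r["a"] * 2 + r["b"], axis=1)
# Vectorised (fast), same result:
df["result"] = df["a"] * 2 + df["b"]
```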
0
votes
2 answers

Is there any way to split a SAS file of around 16GB into multiple files/dataframes in Python?

I have a raw SAS file that is around 16GB, and even after keeping columns relevant to my problem, the file size comes to around 8GB. It kind of looks like this: CUST_ID FIELD_1 FIELD_2 FIELD_3 ... FIELD_7 1 65 786 ABC …
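A chunked sketch with pandas.read_sas, writing each piece to its own file so the full 16GB never has to fit in memory; the file name, chunk size and output format are placeholders:

```python
import pandas as pd

# read_sas with chunksize returns an iterator of dataframes.
# to_parquet requires pyarrow or fastparquet to be installed.
for i, chunk in enumerate(pd.read_sas("big_file.sas7bdat", chunksize=500_000)):
    chunk.to_parquet(f"part_{i:03d}.parquet", index=False)
```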
0
votes
0 answers

How to (and in which program) process a large amount of data (observations)?

I have a project where my data universe is loaded in Oracle SQL Developer and is around 86 million rows. Of course I want to apply a methodology, such as clustering, to reduce the number of observations, and in order to do that I use a…
Salma Guzmán
  • 13
  • 1
  • 4
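One common approach is to let Oracle draw a sample and cluster that sample in Python; a sketch with placeholder connection details, schema and table names:

```python
import pandas as pd
import oracledb  # python-oracledb driver
from sklearn.cluster import MiniBatchKMeans

conn = oracledb.connect(user="user", password="password",
                        dsn="host:1521/service_name")

# SAMPLE (1) asks Oracle for roughly 1% of the rows (~860k of 86M), which is
# usually small enough to pull into memory and fit a clustering model on.
sample = pd.read_sql("SELECT * FROM my_schema.my_table SAMPLE (1)", conn)

# Cluster only the numeric columns of the sample.
labels = MiniBatchKMeans(n_clusters=8, random_state=0).fit_predict(
    sample.select_dtypes("number").dropna()
)
```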
0
votes
1 answer

Is there anything like shared memory in Dask for a large-object multiprocessing job?

In a regression test, I got a 1000*100000 pandas dataframe like this: df=pd.DataFrame(np.random.random((1000,100))) The first column is the y label, the others are x1-x99. I need to pick out three or seven x variables to fit y, run each regression, and get all…
WilsonF
  • 85
  • 6
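dask.distributed can get close to shared memory by scattering the dataframe to the workers once and handing every task a reference to it; a sketch with a toy OLS standing in for the real regression, limited to the first 1,000 variable triples:

```python
import numpy as np
import pandas as pd
from itertools import combinations, islice
from dask.distributed import Client

def fit_one(df, cols):
    """Toy stand-in for one regression: OLS of column 0 on the chosen x columns."""
    X = df[list(cols)].to_numpy()
    y = df[0].to_numpy()
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return cols, coef

if __name__ == "__main__":
    client = Client()  # local cluster with several worker processes

    df = pd.DataFrame(np.random.random((1000, 100)))  # column 0 = y, 1-99 = x

    # scatter(..., broadcast=True) copies the dataframe to every worker once;
    # each submitted task then receives a lightweight reference instead of
    # re-serialising the whole dataframe per task.
    df_ref = client.scatter(df, broadcast=True)

    combos = islice(combinations(range(1, 100), 3), 1000)
    futures = [client.submit(fit_one, df_ref, c) for c in combos]
    results = client.gather(futures)
    print(len(results), "regressions fitted")
```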