Questions tagged [data-processing]

Data Processing concerns the converting of raw data to machine-readable form and its subsequent processing (as storing, updating, rearranging, or printing out) by a computer.

909 questions
6
votes
2 answers

Stored Procedure or Code

I am not asking for opinions, but rather for documentation. We have a lot of data files (XML, CSV, plain text, etc.) and need to process and mine them. The lead database person suggested using stored procedures to accomplish the task. Basically…
5
votes
1 answer

I receive a MemError when I interpolate data with generator expressions and iterators from a regular grid on a mesh >2000 times

This is my first question here on Stack Overflow, as I have just started scripting with Python 3. Application: I made a Python 3 script that writes the load definition of a moveable heat source for a finite element simulation in LS-Dyna. As source I have a…
5
votes
2 answers

Lexicon dictionary for synonym words

There are a few dictionaries available for natural language processing, such as dictionaries of positive and negative words. Is there a dictionary that contains a list of synonyms for every dictionary word? For example, for nice the synonyms would be: enjoyable,…
5
votes
3 answers

Intensive file I/O and data processing in C#

I'm writing an app which needs to process a large text file (comma-separated with several different types of records - I do not have the power or inclination to change the data storage format). It reads in records (often all the records in the file…
We Are All Monica
  • 13,000
  • 8
  • 46
  • 72
5
votes
2 answers

CPU bound applications vs. IO bound

For 'number-crunching' style applications that use a lot of data (read: "hundreds of MB, but not into GB", i.e., it will fit nicely into memory beside the OS), does it make sense to read all your data into memory first before starting processing to…
Matthew Scharley
  • 127,823
  • 52
  • 194
  • 222
5
votes
5 answers

Processing a large amount of data in parallel

I'm a Python developer with pretty good RDBMS experience. I need to process a fairly large amount of data (approx. 500 GB). The data is sitting in approximately 1200 CSV files in S3 buckets. I have written a script in Python and can run it on a…
David S
  • 12,967
  • 12
  • 55
  • 93
5
votes
1 answer

Read in specific, pattern-matched rows from a file

I have a file that is tab-delimited and contains multiple tables each headed by a title, for example "Azuay\n", "Bolivar\n", "Cotopaxi\n", etc, and each table separated by two newlines. Within R, how can I read in this file and select only the table…
Kaleb
  • 1,022
  • 1
  • 15
  • 26
4
votes
2 answers

How to track number of distinct values incrementally from a spark table?

Suppose we have a very large table that we'd like to process statistics for incrementally.

Date        Amount  Customer
2022-12-20  30      Mary
2022-12-21  12      Mary
2022-12-20  12      Bob
2022-12-21  15      Bob
2022-12-22  15      Alice

We'd like to be able to…
4
votes
2 answers

Filter rows of 1st Dataframe from the 2nd Dataframe having different starting dates

I have two dataframes from which a new dataframe has to be created. The first one is given below. data = {'ID':['A', 'A', 'A', 'A', 'A', 'B','B','B','B', 'C','C','C','C','C','C', 'D','D','D'], 'Date':['2021-2-13', '2021-2-14', '2021-2-15',…
Shiva
  • 212
  • 2
  • 11
4
votes
2 answers

Interpolation / stretching out of values in vector to a specified length

I have vectors of different lengths. For example, a1 = c(1,2,3,4,5,6,7,8,9,10), a2 = c(1,3,4,5), a3 = c(1,2,5,6,9). I want to stretch a2 and a3 to the length of a1, so I can run some algorithms on them that require the lengths of the vectors to be the…
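
One possible reading of "stretching" here is linear interpolation onto an evenly spaced grid of the target length. The excerpt is truncated and the question itself is about R, so the following is only a sketch in Python of that one interpretation; `stretch` is a hypothetical helper name, not something from the question.

```python
def stretch(values, target_len):
    """Linearly resample `values` onto `target_len` evenly spaced points.

    Hypothetical helper: assumes the values sit on an evenly spaced grid
    and that linear interpolation is an acceptable way to fill the gaps.
    """
    if target_len < 1 or len(values) == 0:
        raise ValueError("need a non-empty input and a positive target length")
    if len(values) == 1 or target_len == 1:
        return [float(values[0])] * target_len

    # Map output index i to a fractional position in the input.
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)                       # left neighbour
        hi = min(lo + 1, len(values) - 1)   # right neighbour, clamped at the end
        frac = pos - lo                     # weight of the right neighbour
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


a2_stretched = stretch([1, 3, 4, 5], 10)   # same length as a1
```

The first and last input values are preserved (up to floating-point rounding), and everything in between is a weighted average of its two nearest input neighbours.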
4
votes
3 answers

What is the optimal way to process a very large (over 30GB) text file and also show progress

[newbie question] Hi, I'm working on a huge text file which is well over 30GB. I have to do some processing on each line and then write it to a db in JSON format. When I read the file and loop using "for" my computer crashes and displays blue…
Raj K
  • 43
  • 3
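
A common way to handle a file this large is to stream it one line at a time rather than loading it whole, and to estimate progress from the bytes consumed so far against the file size. The sketch below assumes Python (the excerpt does not name a language) and writes JSON lines to an output file as a stand-in for the database insert; `process_lines`, `transform`, and `report_every` are hypothetical names.

```python
import json
import os


def process_lines(src_path, dst_path, transform, report_every=100_000):
    """Stream src_path line by line and write one JSON document per line.

    Assumptions: input is UTF-8 text, `transform` maps a line (str) to a
    JSON-serialisable object, and a file stands in for the real database.
    """
    total = os.path.getsize(src_path)  # total bytes, used for the progress estimate
    done = 0
    count = 0
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for raw in src:                # iterating the file keeps only one line in memory
            done += len(raw)
            line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
            dst.write(json.dumps(transform(line)) + "\n")
            count += 1
            if count % report_every == 0:
                print(f"{done / total:.1%} done ({count} lines)")
    return count
```

Reading in binary mode makes the byte count exact, so the percentage stays meaningful even when the text contains multi-byte characters.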
4
votes
1 answer

How to perform offline image augmentation using Keras?

I want to perform offline image augmentation for different image classes in my dataset and save the images to one of the folders before I start creating the model. Using Keras ImageDataGenerator - flow_from_directory() which has save_to_dir and…
k92
  • 375
  • 3
  • 15
4
votes
1 answer

Side inputs vs normal constructor parameters in Apache Beam

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement need to be passed as side inputs? Is it OK if they are passed as…
4
votes
1 answer

Numpy - Normalize RGB image dataset

My dataset is a Numpy array with dimensions (N, W, H, C), where N is the number of images, H and W are height and width respectively and C is the number of channels. I know that there are many tools out there but I would like to normalize the images…
cmplx96
  • 1,541
  • 8
  • 34
  • 48
4
votes
1 answer

How to smooth a curve with large noise which is only in certain part?

I'd like to smooth a scatter plot shown below (the points are very dense), and the data is here. There is large noise in the middle of the curve, and I'd like to smooth the curve, also the y value should monotonically increase. Since there are…
Tom
  • 191
  • 1
  • 11