Questions tagged [data-processing]

Data processing is the conversion of raw data into machine-readable form and its subsequent processing (such as storing, updating, rearranging, or printing out) by a computer.

909 questions
87
votes
3 answers

Large scale data processing Hbase vs Cassandra

I have nearly settled on Cassandra after my research on large scale data storage solutions. But it's generally said that HBase is a better solution for large scale data processing and analysis. While both are the same kind of key/value storage and both are/can run…
Gary Lindahl
  • 5,341
  • 2
  • 19
  • 18
70
votes
5 answers

Best way to format large JSON file? (~30 mb)

I need to format a large JSON file for readability, but every resource I've found (mostly online) can't handle data above, say, 1-2 MB. I need to format about 30 MB. Is there any way to do this, or any way to code something to do this?
covariance
  • 6,833
  • 7
  • 23
  • 24
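
For the JSON formatting question above, a minimal sketch using only Python's standard `json` module. It assumes the file fits in memory once parsed (usually true for about 30 MB); the function name `pretty_print_json` and its parameters are hypothetical, chosen for illustration.

```python
import json


def pretty_print_json(src_path, dst_path, indent=2):
    """Parse a JSON file fully into memory and write an indented copy."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # ensure_ascii=False keeps non-ASCII characters readable in the output.
        json.dump(data, dst, indent=indent, ensure_ascii=False)
        dst.write("\n")
```

The standard library also exposes a similar tool on the command line: `python -m json.tool input.json output.json`.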
49
votes
6 answers

how to use pandas filter with IQR

Is there a built-in way to filter a column by IQR (i.e. keep values between Q1-1.5IQR and Q3+1.5IQR)? Also, any other generalized filtering approaches in pandas would be appreciated.
Qijun Liu
  • 1,685
  • 1
  • 13
  • 11
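
For the IQR question above, one possible sketch built from `Series.quantile` and `Series.between`. It is not presented as a single built-in; the helper name `filter_iqr`, the multiplier parameter `k`, and the sample data are assumptions for illustration.

```python
import pandas as pd


def filter_iqr(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default.
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


# Hypothetical sample data: 100 is an obvious outlier.
df = pd.DataFrame({"x": [1, 2, 3, 4, 100]})
filtered = filter_iqr(df, "x")
print(filtered["x"].tolist())  # [1, 2, 3, 4]
```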
44
votes
4 answers

Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)

I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file. The only options I know of are read.table which is very wasteful when I only want…
Alex Stoddard
  • 8,244
  • 4
  • 41
  • 61
36
votes
4 answers

Lua vs Embedded Lisp and potential other candidates. for set based data processing

Current Choice: lua-jit. Impressive benchmarks, I am getting used to the syntax. Writing a high performance ABI will require careful consideration on how I will structure my C++. Other Questions of interest Gambit-C and Guile as embeddable…
Hassan Syed
  • 20,075
  • 11
  • 87
  • 171
27
votes
3 answers

Handling missing/incomplete data in R--is there function to mask but not remove NAs?

As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance: Many R functions have an na.rm flag that, when set to TRUE, removes the NAs: >>> v = mean( c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67),…
doug
  • 69,080
  • 24
  • 165
  • 199
26
votes
3 answers

What is the difference between mini-batch vs real time streaming in practice (not theory)?

What is the difference between mini-batch vs real time streaming in practice (not theory)? In theory, I understand mini-batch is something that batches data within a given time frame, whereas real time streaming is more like doing something as the data arrives…
19
votes
14 answers

Algorithm for grouping anagram words

Given a set of words, we need to find the anagram words and display each category alone using the best algorithm. input: man car kile arc none like output: man car arc kile like none The best solution I am developing now is based on a hashtable,…
Ahmed
  • 7,148
  • 12
  • 57
  • 96
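
For the anagram question above, a minimal sketch of a hashtable-based grouping: each word is keyed by its sorted letters, so anagrams share a key. The function name `group_anagrams` is hypothetical, and this is one common approach rather than the asker's own solution.

```python
from collections import defaultdict


def group_anagrams(words):
    """Group words that are anagrams of each other, in first-seen order."""
    groups = defaultdict(list)
    for word in words:
        # Anagrams have identical sorted letter sequences.
        key = "".join(sorted(word))
        groups[key].append(word)
    return list(groups.values())


print(group_anagrams("man car kile arc none like".split()))
# [['man'], ['car', 'arc'], ['kile', 'like'], ['none']]
```

Sorting each word costs O(L log L) for a word of length L, so the whole pass is roughly O(N · L log L) for N words.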
18
votes
5 answers

frameworks for representing data processing as a pipeline

Most data processing can be envisioned as a pipeline of components, the output of one feeding into the input of another. A typical processing pipeline is: reader | handler | writer As a foil for starting this discussion, let's consider an…
ErikR
  • 51,541
  • 9
  • 73
  • 124
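
The `reader | handler | writer` shape in the question above can be sketched with plain Python generators. This is only an illustration of the idea, not a framework recommendation; the stage bodies (stripping newlines, upper-casing, appending to a list) are assumptions.

```python
def reader(lines):
    """Source stage: yield each input line without its trailing newline."""
    for line in lines:
        yield line.rstrip("\n")


def handler(records):
    """Transform stage: drop empty records and upper-case the rest."""
    for record in records:
        if record:
            yield record.upper()


def writer(records, sink):
    """Sink stage: append every record to `sink` and return it."""
    for record in records:
        sink.append(record)
    return sink


# Stages compose by nesting; each record flows through lazily, one at a time.
result = writer(handler(reader(["a\n", "\n", "b\n"])), [])
print(result)  # ['A', 'B']
```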
10
votes
4 answers

Most efficient way to use a large data set for PyTorch?

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation. I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my…
Doug MacArthur
  • 125
  • 1
  • 2
  • 9
10
votes
5 answers

Hibernate out of memory exception while processing large collection of elements

I am trying to process a collection of heavyweight elements (images). The size of the collection varies between 8000 and 50000 entries. But for some reason, after processing 1800-1900 entries, my program fails with java.lang.OutOfMemoryError: Java heap space. In…
Yurii Bondarenko
  • 3,460
  • 6
  • 28
  • 47
9
votes
3 answers

Can I use Layer Normalization with CNN?

I see that Layer Normalization is a more modern normalization method than Batch Normalization, and it is very simple to code in TensorFlow. But I think layer normalization is designed for RNNs, and batch normalization for CNNs. Can I use the…
9
votes
1 answer

java framework for aggregation and sliding windows implementation

I have an event stream and a key-value store. The value size is limited to 4Kb. The event rate is not very heavy: at most hundreds a day. In this value I need to store a serialized representation of a data structure that provides an efficient…
aviad
  • 8,229
  • 9
  • 50
  • 98
8
votes
6 answers

free secure distributed make system for linux

Are there any good language-agnostic distributed make systems for Linux that are secure and free? Background Information: I run scientific experiments (computer-science ones) that sometimes have large dependency trees, occasionally on the order of…
Mr Fooz
  • 109,094
  • 6
  • 73
  • 101
8
votes
2 answers

Python Pandas replace values by their opposite sign

I am trying to "clean" some data. I have values which are negative, which they cannot be. I would like to replace all negative values with their corresponding positive values. A | B | C -1.9 | -0.2 | 'Hello' 1.2 | 0.3 |…
eleijonmarck
  • 4,732
  • 4
  • 22
  • 24
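
For the pandas question above, a small sketch that takes the absolute value of the numeric columns only and leaves text columns untouched. The frame mirrors the excerpt's columns; the second value in column C is truncated in the excerpt, so `'World'` here is a made-up placeholder.

```python
import pandas as pd

# Sample frame modeled on the excerpt; 'World' is a hypothetical fill-in.
df = pd.DataFrame({
    "A": [-1.9, 1.2],
    "B": [-0.2, 0.3],
    "C": ["Hello", "World"],
})

# Select only numeric columns so the string column is not touched.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].abs()

print(df["A"].tolist())  # [1.9, 1.2]
```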