Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it is usable at all, up to transforming data so that algorithms can handle it or produce better results. Preferably, tags for the specific methods involved should be used as well. Use this tag for meaningful preprocessing steps in a data pipeline, whether they come before an algorithm or stand alone as a method.

Data preprocessing applies at multiple stages at which data can persist. It can happen at a higher level, right before more meaningful processing steps such as analysis take place.
But preprocessing also starts as soon as raw data is generated and must be brought into a meaningful and usable format. Currently the tag data-manipulation fits this lower-level description better, as does data-structures if the way the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, is also a major part of it. For that, prefer the tags data-cleaning and/or data-wrangling.

The data-preprocessing tag should focus more on rearranging and transforming data so that algorithms can use it or produce better results. Examples of preprocessing are encoding, scaling, or normalizing an already formatted dataset.
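As a rough illustration of the scaling and encoding steps mentioned above, here is a minimal standard-library sketch. The helper names (`min_max_scale`, `standardize`, `one_hot_encode`) and the sample data are made up for this example; in practice a library such as scikit-learn offers equivalents like `MinMaxScaler`, `StandardScaler`, and `OneHotEncoder`.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot_encode(labels):
    """Return the sorted categories and one 0/1 row per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = [
        [1 if index[label] == j else 0 for j in range(len(categories))]
        for label in labels
    ]
    return categories, rows


# Hypothetical, already formatted columns of a small dataset.
ages = [20, 30, 40]
colors = ["red", "green", "red"]

scaled_ages = min_max_scale(ages)
standardized_ages = standardize(ages)
categories, encoded_colors = one_hot_encode(colors)
```

Which of the two scalings is appropriate depends on the algorithm downstream; the sketch only shows the mechanics, not a recommendation.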

Preprocessing algorithms and techniques can be found in the scikit-learn modules Preprocessing and Normalization.

Further theory and examples showing why data preprocessing is necessary are discussed in the scikit-learn section Preprocessing data.

488 questions

votes

1 answer

Changing some values in a row of pd.DataFrame leads to SettingWithCopyWarning pandas python

I'm making a stock app, and when I get data and edit it, I see this error. The code is simple: df = yf.download(ticker, period=period, interval='1wk', auto_adjust=True, threads=True) Here I get a DataFrame like below: Open High …

python pandas dataframe data-preprocessing

asked May 25 '22 at 22:10

alex-uarent-alex

votes

2 answers

Polymorphic data transformation techniques / data lake/ big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in xml format and contains multiple independent groups of information which have a lot of nested elements. Clients have different…

apache-spark bigdata databricks data-processing data-preprocessing

asked May 07 '22 at 09:29

Amin M

votes

0 answers

Implementation of two Sliding Windows over a multivariate sequence of data in python

I am trying to construct two sliding windows over a multivariate sequence of data (m*n). The first window should be fixed and the second one is rolling over the data samples. Both windows have the same size. I followed this postdistance…

python stream time-series sliding-window data-preprocessing

asked Mar 09 '22 at 02:57

A Sam

votes

1 answer

Extracting Instrument Qualities From Audio Signal

I'm looking to write a function that takes an audio signal (assuming it contains a single instrument playing), out of which I would like to extract the instrument-like features out of the audio and into a vector space. So in theory, if I had two…

data-extraction audio-processing data-preprocessing

asked Jan 24 '22 at 22:09

Ori Yonay

votes

0 answers

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) on this data while using as many regressors as possible. Because of the size of the data, I am using incremental training -…

scikit-learn one-hot-encoding data-preprocessing

asked Jan 09 '22 at 12:48

Petr

votes

2 answers

How to get validation set which has equal number of images for each class using tensorflow?

I'm now using the CIFAR-100 dataset to train a model. I'd like to use 10% of the train data as validation data. I used the code below in the beginning. (train_images, train_labels), (test_images, test_labels) = datasets.cifar100.load_data() train_images,…

python tensorflow validation data-preprocessing

asked Dec 16 '21 at 05:29

Janet

votes

1 answer

Implementing sklearn PCA on limited number of variables in a pipeline

I'm setting up a machine learning pipeline to classify some data. One source of the data is a very good candidate for PCA and makes up the last n dimensions of the dataset. I would like to use PCA on these variables but not the preceding variables.…

machine-learning pca dimensionality-reduction feature-engineering data-preprocessing

asked Dec 07 '21 at 09:07

A. Bollans

votes

1 answer

Using TF timeseries_dataset_from_array with more samples

I have to handle a huge amount of samples, where each sample contains unique time series. The goal is to feed this data into the Tensorflow LSTM model and predict some features. I have created the tf timeseries_dataset_from_array generator function…

python tensorflow time-series data-preprocessing

asked Dec 07 '21 at 19:35

Gábor Kőrösi

votes

0 answers

R recipe packages - Remove outliers

I am currently writing my master's thesis and performing a data analysis with R. I decided to use the recipe package and to follow the approach of the following book: https://bradleyboehmke.github.io/HOML/. In this book (precisely in Chapter 3.5), outlier…

r outliers recipe data-preprocessing

asked Dec 07 '21 at 12:06

user304405

votes

0 answers

String cleaning/preprocessing for BERT

So my goal is to train a BERT Model on wikipedia data that I derive right from Wikipedia. The contents that I scrape from the site look like this (example): "(148975) 2001 XA255, provisional designation: 2001 XA255, is a dark minor planet in the…

python deep-learning nlp bert-language-model data-preprocessing

asked Nov 22 '21 at 15:04

Heidedo

votes

1 answer

How to map string to a int (class number) as a part of tf.data.Dataset .map() for preprocessing in Tensorflow?

I'm trying to create a parser function that reads an (image, label) pair from TFRecord. When the label is an int64, all works well, however when I try and save the label as a string and convert it to an int in the parser function, things break. I'm…

python tensorflow data-preprocessing

asked Nov 14 '21 at 13:41

miluz

votes

1 answer

Keras preprocessing layer

I am trying to feed a neural network 50 features (All Yes/No values) to predict the probability of one Yes/No label. I am trying to do this with keras CategoryEncoding, but running into some issues. The start of my code is below: model =…

python tensorflow keras data-preprocessing

asked Nov 11 '21 at 17:56

zeromodz15

votes

3 answers

How to assign a "reseting" group number by the second grouping variable in R?

My data looks like this: Measurement Compound Measure 1 A 111 1 A 222 1 B 333 1 B 444 2 C 555 2 C 666 2 D 777 2 D 888 And I'm trying to assign a "reseting" group number based on…

r data-manipulation data-preprocessing

asked Nov 05 '21 at 07:39

Gorgonzola45

votes

2 answers

Pandas: Append copy of rows changing only values in multiple columns larger than max allowed to split bin values

Problem: I have a data frame that I need to modify based on the values of a particular column. If any column value is greater than the maximum allowed, then a new row will be created based upon distribution into equally sized bins (taking…

python pandas data-preprocessing

asked Oct 25 '21 at 12:21

Alpha

votes

1 answer

During calculation of "distance average" in knn imputation method for replacing NaN value in particular column

I encountered this problem when implementing the KNN imputation method for handling missing data from scratch. I created a dummy dataset and found the nearest neighbors for rows that contain missing values. Here is my dataset: A B C D …

dataframe machine-learning sklearn-pandas feature-engineering data-preprocessing

asked Aug 24 '21 at 04:20

CS Vyas
