Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it is actually usable, up to transforming data so that it can be handled by algorithms or improves their results. Preferably, tags for the specific methods should be used as well. Use this tag for meaningful preprocessing steps in a data pipeline, either prior to algorithms or as a standalone method.

Data preprocessing applies at multiple stages in which data can persist. It can happen at a higher level, right before more meaningful processing steps such as analysis take place.
But preprocessing also starts when raw data is generated and must be brought into a meaningful and usable format. Currently the tag fits this lower level description better, likewise if the structure of how the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, is also a major part of it. For that prefer to use the tag and/or .

This tag should focus more on rearranging and transforming data so that it is usable by algorithms or improves their results. Examples of preprocessing are the encoding, scaling, or normalization of an already formatted dataset.

Preprocessing algorithms and techniques can be found in scikit-learn modules Preprocessing and Normalization:

Further theory and examples on the necessity of data preprocessing are discussed in the section scikit-learn - Preprocessing data.
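
As a rough illustration of the encoding and scaling steps named above, here is a minimal sketch in plain Python. The helper names (min_max_scale, standardize, one_hot_encode) and the sample values are hypothetical and exist only for this example; in practice, scikit-learn's sklearn.preprocessing module offers MinMaxScaler, StandardScaler, and OneHotEncoder for the same purposes.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot_encode(labels):
    """Return the sorted category list and one 0/1 row per input label."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = [
        [1 if index[label] == j else 0 for j in range(len(categories))]
        for label in labels
    ]
    return categories, rows


# Hypothetical sample data: one numeric and one categorical feature.
heights_cm = [150.0, 160.0, 170.0, 180.0]
colors = ["red", "green", "red", "blue"]

scaled = min_max_scale(heights_cm)
standardized = standardize(heights_cm)
categories, encoded = one_hot_encode(colors)
```

Min-max scaling keeps the original shape of the distribution inside a fixed range, while standardization is usually the better fit when an algorithm assumes centered features. Which one is appropriate depends on the model and the data, so treat this as a starting point rather than a recommendation.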

488 questions
2
votes
1 answer

Changing some values in a row of pd.DataFrame leads to SettingWithCopyWarning pandas python

I'm making a stock app, and when I get data and edit it, I see this error. The code is simple: df = yf.download(ticker, period=period, interval='1wk', auto_adjust=True, threads=True) Here I get a DataFrame like below: Open High …
alex-uarent-alex
2
votes
2 answers

Polymorphic data transformation techniques / data lake/ big data

Background: We are working on a solution to ingest huge sets of telemetry data from various clients. The data is in xml format and contains multiple independent groups of information which have a lot of nested elements. Clients have different…
2
votes
0 answers

Implementation of two Sliding Windows over a multivariate sequence of data in python

I am trying to construct two sliding windows over a multivariate sequence of data (m*n). The first window should be fixed and the second one is rolling over the data samples. Both windows have the same size. I followed this post distance…
2
votes
1 answer

Extracting Instrument Qualities From Audio Signal

I'm looking to write a function that takes an audio signal (assuming it contains a single instrument playing), out of which I would like to extract the instrument-like features out of the audio and into a vector space. So in theory, if I had two…
2
votes
0 answers

Incremental OneHotEncoding and Target Encoding

I am working with a large tabular dataset that consists of many categorical columns. I want to train a regression model (XGBoost) on this data while using as many regressors as possible. Because of the size of the data, I am using incremental training -…
Petr
2
votes
2 answers

How to get validation set which has equal number of images for each class using tensorflow?

I'm currently using the CIFAR-100 dataset to train a model. I'd like to use 10% of the training data as validation data. I used the code below at first. (train_images, train_labels), (test_images, test_labels) = datasets.cifar100.load_data() train_images,…
Janet
2
votes
1 answer

Implementing sklearn PCA on limited number of variables in a pipeline

I'm setting up a machine learning pipeline to classify some data. One source of the data is a very good candidate for PCA and makes up the last n dimensions of the dataset. I would like to use PCA on these variables but not the preceding variables.…
2
votes
1 answer

Using TF timeseries_dataset_from_array with more samples

I have to handle a huge amount of samples, where each sample contains unique time series. The goal is to feed this data into the Tensorflow LSTM model and predict some features. I have created the tf timeseries_dataset_from_array generator function…
2
votes
0 answers

R recipe packages - Remove outliers

I am currently writing my master's thesis and performing a data analysis with R. I decided to use the recipe package and to follow the approach of the following book: https://bradleyboehmke.github.io/HOML/. In this book (precisely in Chapter 3.5), outlier…
user304405
2
votes
0 answers

String cleaning/preprocessing for BERT

So my goal is to train a BERT Model on wikipedia data that I derive right from Wikipedia. The contents that I scrape from the site look like this (example): "(148975) 2001 XA255, provisional designation: 2001 XA255, is a dark minor planet in the…
2
votes
1 answer

How to map string to a int (class number) as a part of tf.data.Dataset .map() for preprocessing in Tensorflow?

I'm trying to create a parser function that reads an (image, label) pair from TFRecord. When the label is an int64, all works well, however when I try and save the label as a string and convert it to an int in the parser function, things break. I'm…
miluz
2
votes
1 answer

Keras preprocessing layer

I am trying to feed a neural network 50 features (All Yes/No values) to predict the probability of one Yes/No label. I am trying to do this with keras CategoryEncoding, but running into some issues. The start of my code is below: model =…
zeromodz15
2
votes
3 answers

How to assign a "reseting" group number by the second grouping variable in R?

My data looks like this:

Measurement  Compound  Measure
1            A         111
1            A         222
1            B         333
1            B         444
2            C         555
2            C         666
2            D         777
2            D         888

And I'm trying to assign a "resetting" group number based on…
2
votes
2 answers

Pandas: Append copy of rows changing only values in multiple columns larger than max allowed to split bin values

Problem: I have a data frame that I need to modify based on the values of a particular column. If any column value is greater than the maximum allowed, then a new row will be created based upon distribution into equally sized bins (taking…
Alpha
2
votes
1 answer

During calculation of "distance average" in knn imputation method for replacing NaN value in particular column

I encountered this problem while implementing the KNN imputation method for handling missing data from scratch. I created a dummy dataset and found the nearest neighbors for rows that contain missing values. Here is my dataset: A B C D …