Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it is actually usable to transforming data so that algorithms can handle it or produce better results. Preferably, tags for the specific methods should be used as well. Use this tag for meaningful preprocessing steps in a data pipeline, either prior to algorithms or as a standalone method.

Data preprocessing applies at multiple stages in which data can persist. At a higher level, it happens right before more meaningful processing steps such as analysis take place.
But preprocessing also starts when raw data is generated and must be brought into a meaningful and usable format. Currently the tag fits this lower-level description better, likewise if the structure of how the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, are also a major part of it. For that prefer to use the tag and/or .

This tag should focus more on the rearrangement and transformation of data so that it is usable by algorithms or improves their results. Examples of preprocessing are encoding, scaling, or normalizing an already formatted dataset.
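The kinds of transformation named above (encoding, scaling, normalization) can be sketched in a few lines. This is a minimal, standard-library-only illustration rather than a reference implementation; the function names and the sample values are hypothetical and chosen only for the example. In practice, scikit-learn's MinMaxScaler, StandardScaler and OneHotEncoder cover the same ground.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot_encode(labels):
    """Turn categorical labels into 0/1 indicator rows, one column per category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical sample data, for illustration only.
heights = [150.0, 160.0, 170.0, 180.0]
colors = ["red", "blue", "red", "green"]

scaled = min_max_scale(heights)
standardized = standardize(heights)
categories, encoded = one_hot_encode(colors)
```

Scaling and standardization are fitted on the data they see, so in a real pipeline the minimum, maximum, mean and standard deviation would normally be computed on the training split only and then reused for new data.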

Preprocessing algorithms and techniques can be found in the scikit-learn modules Preprocessing and Normalization.

Further theory and examples of why data preprocessing is necessary are discussed in the section scikit-learn - Preprocessing data.

488 questions
-1
votes
1 answer

convert multiple columns into the categories of one column in pandas

This is a dataset which is converted using one hot encoding, 0 means no and 1 means yes data:
ID    Red  Blue  Green  Yellow  Orange
1001  1    0     1      0       1
1002  0    1     0      1       0
1003  0    0     0      1       1
1004  0    0     0      0       0
1005  1    0     0      1       0
How to convert the above one…
-2
votes
0 answers

Specifying the columns using strings is only supported for pandas DataFrames and 'numpy.ndarray' object has no attribute 'columns'

This is my first machine learning project and the first time that I use ColumnTransformer. My aim is to perform two steps of data preprocessing, and use ColumnTransformer for each of them. In the first step, I use FunctionTransformer() and put it in…
-2
votes
1 answer

How to remove characters that repeat more than twice in a row/together in a string using python?

How can we reduce a string like haaaaaaapppppyyyyyy to haappyy, such that a character may repeat at most twice in a row? This should apply to any character (special characters included), for example converting --------------------- to --
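One possible way to approach the question above, as a minimal sketch: a regular expression with a backreference can match any character followed by two or more copies of itself and replace the whole run with exactly two. The function name squeeze_repeats is hypothetical, and the sketch assumes that runs of length one or two should be left untouched.

```python
import re


def squeeze_repeats(text):
    """Collapse any run of 3+ identical characters down to exactly two."""
    # (.) captures one character; \1{2,} requires at least two more copies of it.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


happy = squeeze_repeats("haaaaaaapppppyyyyyy")
dashes = squeeze_repeats("---------------------")
```

By default the dot does not match newline characters, so runs of blank lines are not collapsed unless re.DOTALL is passed.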
-2
votes
1 answer

Python: get the max value with the location above and below than the max

If I have a dataframe like this,
index  User  Value  location
1      1     1.0    4.5
2      1     1.5    5.2
3      1     3.0    7.0
4      1     2.5    7.5
5      2     1.0    11.5
6      2     1.25   14.1
7      2     2.0    …
Amal Nasir
-2
votes
1 answer

Data Preprocessing for KNN in python

Preprocessing takes a lot of time to understand: tuple, list, float, and array structures. I have data that looks like
NIrbhay Mathur
-3
votes
0 answers

Is decision tree a right model to create a graph recommendation model to plot .csv files?

I am working on a project to build a graph recommendation model using machine learning algorithms, but I'm facing challenges with data preprocessing and algorithm selection. My goal is to recommend appropriate graphs when a .csv file is provided as…
-3
votes
2 answers

NameError: name 'data' is not defined (Python)

I am running this code in Python and I don't know why it always raises an error: import string import nltk from sklearn.pipeline import Pipeline import pandas as pd import numpy as np import re data = pd.read_csv(r'C:\Users\Prihantoro Tri…