Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it is usable at all, to transforming data so that algorithms can handle it or produce better results. Preferably, also add tags for the specific methods used. Use this tag for meaningful preprocessing steps in a data pipeline, either prior to an algorithm or as a standalone method.

Data preprocessing applies at multiple stages of a dataset's life. At a higher level, it happens right before more meaningful processing steps such as analysis take place.
But preprocessing also starts when raw data is generated and must be brought into a meaningful and usable format. Currently the tag fits this lower level description better, likewise if the structure of how the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, is also a major part of it. For that prefer to use the tag and/or .

This tag should focus more on rearranging and transforming data so that it is usable by algorithms or improves their results. Examples of preprocessing are encoding data and scaling or normalizing an already formatted dataset.
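As a rough illustration of the scaling and encoding steps named above, here is a minimal sketch in plain Python. The function names (`min_max_scale`, `standardize`, `one_hot`) and the sample values are made up for this example; in practice scikit-learn's `MinMaxScaler`, `StandardScaler` and `OneHotEncoder` cover the same ground.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 indicator rows, one column per category."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows


# Hypothetical sample data, not taken from any question on this page.
ages = [20, 30, 40]
colors = ["red", "blue", "red"]

scaled = min_max_scale(ages)
standardized = standardize(ages)
categories, encoded = one_hot(colors)
```

Which of these transformations is appropriate depends on the algorithm that consumes the data, so this should be read as a starting point rather than a recipe.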

Preprocessing algorithms and techniques can be found in the scikit-learn modules Preprocessing and Normalization.

Further theory and examples of why data preprocessing is necessary are discussed in the section scikit-learn - Preprocessing data.

488 questions
-1
votes
3 answers

How to get index number of pandas data frame that contain only null values

Consider the data frame below. I want to extract the ids of this data frame that contain only null values. For example, id 1 has only null values. So the answer should be index 1. Can you please explain how to extract this? Please note that…
-1
votes
1 answer

How to extract multiple short video clips from a long video using python?

How can I extract multiple smaller video clips from a long video using a Python package? I need it as part of the video preprocessing for my project. ffmpeg is an option, but it's too complex. Any other method would be really helpful. I tried using…
-1
votes
1 answer

Split a column into multiple columns dynamically in Python or SQL

I'm trying to split the Details column into multiple columns using T-SQL or Python. The table looks like this: ID Details 15 Hotel:Campsite;Message:Reservation…
A H.
  • 41
  • 5
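For the column-splitting question above, one possible approach in Python is sketched below. It assumes the Details value is a list of `Key:Value` pairs separated by semicolons, which is what the visible part of the excerpt suggests; the helper name `split_details` and the single sample row are made up for illustration.

```python
def split_details(details):
    """Parse a 'Key:Value;Key:Value' string into a dict of column -> value."""
    result = {}
    for part in details.split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ';'
        # partition splits on the first ':' only, so values may contain ':'.
        key, _, value = part.partition(":")
        result[key.strip()] = value.strip()
    return result


# Sample row built only from the portion of the table visible in the excerpt.
rows = [{"ID": 15, "Details": "Hotel:Campsite;Message:Reservation"}]

# One output row per input row, with a column per key found in Details.
expanded = [{"ID": row["ID"], **split_details(row["Details"])} for row in rows]
```

Because the keys are discovered per row, rows with different keys simply produce different columns, which is the "dynamic" part of the problem.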
-1
votes
1 answer

Multi-class text classification problem with more than 2000 classes

I'm working on a project with more than 2000 classes and 230000 rows. The dataset consists of two columns, product name and category name. I applied NLP techniques to vectorize the texts and used a linear SVM to predict the category of the products and…
-1
votes
1 answer

error while removing the stop-words from the text

I am trying to remove stopwords from my data and I have used this statement to download the stopwords. stop = set(stopwords.words('english')) This has character 'd' as one of the stopwords. So, when I apply this to my function it is removing 'd'…
-1
votes
2 answers

How can I combine all the tokenized words into a sentence in a column?

How can I combine all the tokenized words into a sentence in a column? tokenized_word = ['really','smart','people'] in a sentence = really smart people
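A minimal sketch for the token-joining question above: `str.join` concatenates a list of tokens with a separator. The `tokenized_word` list is taken from the excerpt; the `column` variable is a hypothetical stand-in for a data frame column that holds one token list per row.

```python
# Token list from the question.
tokenized_word = ["really", "smart", "people"]

# Join the tokens with single spaces to rebuild the sentence.
sentence = " ".join(tokenized_word)

# Hypothetical column: one list of tokens per row.
column = [["really", "smart", "people"], ["hello", "world"]]

# Apply the same join to every row.
sentences = [" ".join(tokens) for tokens in column]
```

With pandas, the same join can be applied per row through `Series.apply(" ".join)`.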
-1
votes
1 answer

How can I calculate the difference between values in different rows of the same column in Python?

I am dealing with a dataset of Nifty 2019 which has only two columns - Date and Close. I want to find the days where it was volatile (high > 105% of low). I am trying to shift the values, store them in a different place, and assign them to a…
-1
votes
1 answer

In machine learning, should I remove original features after a feature combination?

If I made a new feature (c) using two existing features (a,b) like c = a*b or a+b, should I remove the two originals? (to avoid the duplication problem?) Please, help me bro..
-1
votes
1 answer

Merge the csv files in GCP

The dataset I am working on in GCP is in csv format, and for each feature there is a separate csv file with no header. There are around 20 files and I want to create a single file for all these variables with headers. However, I have access on the…
-1
votes
1 answer

If you have a numerical target of two classes 0 and 1 and all the features are numerical as well, should I encode the target?

I am working on a binary classification problem. My dataset contains numerical features, and the target is numerical as well, with two classes, 0 and 1. In this case, while preprocessing the dataset, should I go through the data…
sena
  • 1
  • 1
-1
votes
1 answer

Python how to correct a misaligned substring position info from string

I have a list of strings and the start offset and end offset of substrings that need to be used for training an NLP model. Some of these substring positions are misaligned. Eg: text = 'Car is blue' start_offset = 0 end_offset = 2 …
nifeco
  • 211
  • 1
  • 8
-1
votes
1 answer

Calculate mean value for the missing values

I need to calculate the mean of the neighboring values and use it to replace each NaN value, but I don't want to make my code more complicated. For example, I have 20 countries and 4 car types from 2010 to 2020, but there are some missing…
Jason
  • 1
  • 1
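For the missing-value question above, a minimal sketch in plain Python follows. It assumes "neighbor values" means the nearest non-missing value before and after each gap within one ordered series (for example, one country and car type across years); that reading, the function name `fill_with_neighbor_mean` and the sample numbers are assumptions, not details from the question.

```python
import math


def fill_with_neighbor_mean(values):
    """Replace each NaN with the mean of its nearest valid neighbors.

    Neighbors are looked up in the original series, so filled values are
    never reused as inputs. A NaN at either end uses its single valid
    neighbor; a series with no valid values is returned unchanged.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        # Nearest valid value to the left, if any.
        prev = next((values[j] for j in range(i - 1, -1, -1)
                     if not math.isnan(values[j])), None)
        # Nearest valid value to the right, if any.
        nxt = next((values[j] for j in range(i + 1, len(values))
                    if not math.isnan(values[j])), None)
        neighbors = [x for x in (prev, nxt) if x is not None]
        if neighbors:
            filled[i] = sum(neighbors) / len(neighbors)
    return filled


# Hypothetical yearly values for one group, with gaps.
nan = float("nan")
series = [10.0, nan, 14.0, nan, nan, 20.0]
result = fill_with_neighbor_mean(series)
```

For grouped data, the function would be applied once per group so that values from different countries or car types are never averaged together.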
-1
votes
1 answer

How to convert genre column to numerical value so that I can feed it to the neural network model?

In the image attached below, the genre column has multiple attributes for a single entry. I am trying to build a neural network model and for that I need to encode it. I am having problems regarding that.
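One common way to encode a column with several categories per entry, as in the genre question above, is a multi-hot vector: one 0/1 column per genre. The sketch below is plain Python and rests on assumptions, since the image from the question is not available: genres are taken to be stored as one string per row separated by `|`, and the function name `multi_hot` and the sample rows are made up. scikit-learn's `MultiLabelBinarizer` provides a comparable transformation.

```python
def multi_hot(rows, sep="|"):
    """Turn strings like 'Action|Comedy' into 0/1 vectors, one slot per genre."""
    # Split every row into its individual genres, dropping empty pieces.
    split_rows = [[g.strip() for g in row.split(sep) if g.strip()] for row in rows]
    # Fixed, sorted vocabulary so that column positions are reproducible.
    vocabulary = sorted({g for genres in split_rows for g in genres})
    index = {genre: i for i, genre in enumerate(vocabulary)}
    vectors = []
    for genres in split_rows:
        vec = [0] * len(vocabulary)
        for genre in genres:
            vec[index[genre]] = 1
        vectors.append(vec)
    return vocabulary, vectors


# Hypothetical genre column.
genre_column = ["Action|Comedy", "Drama", "Comedy|Drama"]
vocabulary, vectors = multi_hot(genre_column)
```

The resulting vectors all have the same length, so they can be concatenated with other numeric features before being fed to a network.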
-1
votes
1 answer

Subtracting a date in one column from previous unique date in another column in R

I have a dataframe as follows: ID DPREL Dt_biop 292 2012-06-11 2014-03-06 292 2013-01-10 2014-03-06 292 2015-05-21 2014-03-06 292 2017-09-05 2014-03-06 292 2012-06-11 2015-05-21 292 2012-09-07 …
Afshin
  • 3
  • 2
-1
votes
1 answer

Data pre-processing and feature engineering

I have been doing some reading on data pre-processing and feature engineering, including feature selection, feature importance and feature construction. My understanding is that feature engineering is applied in the data preprocessing stage. Additionally,…