Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it is usable at all, to transforming data so that algorithms can handle it or produce better results. Where possible, also add tags for the specific methods involved. Use this tag for meaningful preprocessing steps in a data pipeline, whether they run before an algorithm or stand alone.

Data preprocessing applies at several stages of a dataset's life. At a higher level, it happens right before more meaningful processing steps such as analysis take place.
But preprocessing also starts as soon as raw data is generated and must be brought into a meaningful, usable format. Currently, another tag fits this lower-level description better, as it does when the structure in which the data is stored and queried matters. Finding errors and missing values and deciding how to handle them is also a major part of preprocessing. For those tasks, prefer the more specific tags.

This tag should focus on rearranging and transforming data so that algorithms can use it or produce better results. Examples of preprocessing include encoding data and scaling or normalizing an already formatted dataset.
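As a rough illustration of the two kinds of transformation named above, the following sketch rescales a numeric column to the range [0, 1] and one-hot encodes a categorical column. It uses only the Python standard library; the function names `min_max_scale` and `one_hot_encode` and the sample values are hypothetical and chosen for this example. In practice, libraries such as scikit-learn provide ready-made equivalents.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(labels):
    """Return the sorted categories and one 0/1 row per input label."""
    categories = sorted(set(labels))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[position[label]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sample data, for illustration only.
scaled = min_max_scale([100, 150, 200])
categories, encoded = one_hot_encode(["nail", "fragrance", "nail"])
print(scaled)
print(categories)
print(encoded)
```

Sorting the categories keeps the column order stable between runs, which matters when the same encoding has to be reapplied to new data.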

Preprocessing algorithms and techniques can be found in the scikit-learn modules Preprocessing and Normalization.

Further theory and examples of why data preprocessing is necessary are discussed in the section scikit-learn - Preprocessing data.

488 questions
3
votes
2 answers

Convert dataframe column string values into dummy variable columns

I have the following dataframe (excluded rest of columns): | customer_id | department | | ----------- | ----------------------------- | | 11 | ['nail', 'men_skincare'] | | 23 | ['nail', 'fragrance'] …
3
votes
1 answer

One hot coding in Train Validation and Test set (Production data)

For example I have below train set. name values 0 Tony 100 1 Smith 110 2 Sam 120 3 Shane 130 4 Sam 140 5 Ram 160 After one hot encoding it becomes values 0 1 2 3 4 0 100 1 …
3
votes
2 answers

Python Pandas: Drop rows from data frame if list of string value == [none]

I have a column in my data frame that contains lists of values. Tags [marvel, comics, comic, books, nerdy] [new, snapchat, version, snap, inc] [none] [new, york, times, ny, times, nyt, times] [today, show, today, show, today] [none] [mark,…
Amal Nasir
  • 164
  • 15
2
votes
1 answer

Machine learning: working with array of objects in preprocessing

We have been facing a problem in preprocessing for our project that some columns contain an array of objects ( dictionaries ) like that Column A Column B movie1 [{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "zh", "name":…
2
votes
1 answer

How can I train GPT-3 with my own company data using OpenAI's API?

I want to train GPT-3 with my company's data to perform specific NLP tasks using OpenAI's API. How can I train the GPT-3 model with my own data? What kind of data preprocessing do I need to perform before training the model? Are there any Python…
jazz
  • 31
  • 6
2
votes
1 answer

Normalize a time stamp data

I have a large set of data stored as a numeric data type that encodes time in 24-hour format as HHMM. Since the data type is numeric, the leading zeroes are absent. A sample of the data can be found here: >…
driver
  • 273
  • 1
  • 13
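One possible way to handle numeric HHMM values that have lost their leading zeroes, sketched here with the standard library only, is to split each number into hours and minutes with integer division. The helper names `hhmm_to_minutes` and `hhmm_to_string` are hypothetical, and the sketch assumes every value is a whole number between 0 and 2359; it is not taken from the question or its answer.

```python
def hhmm_to_minutes(value):
    """Convert a numeric HHMM value (e.g. 930 for 09:30) to minutes since midnight."""
    hours, minutes = divmod(int(value), 100)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid HHMM time: {value!r}")
    return hours * 60 + minutes


def hhmm_to_string(value):
    """Restore the leading zeroes and format the value as 'HH:MM'."""
    hhmm_to_minutes(value)  # reuse the range check above
    text = str(int(value)).zfill(4)
    return f"{text[:2]}:{text[2:]}"


# Hypothetical sample values: 00:05, 09:30 and 23:59.
for raw in (5, 930, 2359):
    print(raw, hhmm_to_string(raw), hhmm_to_minutes(raw))
```

Minutes since midnight give a single evenly spaced numeric feature, whereas the raw HHMM number jumps by 41 between 959 and 1000.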
2
votes
1 answer

Processing multiple columns in the dataset into one column for modeling

I want to predict spatio-temporal data and I found STNN (Spatio Temporal Neural Network) research with the github repository here (https://github.com/edouardelasalles/stnn), at the end of the repo description, it is explained regarding the dataset…
2
votes
1 answer

Preprocessing layers with seed not producing the same data augmentation for images and masks

I'm trying to create a simple preprocessing augmentation layer, following this Tensorflow tutorial. I created this 'simple' example that shows the problem I'm having. Even though I'm initializing the augmentation class with a seed, operations…
2
votes
2 answers

Calculate by how much a row has shifted horizontally in pandas dataframe

I have a dataframe where the rows have been shifted horizontally by an unknown amount. Each and every row has shifted by a different amount as shown below: Heading 1 Heading 2 Unnamed: 1 Unnamed:…
2
votes
1 answer

How to fill null values of a feature present in polars dataframe with median values of the feature?

I'm a pandas user, but because of the advantages of polars dataframes over pandas, I tried switching to polars. When I switched, I ran into the problem of not knowing how to fill the null values of a feature with its median values based on…
RKCH
  • 219
  • 3
  • 9
2
votes
3 answers

pandas function to check if there exist non-NA values for the same ids?

Assume I have a dataset that contains around 100 000 rows and 50 columns. I have information about the sellers and their products. The part of the dataset will look somehow like…
Stuck
  • 45
  • 5
2
votes
0 answers

How to preprocess data for Kalman filter

I am reading about Kalman filter techniques and thinking about how to use them, but I am not sure I understand the whole process of using the measured data in the Kalman data-step. Let's assume that you have an accelerometer and you want to estimate…
Marxwil
  • 21
  • 3
2
votes
2 answers

Preprocessing of rows of a DataFrame by numeric characters of specified size

Given the following Python pandas DataFrame: NAME NUM_OWNERS NUM_DOCS NUM_RESIDENTS Total 23900137 21028886 44571130.0 Macael-04062 366607 …
Carola
  • 366
  • 4
  • 18
2
votes
1 answer

is there a way, to count all rows which contain at least one '1' in a dataframe checking multiple named columns?

I have a dataset filled with Medicare beneficiaries. The question is: 'What proportion of patients have at least one of the chronic conditions described in the independent variables alzheimers, arthritis, cancer, copd, depression, diabetes,…
musti
  • 23
  • 6
2
votes
1 answer

How to pre-process month abbreviation for modelling

Given the below column: col 0 NaN 1 Jan,Apr,Jul,Oct 2 Jan,Jun,Jul 3 Apr,May,Oct,Nov 4 NaN How to convert the month abbreviation into integer data that can be fed to the model?
lima0
  • 111
  • 6
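A minimal sketch of one way to turn comma-separated month abbreviations like those in the question above into integers, using only the standard library. The names `months_to_numbers` and `months_to_multi_hot` are hypothetical, and treating a missing cell (`None` or NaN) as "no months" is an assumption of this example rather than something the question specifies.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}


def months_to_numbers(cell):
    """Map 'Jan,Apr' to [1, 4]; a missing cell (None or NaN) becomes []."""
    if cell is None or cell != cell:  # NaN is the only value not equal to itself
        return []
    return [MONTH_NUMBER[part.strip()] for part in cell.split(",")]


def months_to_multi_hot(cell):
    """Return 12 zeros and ones, one per calendar month, for a model to consume."""
    present = set(months_to_numbers(cell))
    return [1 if month in present else 0 for month in range(1, 13)]


# The column from the question, with NaN written as float("nan").
column = [float("nan"), "Jan,Apr,Jul,Oct", "Jan,Jun,Jul", "Apr,May,Oct,Nov", float("nan")]
for cell in column:
    print(months_to_numbers(cell), months_to_multi_hot(cell))
```

The multi-hot form gives every row the same fixed length, which most models require, while the list of month numbers is easier to read and inspect.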