Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
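As a rough illustration of the operations listed above, here is a minimal standard-library Python sketch. The records, the `-999` sentinel, the synonym table, and the function names are all hypothetical and chosen only for the example; outlier removal is left out to keep it short.

```python
# Hypothetical raw records in a tidy layout: one observation per row,
# one variable per column.
raw = [
    {"t": 0, "temp_c": "20.0", "city": "NYC"},
    {"t": 1, "temp_c": "",     "city": "New York"},
    {"t": 2, "temp_c": "22.0", "city": "new york"},
    {"t": 3, "temp_c": "-999", "city": "NYC"},
    {"t": 4, "temp_c": "23.0", "city": "Boston"},
]

# Assumed lookup table for merging synonymous spellings.
SYNONYMS = {"nyc": "New York", "new york": "New York", "boston": "Boston"}
SENTINEL = -999.0  # assumed code for an invalid reading


def clean(rows):
    """Mark invalid values as unavailable (None) and merge synonyms."""
    out = []
    for r in rows:
        text = r["temp_c"].strip()
        value = float(text) if text else None
        if value == SENTINEL:
            value = None
        city = SYNONYMS.get(r["city"].strip().lower(), r["city"])
        out.append({"t": r["t"], "temp_c": value, "city": city})
    return out


def interpolate(rows, key="temp_c"):
    """Linearly interpolate gaps that have a known value on both sides."""
    known = [i for i, r in enumerate(rows) if r[key] is not None]
    filled = [dict(r) for r in rows]
    for i, r in enumerate(rows):
        if r[key] is not None:
            continue
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            a, b = rows[before[-1]], rows[after[0]]
            frac = (r["t"] - a["t"]) / (b["t"] - a["t"])
            filled[i][key] = a[key] + frac * (b[key] - a[key])
    return filled


cleaned = clean(raw)
filled = interpolate(cleaned)
```

Keeping the "mark as unavailable" step separate from the interpolation step makes it easy to inspect which values were actually observed before any are filled in.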

3430 questions

1 answer

separating data from 1 column into 2

I have data columns in this format: Ardencaple Mince(SD-818-2146-04). I want to separate it into 2 new columns, Name and Code. I tried variations of the separate function but couldn't get the desired result. Any suggestions for a quick…

data-cleaning

asked Mar 30 '23 at 08:16

KellyWhite
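One possible way to split a value like the one quoted in this question, sketched in Python with a regular expression rather than the `separate` function the asker mentions. It assumes the code is always the final parenthesised group; `split_name_code` and `PATTERN` are hypothetical names.

```python
import re

# Assumption: the code is the last parenthesised group in the string.
PATTERN = re.compile(r"^\s*(?P<name>.*?)\s*\((?P<code>[^()]*)\)\s*$")


def split_name_code(value):
    """Return (name, code); code is None when no trailing '(...)' is found."""
    match = PATTERN.match(value)
    if match is None:
        return value.strip(), None
    return match.group("name"), match.group("code")


name, code = split_name_code("Ardencaple Mince(SD-818-2146-04)")
```

Values without a trailing parenthesised part are returned unchanged with `None` as the code, so malformed rows can be reviewed instead of being silently dropped.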

1 answer

Moving from data preprocessing to a model and hyperparameter tuning

I am new to machine learning and I am having trouble fitting a data set for a classification model. What I would like to know is: after preprocessing the data and fitting a model with just the default hyperparameters, how much performance can I…

asked Mar 29 '23 at 15:03

Inuka Ampavila

0 answers

Keeping equations while text cleaning

I have a JSON file like this: {"text": "ABCD (before the photon detections) becomes"}, {"text": "under the additional assumption that optimal distillation will occur for an initial symmetric state through symmetric local PS and squeezing operations,…

python text data-cleaning

asked Mar 29 '23 at 05:02

Elbek Keskinoglu

0 answers

Cleaning inconsistent date character strings in R

In one of my datasets, there is a "test date" variable that is HORRIBLY messy. Seriously, this cannot be overstated. It is a character string, and the field from which it is pulled is open-text with little to no uniformity. Unfortunately, I have…

r date data-cleaning

asked Mar 27 '23 at 17:00

mrk

2 answers

How can I process data with a lot of missing values because of different sources?

I have data that looks like this:

Timestamp                Variable A  Variable B  Variable C
2023-01-01 00:00:00.000  Value       Nan         Nan
2023-01-01 00:00:00.050  Value       Nan         Nan
2023-01-01 00:00:00.150  Nan         Value       Value
2023-01-01…

pandas machine-learning data-cleaning data-processing

asked Mar 27 '23 at 11:25

guir

2 answers

selecting the highest variable in a column per identifier

I am having a little difficulty with something which I am sure better R users will find very easy. I have a dataframe arranged with one identifier per row; however, some identifiers have two rows. This is because the surveys were asked twice, and I…

r dplyr data-cleaning

asked Mar 27 '23 at 08:58

DW1310
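The excerpt is truncated, so the exact requirement is not visible; going by the title alone, the task could be sketched as below. This is plain Python rather than R/dplyr, and the column names `id` and `score`, the sample rows, and `keep_highest` are hypothetical.

```python
# Hypothetical survey rows; identifier "a" appears twice.
rows = [
    {"id": "a", "score": 3},
    {"id": "a", "score": 7},
    {"id": "b", "score": 5},
]


def keep_highest(rows, id_key="id", value_key="score"):
    """Keep, for each identifier, the row with the largest value."""
    best = {}
    for row in rows:
        current = best.get(row[id_key])
        if current is None or row[value_key] > current[value_key]:
            best[row[id_key]] = row
    # dicts preserve insertion order, so identifiers keep first-seen order
    return list(best.values())


deduplicated = keep_highest(rows)
```

Ties keep the first row encountered, which may or may not be the desired behaviour for repeated surveys.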

0 answers

Imputing for missing values in Uniform distribution or flat shape column of a dataframe

I have a dataset that has 13 columns and 2000 rows. One of the columns has 184 NaN values. On checking the box plot, there are no outliers, and the column's data is uniformly distributed. My understanding is that the median and mode are not the correct…

data-cleaning

asked Mar 26 '23 at 18:47

saurabh

1 answer

How to treat and transform a STRING with very complex date format into DATETIME format using SQL (BigQuery)

Within Google BigQuery (SQL) I have a table called Sales, with a column in STRING format called data_pt_filtro that has dates, but this column is all dirty and needs treatment. Below are some samples I took from this column: 1/1/2019 - 9:5 20/2/2019…

sql date datetime google-bigquery data-cleaning

asked Mar 25 '23 at 13:54

João Pedro Reis Silva

0 answers

Fuzzy matching with bespoke function gives incorrect length of the output

In the code below, I have a function called correct_admin_names that takes in main_data, shapefile_data, and the variable they have in common, var_by. The goal of this function is to correct the variable they have in common in main_data by matching…

r tidyverse data-cleaning matching fuzzy

asked Mar 23 '23 at 04:22

Mohamed Yusuf

1 answer

Is there a way to flag if there is a missing column or columns in between non-missing columns in SAS?

I tried different approaches and searched Google for a way to deal with this, but was unable to accomplish it. Currently, my data looks like the below: data Have; input col1 $ col2 $ col3 $ col4 $ col5 $ col6 $ col7 $ col8 $ col9 $ col10…

sas data-cleaning

asked Mar 21 '23 at 15:54

Sai Paritala
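The check described in this question's title can be sketched in Python (not SAS) as follows. It assumes a row is a list of column values, that missing values appear as `None`, an empty string, or `"."`, and that leading or trailing missing values do not count as a gap; `has_interior_gap` and the sample rows are hypothetical.

```python
def has_interior_gap(values, missing=(None, "", ".")):
    """True if a missing value sits between two non-missing values."""
    present = [i for i, v in enumerate(values) if v not in missing]
    if len(present) < 2:
        return False
    # With no interior gap, the non-missing positions are contiguous.
    return present[-1] - present[0] + 1 != len(present)


# Hypothetical rows for illustration.
gap_row = ["PM", "MM", ".", ".", "PM", "."]
no_gap_row = ["PM", "MM", "JM", ".", "."]

flags = [has_interior_gap(gap_row), has_interior_gap(no_gap_row)]
```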

2 answers

Standardize mixed datetime format in pandas dataframe that includes strings

I have a dataset with mixed datetime formats and strings in the date columns. I am trying to standardize the date in the columns to a regular datetime format. I tried combining these solutions (Clean a Messy Date Column with Mixed Formats in…

python pandas numpy datetime data-cleaning

asked Mar 18 '23 at 15:29

user21407177

2 answers

How to find and flag if there are values after a 'certain value' horizontally

data Have;
input col1 $ col2 $ col3 $ col4 $ col5 $ col6 $ col7 $ col8 $ col9 $ col10 $;
cards;
PM MM JM MM PM PB . . PM .
PM MM JM MM PM PB JM . . .
PM MM JM MM PM MB PM MM . .
PM MM JM MM PM PM MM MB . .
PM MM JM MM PM PM MM PB . .
;

Hello all, I…

sas data-cleaning

asked Mar 17 '23 at 15:52

Sai Paritala

0 answers

How to get back categorical data after imputing using LabelEncoder + Iterative Imputer?

I am trying to impute missing values for a categorical column of data. I have successfully imputed them, but now I want to change them back to categorical. How can I do that? I have used LabelEncoder and IterativeImputer. I have done this: import numpy…

machine-learning data-science data-cleaning missing-data imputation

asked Mar 17 '23 at 10:47

Saideva Sathvik Ravula

0 answers

Remove Constant Columns in data frame using R

I have a df with 160 columns. This is a sample of the type of df: df = data.frame (id= c('u1', 'u1', 'u1', 'u2', 'u2'), var1= c('e1','e2','e3','e3','e4'), var2= c('a1','a1','a1','a1','a1'), …

r dataframe dplyr tidyr data-cleaning

asked Mar 17 '23 at 00:34

danny
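A minimal Python sketch (not R) of dropping constant columns, using only the three columns visible in the excerpt. It assumes "constant" means a single distinct value across the whole table rather than within each `id`; the dict-of-lists layout and `drop_constant_columns` are hypothetical.

```python
# The three columns visible in the excerpt; the remaining columns are elided there.
table = {
    "id":   ["u1", "u1", "u1", "u2", "u2"],
    "var1": ["e1", "e2", "e3", "e3", "e4"],
    "var2": ["a1", "a1", "a1", "a1", "a1"],
}


def drop_constant_columns(columns):
    """Keep only the columns that contain more than one distinct value."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}


reduced = drop_constant_columns(table)
```

Here `var2` is dropped because every row holds the same value, while `id` and `var1` are kept.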

2 answers

Is there a way in SAS to flag records that have values after a 'certain value' across columns?

data have;
length col1-col10 $2.;
input col1-col10;
datalines;
PM MM JM MM PM MM MM PB . .
MM JM PM MB . . . . . .
MM MM PM JM PB MM . . . .
PM PM PM MM MB MM JM . . .
;
run;

Goal: Hi all, My goal here is that I want to flag all…

sas data-cleaning

asked Mar 16 '23 at 16:30

Sai Paritala
