Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors, and normalizing data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
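The four operations named above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample readings, the `-999` sentinel, the median-absolute-deviation outlier rule with `k=5.0`, and the synonym table are all hypothetical choices, not part of any particular library or of the tag definition.

```python
import statistics

# Hypothetical raw readings: None is a missing sample, -999 is an "invalid" sentinel.
raw = [10.0, 11.0, None, 13.0, -999, 12.0, 95.0, 11.5]

def mark_invalid(values, sentinel=-999):
    """Mark invalid values as unavailable (None)."""
    return [None if v == sentinel else v for v in values]

def drop_outliers(values, k=5.0):
    """Replace values more than k median absolute deviations from the median with None."""
    present = [v for v in values if v is not None]
    center = statistics.median(present)
    mad = statistics.median([abs(v - center) for v in present])
    if mad == 0:
        return list(values)  # no spread, so nothing can be judged an outlier
    return [None if v is not None and abs(v - center) > k * mad else v
            for v in values]

def interpolate(values):
    """Linearly fill None gaps that have a known value on both sides."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (out[right] - out[left])
    return out

def merge_synonyms(labels, mapping):
    """Map synonymous spellings onto one canonical label."""
    return [mapping.get(s.strip().lower(), s.strip()) for s in labels]

cleaned = interpolate(drop_outliers(mark_invalid(raw)))
print(cleaned)  # [10.0, 11.0, 12.0, 13.0, 12.5, 12.0, 11.75, 11.5]

# Hypothetical synonym table for a country column.
SYNONYMS = {"usa": "US", "u.s.": "US", "united states": "US"}
merged = merge_synonyms(["USA", "u.s.", " United States ", "Canada"], SYNONYMS)
print(merged)  # ['US', 'US', 'US', 'Canada']
```

The order of the steps matters in this sketch: invalid values are marked first so that they do not distort the outlier statistics, and interpolation runs last so that it fills every gap the earlier steps created.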

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
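As a rough illustration of the row-per-observation idea, the sketch below reshapes a small "wide" table into a long one in plain Python. The table contents and the `to_tidy` helper are hypothetical and only meant to show the shape of the result; they are not taken from the paper.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
wide = [
    {"patient": "A", "treatment_a": 5, "treatment_b": 7},
    {"patient": "B", "treatment_a": 3, "treatment_b": None},
]

def to_tidy(rows, id_key, value_keys, var_name, value_name):
    """Emit one output row per (id, variable) pair found in the wide rows."""
    tidy_rows = []
    for row in rows:
        for key in value_keys:
            tidy_rows.append({
                id_key: row[id_key],
                var_name: key,         # the former column name becomes a value
                value_name: row[key],  # the former cell becomes the measurement
            })
    return tidy_rows

tidy = to_tidy(wide, "patient", ["treatment_a", "treatment_b"],
               var_name="treatment", value_name="result")
for row in tidy:
    print(row)
```

Each output row now describes a single observation (one patient under one treatment), and each key (`patient`, `treatment`, `result`) is a single variable.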

3430 questions
0
votes
1 answer

separating data from 1 column into 2

I have data columns in this format: Ardencaple Mince(SD-818-2146-04). I want to separate it out into 2 new columns, Name and code. I tried variations of the separate function but couldn't get the desired result. Any suggestions for a quick…
0
votes
1 answer

Moving from data preprocessing to a model and hyper parameter tuning

I am new to machine learning and I am having trouble with fitting a data set for a classification model. What I would like to know is, after preprocessing data and fitting a model with just default hyperparameters, how much performance can I…
0
votes
0 answers

Keeping equations while text cleaning

I have a JSON file like this: {"text": "ABCD (before the photon detections) becomes"}, {"text": "under the additional assumption that optimal distillation will occur for an initial symmetric state through symmetric local PS and squeezing operations,…
0
votes
0 answers

Cleaning inconsistent date character strings in R

In one of my datasets, there is a "test date" variable that is HORRIBLY messy. Seriously, this cannot be overstated. It is a character string, and the field from which it is pulled is open-text with little to no uniformity. Unfortunately, I have…
mrk
  • 1
  • 1
0
votes
2 answers

How can I process data with a lot of missing values because of different sources?

I have data that looks like this:

Timestamp                Variable A  Variable B  Variable C
2023-01-01 00:00:00.000  Value       Nan         Nan
2023-01-01 00:00:00.050  Value       Nan         Nan
2023-01-01 00:00:00.150  Nan         Value       Value
2023-01-01…
0
votes
2 answers

selecting the highest variable in a column per identifier

I am having a little difficulty with something which I am sure better R users will find very easy. I have a dataframe arranged by identifiers on each row, however some have two rows per identifier. This is because the surveys were asked twice, and I…
DW1310
  • 147
  • 7
0
votes
0 answers

Imputing for missing values in Uniform distribution or flat shape column of a dataframe

I have a dataset that has 13 columns and 2000 rows. One of the columns has 184 NaN values. On checking the box plot, there are no outliers, and the data in the column is uniformly distributed. My understanding is that the median and mode are not the correct…
saurabh
  • 71
  • 1
  • 4
0
votes
1 answer

How to treat and transform a STRING with very complex date format into DATETIME format using SQL (BigQuery)

Within Google BigQuery (SQL) I have a table called Sales, with a column in STRING format called data_pt_filtro that has dates, but this column is all dirty and needs treatment. Below are some samples I took from this column: 1/1/2019 - 9:5 20/2/2019…
0
votes
0 answers

Fuzzy matching with bespoke function gives incorrect length of the output

In the code below, I have a function called correct_admin_names that takes in main_data, shapefile_data, and the variable they have in common, var_by. The goal of this function is to correct the variable they have in common in main_data by matching…
Mohamed Yusuf
  • 390
  • 1
  • 11
0
votes
1 answer

Is there way to flag if there is a missing column or columns in between non-missing columns in SAS?

I tried different ways and searched on Google to come up with a way to deal with this, but was unable to accomplish it. Currently, my data looks like the below: data Have; input col1 $ col2 $ col3 $ col4 $ col5 $ col6 $ col7 $ col8 $ col9 $ col10…
0
votes
2 answers

Standardize mixed datetime format in pandas dataframe that includes strings

I have a dataset with mixed datetime formats and strings in the date columns. I am trying to standardize the date in the columns to a regular datetime format. I tried combining these solutions (Clean a Messy Date Column with Mixed Formats in…
user21407177
0
votes
2 answers

How to find and flag if there are values after a 'certain value' horizontally

data Have;
  input col1 $ col2 $ col3 $ col4 $ col5 $ col6 $ col7 $ col8 $ col9 $ col10 $;
  cards;
PM MM JM MM PM PB . . PM .
PM MM JM MM PM PB JM . . .
PM MM JM MM PM MB PM MM . .
PM MM JM MM PM PM MM MB . .
PM MM JM MM PM PM MM PB . .
;

Hello all, I…
0
votes
0 answers

How to get back categorical data after imputing using LabelEncoder + Iterative Imputer?

I am trying to impute missing values for a categorical column of data. I have successfully imputed them, but now I want to change them back to categorical. How do I do that? I have used LabelEncoder and IterativeImputer. I have done this: import numpy…
0
votes
0 answers

Remove Constants Columns in data frame using R

I have a df with 160 columns. This is a sample of the type of df:

df = data.frame (id= c('u1', 'u1', 'u1', 'u2', 'u2'),
                 var1= c('e1','e2','e3','e3','e4'),
                 var2= c('a1','a1','a1','a1','a1'), …
danny
  • 45
  • 4
0
votes
2 answers

Is there a way in SAS to flag records that have values after a 'certain value' across columns?

data have;
  length col1-col10 $2.;
  input col1-col10;
  datalines;
PM MM JM MM PM MM MM PB . .
MM JM PM MB . . . . . .
MM MM PM JM PB MM . . . .
PM PM PM MM MB MM JM . . .
;
run;

Goal: Hi all, My goal here is that I want to flag all…