Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used in computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, in which each row is an observation and each column is a variable.

3430 questions

0 answers

Cleaning Unstructured PDF data

Raw Data: Given is a PDF containing the student placement details of a university. It is in a completely unstructured form and needs to be cleaned up before processing. The Expected CSV file output: I tried importing the PDF from inside an…

asked May 17 '23 at 14:41

gurukishoreg78

0 answers

How to measure data quality?

I have a question regarding data quality. I am aware of the data quality dimensions, but I'd like to be able to measure data quality in numbers. For example, how many NULLs are acceptable in a column, e.g. 2%, 5%, 10%? I know every data set is…

null data-cleaning data-quality

asked May 16 '23 at 11:39

ninelondon
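
For the data-quality question above, one quantity that can be expressed as a number is the share of missing values per column. The following is a minimal sketch in plain Python, assuming rows are dicts and that `None` (or an absent key) marks a missing value. The `null_fraction` helper, the sample rows, and the 5% limit are all hypothetical; what counts as acceptable depends on the data set.

```python
def null_fraction(rows, column):
    """Return the fraction of rows whose value in `column` is None or absent."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


# Hypothetical sample data, invented for illustration.
rows = [
    {"id": 1, "score": 10},
    {"id": 2, "score": None},
    {"id": 3, "score": 7},
    {"id": 4},  # "score" is absent here, which is counted as missing
]

THRESHOLD = 0.05  # assumed acceptance limit, not a standard value
fraction = null_fraction(rows, "score")
exceeds = fraction > THRESHOLD
```

Reporting the fraction per column and comparing it with an agreed limit keeps the measurement separate from the judgement about what is acceptable.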

1 answer

How to save the changes after a loop in python?

for i in data['test preparation course']: if i == 'none': i = None Here, I'm trying to replace the string 'none' with None in Python, and it went well. I just want to apply the changes to the dataset

python pandas data-science data-analysis data-cleaning

asked May 13 '23 at 19:48

W.xtar777
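
For the loop question above, the likely reason the changes do not persist is that assigning to the loop variable only rebinds a local name and never writes back to the container. The sketch below uses a plain Python list as a stand-in for the pandas column (an assumption made to keep the example self-contained; the values are invented). The same principle applies to a DataFrame column: the result has to be assigned back.

```python
# Hypothetical stand-in for the column; the values are invented.
column = ["none", "completed", "none", "completed"]

# Rebinding the loop variable leaves the list untouched.
for value in column:
    if value == "none":
        value = None

unchanged = list(column)  # still contains the "none" strings

# Building a new list and assigning it back is what keeps the change.
column = [None if value == "none" else value for value in column]
```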

1 answer

Cannot Create xts Object from Data.Frame that has Properly Formatted Date Index

frame that should be properly formatted, with a date index that's properly formatted to be an xts object. However, I cannot complete the conversion to an xts object and receive the following error: Error in xts(master_zillow5) : order.by requires an…

r datetime data-cleaning xts

asked May 12 '23 at 13:55

js80

1 answer

Coerce Matrix Character Values to Numeric Values

I have a matrix of historical home values that I'm analyzing, and after I create a matrix by transposing the object, I cannot convert the matrix's values to numeric rather than character. I also want to preserve the row and column indices. …

r dplyr data-cleaning

asked May 11 '23 at 23:05

js80

0 answers

Handling large gaps for time series forecasting (TFT model)

I have hourly time series data that contains both short and long gaps. For short gaps I could use linear interpolation to fill the missing points, but I would like to learn the best practices for filling the long gaps. My…

time-series data-cleaning forecasting data-preprocessing

asked May 11 '23 at 15:23

Anita
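
For the time series question above, a first step is to separate short gaps from long ones, so that only the short gaps are filled by linear interpolation. The following is a minimal sketch in plain Python, assuming evenly spaced samples with `None` marking a missing point. The `interpolate_short_gaps` helper and the `max_gap` parameter are hypothetical, and the sketch deliberately leaves long gaps unfilled rather than suggesting how to fill them.

```python
def interpolate_short_gaps(values, max_gap):
    """Linearly fill runs of None no longer than max_gap.

    Longer runs, and runs touching the start or end of the series,
    are left as None because they lack a usable neighbour or exceed the limit.
    """
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i
        while i < n and result[i] is None:
            i += 1
        end = i  # first index after the gap
        gap_len = end - start
        if start == 0 or end == n or gap_len > max_gap:
            continue
        left, right = result[start - 1], result[end]
        step = (right - left) / (gap_len + 1)
        for k in range(gap_len):
            result[start + k] = left + step * (k + 1)
    return result


# Invented sample: one single-point gap and one three-point gap.
series = [1.0, None, 3.0, None, None, None, 7.0]
filled = interpolate_short_gaps(series, max_gap=1)
```

With `max_gap=1` only the single missing point is filled; the three-point gap stays marked as missing so it can be handled by a different method.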

1 answer

Remove unwanted symbols using regex in dictionary

I have a dataset that contains a list for each row. I want to remove unwanted symbols from my dataset and replace the data that contains [] with none. But the code is not working. Here is the code for the data cleaning: def clean_data(data): …

python pandas dictionary data-cleaning

asked May 11 '23 at 02:05

leeteerah

1 answer

Import DataFrame XLS with variable structure with Python

A couple of days ago I received a data set that is somewhat difficult to deal with. The only fixed things I see in it are that the records always start in row 9 and the column names are in row 7. As the picture shows…

python pandas loops data-cleaning data-conversion

asked May 04 '23 at 02:08

danny

2 answers

Remove specific words from a column r

I'm looking to remove specific words (for example "co" "INC" etc) from a column in data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free standing. This is a…

r string data-cleaning

asked May 02 '23 at 23:16

bear_525
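
For the question above on removing free-standing words, the usual tool is a regular expression with word boundaries. The sketch below shows the idea in Python rather than R, using the standard `re` module. The `remove_words` helper and the sample strings are hypothetical, and case-insensitive matching is an assumption, since the excerpt does not say how case should be treated.

```python
import re


def remove_words(text, words):
    """Remove each word in `words` only where it stands alone as a whole word."""
    # \b anchors the match at word boundaries, so "co" inside "Costco" is kept.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(cleaned.split())


# Invented sample strings.
plain = remove_words("Acme Co INC", ["co", "inc"])
embedded = remove_words("Costco Wholesale Co", ["co", "inc"])
```

Note that punctuation attached to a removed word, such as a trailing period, is not removed by this pattern.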

1 answer

R group by with each grouped element associated with most common factor

I want to group by column a and choose the most common factor b for each unique a. For example: tibble(a = c(1,1,1,2,2,2), b = factor(c('cat', 'dog', 'cat', 'cat', 'dog', 'dog'))) %>% reframe(b = most_common(b), .by = a) I want this to…

r dplyr group-by data-cleaning r-factor

asked May 01 '23 at 21:00

at.
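
For the group-by question above, the operation amounts to taking the mode of b within each value of a. The sketch below shows that idea in plain Python rather than dplyr, reusing the six (a, b) pairs from the excerpt. The `most_common_by_group` helper is hypothetical, and ties are resolved in favour of the value seen first, which is an assumption about the desired behaviour.

```python
from collections import Counter, defaultdict


def most_common_by_group(pairs):
    """Map each key to the most frequent value observed for it."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Counter.most_common(1) returns [(value, count)] for the top value.
    return {key: Counter(values).most_common(1)[0][0] for key, values in groups.items()}


a = [1, 1, 1, 2, 2, 2]
b = ["cat", "dog", "cat", "cat", "dog", "dog"]
result = most_common_by_group(zip(a, b))
```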

3 answers

How do I drop specific NaN values while keeping others based on a pattern of NaNs in the dataframe in Pandas?

I have a dataframe like this: The pattern it follows is that if the first n rows of Column A are populated, then the next n rows will be "populated" (populated cells could have NaN values as well, but I call them "populated" because I'd like to…

python pandas dataframe nan data-cleaning

asked May 01 '23 at 09:44

A OP

2 answers

Using for loop and mutate to create variables

I have a dataset that has a column for the type and a column for the amount. I'm trying to clean the data so that it is one column for each possible type with the amount as the value in that column. My data looks like this: df <- data.frame( …

r data-cleaning

asked Apr 28 '23 at 18:32

uncertaincyclist

1 answer

Keep the first 4 words in a column

I'm trying to keep only the first 4 words of a column in my data, while still keeping the observations that have fewer than 4 words. This is a sample of what some of the data looks like. State Company Number of workers X FAIRFIELD…

r string data-cleaning stringr

asked Apr 27 '23 at 20:59

bear_525
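
For the question above on keeping the first 4 words, splitting on whitespace and slicing handles both cases at once, because a slice past the end of a list simply returns what is there. The sketch below shows the idea in Python rather than R. The `first_n_words` helper and the sample strings are hypothetical, and runs of whitespace are collapsed to single spaces as a side effect.

```python
def first_n_words(text, n=4):
    """Keep at most the first n whitespace-separated words of text."""
    return " ".join(text.split()[:n])


# Invented sample strings.
long_name = first_n_words("one two three four five six")
short_name = first_n_words("one two")
```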

0 answers

Filtering a data frame containing worldwide data and filtering only US states?

I have a dataframe that contains a State column with the names of states and n date columns for confirmed COVID cases. How can I filter the data frame so that I only have 50 states, the US territories, the District of Columbia, and the US…

data-manipulation data-cleaning data-preprocessing data-filtering

asked Apr 25 '23 at 17:31

KuroiJukai

1 answer

Variable selection in big data

I am trying to build a regression model for big data with 220 variables. The 220 variables are binary, taking the values zero and one. Some variables are correlated (though not highly). Also, some of the variables have 60% or more of their…

machine-learning data-cleaning variable-selection

asked Apr 25 '23 at 16:26

Soodi
