Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, in which each row is an observation and each column is a variable.
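
As a rough illustration of the tidy layout, the sketch below reshapes a small wide table into one row per observation. The helper name `melt`, the sample cities, and the counts are all made up for the example; none of them come from the tag description.

```python
def melt(rows, id_key, var_name, value_name):
    """Turn one wide row per subject into one tidy row per observation."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Hypothetical wide table: the year columns hold values of a "year" variable.
wide = [
    {"city": "Oslo", "2019": 10, "2020": 12},
    {"city": "Lima", "2019": 7, "2020": 9},
]

# After reshaping, every row is one (city, year) observation.
tidy = melt(wide, id_key="city", var_name="year", value_name="count")
```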

3430 questions
0
votes
0 answers

Cleaning Unstructured PDF data

Raw data: I am given a PDF containing the student placement details of a university. It is completely unstructured and needs to be cleaned up before processing. Expected CSV file output: I tried importing the PDF from inside an…
0
votes
0 answers

How to measure data quality?

I have a question regarding data quality. I am aware of the data quality dimensions, but I'd like to be able to measure data quality in numbers. For example, what share of NULLs is acceptable in a column: 2%, 5%, 10%, etc.? I know every data set is…
ninelondon
  • 97
  • 6
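
One way to put a number on completeness is the share of missing values per column, compared against a threshold chosen for the data set. The sketch below assumes rows are dictionaries and that `None` stands in for NULL; the sample rows and the 5% threshold are illustrative only, not a recommended standard.

```python
def null_share(rows, columns):
    """Return the fraction of missing (None) values per column.

    Assumes rows is a non-empty list of dictionaries.
    """
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical sample rows; None stands in for a database NULL.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": None},
    {"id": 4, "email": "d@example.com"},
]
shares = null_share(rows, ["id", "email"])

# Hypothetical threshold: flag columns with more than 5% missing values.
THRESHOLD = 0.05
flagged = [col for col, share in shares.items() if share > THRESHOLD]
```

Which threshold is acceptable depends on how the column is used downstream, so the value above is a placeholder rather than guidance.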
0
votes
1 answer

How to save the changes after a loop in python?

for i in data['test preparation course']: if i == 'none': i = None Here, I'm trying to replace the string 'none' with None values in Python, and it went well. I just want to apply the changes to the dataset.
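
A minimal sketch of why the loop has no lasting effect, assuming `data` is a pandas DataFrame (the excerpt does not say so explicitly). The sample values are invented, and `Series.mask` is one possible way to write the change back; it stores NaN, the pandas missing-value marker, rather than a literal `None`.

```python
import pandas as pd

# Hypothetical sample standing in for the asker's dataset.
data = pd.DataFrame({"test preparation course": ["none", "completed", "none"]})

# Rebinding the loop variable changes only the local name, not the column.
for i in data["test preparation course"]:
    if i == "none":
        i = None
unchanged = data["test preparation course"].tolist()

# Assigning the result back to the DataFrame is what persists the change.
col = data["test preparation course"]
data["test preparation course"] = col.mask(col == "none")
```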
0
votes
1 answer

Cannot Create xts Object from Data.Frame that has Properly Formatted Date Index

I have a data.frame that should be properly formatted, with a date index that's properly formatted for an xts object. However, I cannot complete the conversion to an xts object and receive the following error: Error in xts(master_zillow5) : order.by requires an…
js80
  • 385
  • 2
  • 11
0
votes
1 answer

Coerce Matrix Character Values to Numeric Values

I have a matrix of historical home values that I'm analyzing, and after I create a matrix by transposing the object, I cannot convert the matrix's values from character to numeric. I also want to preserve the row and column indices. …
js80
  • 385
  • 2
  • 11
0
votes
0 answers

Handling large gaps for time series forecasting (TFT model)

I have hourly time series data that contains both short and long gaps. For short gaps I can use linear interpolation to fill the missing points, but I would like to learn the best practices for filling the long gaps. My…
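
The short-gap half of this can be sketched with pandas: interpolate linearly, then restore NaN wherever the run of missing points is longer than a chosen limit, so long gaps stay visible for separate handling. The function name `interpolate_short_gaps`, the `max_gap` parameter, and the sample series are assumptions for illustration; this does not address how the long gaps should be filled.

```python
import numpy as np
import pandas as pd


def interpolate_short_gaps(series: pd.Series, max_gap: int) -> pd.Series:
    """Linearly interpolate only runs of NaN that are at most max_gap long."""
    is_nan = series.isna()
    # Give every run of consecutive missing (or present) points its own id.
    run_id = (is_nan != is_nan.shift()).cumsum()
    # Length of the run that each point belongs to.
    run_len = is_nan.groupby(run_id).transform("size")
    # Interpolate everything between valid points, then undo the long gaps.
    filled = series.interpolate(method="linear", limit_area="inside")
    long_gap = is_nan & (run_len > max_gap)
    return filled.mask(long_gap)


# Hypothetical hourly series with a 1-point gap and a 4-point gap.
index = pd.date_range("2023-01-01", periods=8, freq="h")
hourly = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, np.nan, 8.0], index=index)
result = interpolate_short_gaps(hourly, max_gap=2)
```

Calling `interpolate` with only a `limit` would partially fill long gaps as well, which is why the run lengths are computed explicitly here.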
0
votes
1 answer

Remove unwanted symbols using regex in dictionary

I have a dataset that contains a list in each row. I want to remove unwanted symbols from my dataset and replace the entries that contain [] with none. But the code is not working. Here is the code for the data cleaning: def clean_data(data): …
leeteerah
  • 43
  • 5
0
votes
1 answer

Import DataFrame XLS with variable structure with Python

A couple of days ago I received a data set that is somewhat difficult to deal with. The only fixed thing I see in it is that the records always start in row 9 and the column names are in row 7. As the picture shows…
danny
  • 45
  • 4
0
votes
2 answers

Remove specific words from a column r

I'm looking to remove specific words (for example "co", "INC", etc.) from a column in data without removing the same letters from other words in the same column. In other words, I only want to remove these words when they are free-standing. This is a…
bear_525
  • 41
  • 5
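
The question is about R, but the underlying idea (match the words only at word boundaries) can be sketched in Python. The helper name `remove_words`, the sample string, and the choice to ignore case are assumptions for the example.

```python
import re


def remove_words(text, words):
    """Remove the given words only where they stand alone."""
    # \b marks a word boundary, so "co" is removed but "company" is untouched.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removed words.
    return " ".join(cleaned.split())


# Hypothetical company name for illustration.
example = remove_words("Acme co INC company", ["co", "INC"])
```

R's regular expressions also support the `\b` word-boundary anchor, so the same pattern should carry over, though punctuation such as a trailing period would need separate handling.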
0
votes
1 answer

R group by with each grouped element associated with most common factor

I want to group by column a and choose the most common factor b for each unique a. For example: tibble(a = c(1,1,1,2,2,2), b = factor(c('cat', 'dog', 'cat', 'cat', 'dog', 'dog'))) %>% reframe(b = most_common(b), .by = a) I want this to…
at.
  • 50,922
  • 104
  • 292
  • 461
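
The question asks for an R solution; as a language-neutral illustration of the logic, here is a Python sketch that counts values per group and keeps the most frequent one. The function name `most_common_per_group` is made up, and the pairs mirror the `a` and `b` values from the excerpt.

```python
from collections import Counter, defaultdict


def most_common_per_group(pairs):
    """Map each group key to its most frequent value."""
    counts = defaultdict(Counter)
    for key, value in pairs:
        counts[key][value] += 1
    # On a tie, Counter.most_common keeps the value that was seen first.
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


# (a, b) pairs taken from the tibble in the question.
pairs = [(1, "cat"), (1, "dog"), (1, "cat"), (2, "cat"), (2, "dog"), (2, "dog")]
result = most_common_per_group(pairs)
```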
0
votes
3 answers

How do I drop specific NaN values while keeping others based on a pattern of NaNs in the dataframe in Pandas?

I have a dataframe like this: The pattern it follows is that if the first n rows of Column A are populated, then the next n rows will be "populated" (populated cells could have NaN values as well, but I call them "populated" because I'd like to…
A OP
  • 11
  • 1
0
votes
2 answers

Using for loop and mutate to create variables

I have a dataset that has a column for the type and a column for the amount. I'm trying to clean the data so that it is one column for each possible type with the amount as the value in that column. My data looks like this: df <- data.frame( …
0
votes
1 answer

Keep the first 4 words in a column

I'm trying to keep only the first 4 words of a column in my data while still keeping the other observations that have fewer than 4 words. This is a sample of what some of the data looks like. State Company Number of workers X FAIRFIELD…
bear_525
  • 41
  • 5
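
A minimal Python sketch of the truncation logic, assuming words are separated by whitespace; the helper name `first_n_words` and the sample strings are invented. Slicing past the end of a list is safe, so entries with fewer than four words pass through unchanged.

```python
def first_n_words(text, n=4):
    """Keep at most the first n whitespace-separated words."""
    return " ".join(text.split()[:n])


# Hypothetical values: one long entry and one short entry.
long_entry = first_n_words("one two three four five six")
short_entry = first_n_words("only two")
```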
0
votes
0 answers

Filtering a data frame containing worldwide data and filtering only US states?

I have a dataframe that contains a State column with the names of states and n date columns of confirmed COVID cases. How can I filter the data frame so that I only have the 50 states, the US territories, the District of Columbia, and the US…
0
votes
1 answer

Variable selection in big data

I am trying to build a regression model for big data with 220 variables. The 220 variables are binary, taking the values zero and one. Some variables are correlated (though not highly). Also, some of the variables have 60% or more of their…