Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
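The four operations named above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the valid-range check, the median-absolute-deviation rule for outliers, and the synonym table are all hypothetical choices, not part of the tag description.

```python
from statistics import median


def mark_invalid(values, low, high):
    """Replace values outside [low, high] with None (meaning 'unavailable')."""
    return [v if v is not None and low <= v <= high else None for v in values]


def interpolate_missing(values):
    """Linearly fill None gaps that lie between two known samples.

    Leading and trailing gaps are left as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


def remove_outliers(values, threshold=3.0):
    """Drop values farther than threshold * MAD from the median (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) <= threshold * mad]


def merge_synonyms(labels, canonical):
    """Trim and lowercase each label, then map known synonyms to one spelling."""
    cleaned = [label.strip().lower() for label in labels]
    return [canonical.get(label, label) for label in cleaned]
```

For instance, `interpolate_missing(mark_invalid([20.0, None, 22.0, -999.0, 23.0], -50, 60))` first turns the sentinel `-999.0` into `None` and then fills both gaps from their neighbours.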

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
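As a rough sketch of what "each row is an observation and each column is a variable" means in practice, the snippet below reshapes a wide table, where one variable's values are spread across the column headers, into a long one. The function `to_tidy` and the sample column names are hypothetical and only serve the illustration.

```python
def to_tidy(wide_rows, id_column, variable_name, value_name):
    """Turn each non-id column of a wide row into its own observation row."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


# Wide form: the years are stored as column headers rather than as values.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]

# Tidy form: one row per (country, year) observation.
tidy = to_tidy(wide, id_column="country", variable_name="year", value_name="cases")
```

Each dictionary in `tidy` now holds exactly one observation, with `country`, `year`, and `cases` as its variables.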

3430 questions
6
votes
2 answers

python pandas: split comma-separated column into new columns - one per value

I have a dataframe like this: data = np.array([["userA","event2, event3"], ['userB',"event3, event4"], ['userC',"event2"]]) data = pd.DataFrame(data) 0 1 0 userA "event2, event3" 1 userB "event3,…
funkfux
  • 283
  • 3
  • 14
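One possible reading of the question above is that each distinct event should become its own 0/1 indicator column. The excerpt is truncated, so the desired output is an assumption here, as are the column names `user` and `events`. A minimal pandas sketch under those assumptions:

```python
import pandas as pd

# Sample frame modelled on the question; the column names are assumptions.
df = pd.DataFrame(
    [["userA", "event2, event3"], ["userB", "event3, event4"], ["userC", "event2"]],
    columns=["user", "events"],
)

# Normalize ", " to "," so tokens carry no stray spaces, then build one
# 0/1 indicator column per distinct event.
indicators = (
    df["events"].str.replace(", ", ",", regex=False).str.get_dummies(sep=",")
)

# Attach the indicator columns to the user column.
result = pd.concat([df[["user"]], indicators], axis=1)
```

If positional columns (first event, second event, and so on) are wanted instead, `Series.str.split` with `expand=True` would be the closer fit.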
6
votes
5 answers

Python remove hashtag symbol and keep key words

I want to remove the hashtag symbol ('#') and the underscores ('_') that separate words. Example: "this tweet is example #key1_key2_key3"; the result I want: "this tweet is example key1 key2 key3". My code using string : #Remove punctuation , # Hashtag…
Noura
  • 151
  • 1
  • 3
  • 10
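The transformation described in the question above could be sketched with a regular expression. The function name `strip_hashtags` is hypothetical, and the sketch assumes that underscores outside of hashtags should be left untouched.

```python
import re


def strip_hashtags(text):
    """Drop the leading '#' of each hashtag and turn its underscores into spaces."""

    def expand(match):
        # group(1) is the hashtag body without the '#'.
        return match.group(1).replace("_", " ")

    # \w matches letters, digits, and underscores, so the whole tag is captured.
    return re.sub(r"#(\w+)", expand, text)


cleaned = strip_hashtags("this tweet is example #key1_key2_key3")
```

With the example from the question, `cleaned` is `"this tweet is example key1 key2 key3"`.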
6
votes
2 answers

R - select only factor columns of dataframe

I am trying to select only factor columns from my data frame. Example is below: bank[,apply(bank[,names(bank)!="id"], is.factor)] But the code behaves strangely. Step by step: sapply(bank[,names(bank)!="id"], is.factor) I get: age sex …
Maksim Khaitovich
  • 4,742
  • 7
  • 39
  • 70
6
votes
2 answers

How to replace outlier data in pandas?

I have stock data grabbed from Yahoo Finance, but the adjusted close data is wrong somehow. adj_close close ratio date 2014-10-16 240.4076 2466.40 0.097473 2014-10-17 245.8173 2521.90 …
Akash Chandra
  • 61
  • 1
  • 4
6
votes
1 answer

openrefine flag changed rows

I'm using OpenRefine to clean up an Excel data set. I have about 70 operations and I've been cutting and pasting on different data sets. I maintain a record id and export to a new Excel sheet. Then I reload the sheet using the record id. It works…
Sonicthoughts
  • 548
  • 1
  • 4
  • 16
6
votes
3 answers

Determine if string format is "May 16, 2013" or UNIX Timestamp with Javascript

Doing some data wrangling with a large dataset. The data has a "date" field that randomly switches between a format like "1370039735000" and "May 16, 2013". So far I've converted other date fields with either new Date("May 16, 2013") or new…
Julian
  • 1,853
  • 5
  • 27
  • 48
6
votes
3 answers

Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked in a particular month. I wish to create a new set of variables that…
Andreas
  • 1,923
  • 19
  • 24
5
votes
3 answers

Transforming complete age from character to numeric in R

I have a dataset with people's complete age as strings (e.g., "10 years 8 months 23 days") in R, and I need to transform it into a numeric variable that makes sense. I'm thinking about converting it to how many days of age the person has (which is…
Ruam Pimentel
  • 1,288
  • 4
  • 16
5
votes
2 answers

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values. I get errors due to…
5
votes
2 answers

How to transfer negative value at current row to previous row in a data frame?

I want to transfer the negative values at the current row to the previous row by adding them to the previous row within each group. Following is the sample raw data I have: raw_data <- data.frame(GROUP = rep(c('A','B','C'),each = 6), …
5
votes
3 answers

Remove special characters from entire dataframe in R

Question: How can you use R to remove all special characters from a dataframe, quickly and efficiently? Progress: This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the…
PizzaAndCode
  • 340
  • 1
  • 3
  • 12
5
votes
2 answers

Pandas - Remove strings from a float number in a column

I have a dataframe like the following: plan type hour status code A cont 0 ok 010.0 A cont 2 ok 025GWA A cont 0 notok 010VVT A cont 0 other 6.05 A vend 1 ok 6.01 The column code…
Thabra
  • 337
  • 2
  • 9
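The excerpt above is cut off before the desired output, so the following is only a guess at the intent: keep the leading numeric part of each `code` value and discard any trailing letters. The helper name `leading_number` is hypothetical, and the sketch uses plain Python rather than a pandas-specific call.

```python
import re

# An optional run of whitespace, then digits with an optional decimal part.
_LEADING_NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)?)")


def leading_number(code):
    """Return the numeric prefix of code as a float, or None if there is none."""
    match = _LEADING_NUMBER.match(code)
    return float(match.group(1)) if match else None


# The code values shown in the question.
codes = ["010.0", "025GWA", "010VVT", "6.05", "6.01"]
numbers = [leading_number(code) for code in codes]
```

On a DataFrame, the same function could be applied to the column with `Series.map`.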
5
votes
2 answers

Using gsub() on a dataframe

I have a CSV datafile called test_20171122. Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file. I am looking into the optimal way to clean data from an accounting format…
Brandon
  • 59
  • 1
  • 1
  • 3
5
votes
2 answers

How to match a string and white space in R

I have a dataframe with columns having values like: "Average 18.24" "Error 23.34". My objective is to remove the text and the following space from these in R. Can anybody help me with a regex pattern to do this? I am able to successfully do this…
duvvurum
  • 337
  • 2
  • 4
  • 9
5
votes
2 answers

How can I write an R script to check for straight-lining; i.e., whether, for any given row, all values in a set of columns have the same value

I would like to create a dichotomous variable that tells me whether a participant gave the same response to each of 10 questions. Each row is a participant and I want to write a simple script to create this new variable/vector in my data frame. …
Bofstein
  • 55
  • 1
  • 6