Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, in which each row is an observation and each column is a variable.
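The four operations named in the tag description can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended pipeline: the sample readings, the `-999.0` sentinel, the median-absolute-deviation threshold `k`, and all function names are assumptions made up for the example.

```python
from statistics import median

# Hypothetical raw readings: None is a missing sample, -999.0 an invalid sentinel.
raw = [10.0, 11.0, None, 13.0, -999.0, 12.0, 250.0, 11.5]


def mark_invalid(values, sentinel=-999.0):
    """Mark the (assumed) sentinel value as unavailable by replacing it with None."""
    return [None if v == sentinel else v for v in values]


def remove_outliers(values, k=5.0):
    """Replace values more than k median absolute deviations from the median with None."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)
    return [None if v is not None and abs(v - med) > k * mad else v for v in values]


def interpolate(values):
    """Fill interior gaps linearly between the nearest known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def merge_synonyms(labels, mapping):
    """Map differently spelled labels onto one canonical value."""
    return [mapping.get(label.strip().lower(), label.strip()) for label in labels]


cleaned = interpolate(remove_outliers(mark_invalid(raw)))
cities = merge_synonyms(
    ["NY", "New York", "new york ", "Boston"],
    {"ny": "New York", "new york": "New York"},
)
print(cleaned)
print(cities)
```

Gaps at the very start or end of the series are left as None, since linear interpolation needs a known value on both sides.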

3430 questions

votes

2 answers

python pandas: split comma-separated column into new columns - one per value

I have a dataframe like this: data = np.array([["userA","event2, event3"], ['userB',"event3, event4"], ['userC',"event2"]]) data = pd.DataFrame(data) 0 1 0 userA "event2, event3" 1 userB "event3,…

python pandas data-cleaning

asked Feb 16 '18 at 08:56

funkfux

votes

5 answers

Python remove hashtag symbol and keep key words

I want to remove the hashtag symbol ('#') and the underscores that separate words ('_'). Example: "this tweet is example #key1_key2_key3"; the result I want: "this tweet is example key1 key2 key3". My code using string: #Remove punctuation , # Hashtag…

python data-cleaning

asked Feb 08 '18 at 08:45

Noura
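One possible approach to the hashtag question above, sketched with Python's `re` module. The function name `strip_hashtags` is hypothetical, and the sketch assumes that underscores should become spaces only inside a hashtag, leaving underscores elsewhere in the text untouched.

```python
import re


def strip_hashtags(text):
    """Drop the leading '#' of each hashtag and turn its underscores into spaces."""

    def expand(match):
        # match.group(1) is the hashtag body without the '#'.
        return match.group(1).replace("_", " ")

    # \w matches letters, digits and underscores, so '#key1_key2_key3' is one match.
    return re.sub(r"#(\w+)", expand, text)


result = strip_hashtags("this tweet is example #key1_key2_key3")
print(result)  # this tweet is example key1 key2 key3
```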

votes

2 answers

R - select only factor columns of dataframe

I am trying to select only factor columns from my data frame. Example is below: bank[,apply(bank[,names(bank)!="id"], is.factor)] But the code behaves strangely. Step by step: sapply(bank[,names(bank)!="id"], is.factor) I get: age sex …

r dataframe data-science data-cleaning

asked Mar 31 '17 at 19:07

Maksim Khaitovich

4,742
7
39
70

votes

2 answers

How to replace outlier data in pandas?

I have stock data grabbed from Yahoo Finance; the adjusted close data is wrong somehow. adj_close close ratio date 2014-10-16 240.4076 2466.40 0.097473 2014-10-17 245.8173 2521.90 …

python pandas data-cleaning

asked Nov 20 '16 at 06:17

Akash Chandra

votes

1 answer

openrefine flag changed rows

I'm using OpenRefine to clean up an Excel data set. I have about 70 operations and I've been cutting and pasting on different data sets. I maintain a record id and export to a new Excel sheet. Then I reload the sheet using the record id. It works…

data-cleaning openrefine opendata

asked May 07 '14 at 17:41

Sonicthoughts

votes

3 answers

Determine if string format is "May 16, 2013" or UNIX Timestamp with Javascript

Doing some data wrangling with a large dataset. The data has a "date" field that randomly switches between a format like "1370039735000" and "May 16, 2013". So far I've converted other date fields with either new Date("May 16, 2013") or new…

javascript regex date data-cleaning

asked Dec 27 '13 at 19:33

Julian

1,853
5
27
48

votes

3 answers

Performing Operations on a Subset Using Data Table

I have a survey data set in wide form. For a particular question, a set of variables was created in the raw data to represent the fact that the survey question was asked in a particular month. I wish to create a new set of variables that…

r data.table plyr data-cleaning

asked Apr 22 '13 at 18:05

Andreas

1,923
19
24

votes

3 answers

Transforming complete age from character to numeric in R

I have a dataset with people's complete age as strings (e.g., "10 years 8 months 23 days") in R, and I need to transform it into a numeric variable that makes sense. I'm thinking about converting it to how many days of age the person has (which is…

r data-cleaning lubridate stringr data-wrangling

asked Dec 01 '21 at 20:59

Ruam Pimentel

1,288
4
16

votes

2 answers

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values. I get errors due to…

machine-learning nan logistic-regression missing-data data-cleaning

asked Oct 02 '18 at 00:48

stats_nerd

votes

2 answers

How to transfer negative value at current row to previous row in a data frame?

I want to transfer the negative values at the current row to the previous row by adding them to the previous row within each group. Following is the sample raw data I have: raw_data <- data.frame(GROUP = rep(c('A','B','C'),each = 6), …

r dataframe dplyr data.table data-cleaning

asked Aug 26 '18 at 08:05

siddhesh tiwari

votes

3 answers

Remove special characters from entire dataframe in R

Question: How can you use R to remove all special characters from a dataframe, quickly and efficiently? Progress: This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the…

r data-science data-cleaning

asked Apr 17 '18 at 20:18

PizzaAndCode

votes

2 answers

Pandas - Remove strings from a float number in a column

I have a dataframe like the following: plan type hour status code A cont 0 ok 010.0 A cont 2 ok 025GWA A cont 0 notok 010VVT A cont 0 other 6.05 A vend 1 ok 6.01 The column code…

python pandas data-cleaning

asked Jan 18 '18 at 10:55

Thabra

votes

2 answers

Using gsub() on a dataframe

I have a CSV data file called test_20171122. Often, the datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file. I am looking into the optimal way to clean data from an accounting format…

r dataframe formatting gsub data-cleaning

asked Nov 22 '17 at 22:48

Brandon

votes

2 answers

How to match a string and white space in R

I have a dataframe with columns having values like: "Average 18.24" "Error 23.34". My objective is to replace the text and the following space in these values, in R. Can anybody help me with a regex pattern to do this? I am able to successfully do this…

regex r data-cleaning

asked Jul 15 '16 at 09:24

duvvurum

votes

2 answers

How can I write an R script to check for straight-lining; i.e., whether, for any given row, all values in a set of columns have the same value

I would like to create a dichotomous variable that tells me whether a participant gave the same response to each of 10 questions. Each row is a participant and I want to write a simple script to create this new variable/vector in my data frame. …

r dataset data-cleaning logic

asked Jun 22 '16 at 23:50

Bofstein
