Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors, and normalizing data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged. One approach to data cleaning is the "tidy data" framework from Wickham, which means each row is an observation and each column is a variable.

One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, which means each row is an observation and each column is a variable.

3430 questions

votes

7 answers

How to remove a character from some rows in a dataframe column?

I have a large DataFrame that I need to clean, as a sample please look at this dataframe: import pandas as pd cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'], 'Price':…

asked Mar 19 '21 at 13:50

sam_sam

votes

8 answers

Blocking '0000-00-00' from MySQL Date Fields

I have a database where old code likes to insert '0000-00-00' in Date and DateTime columns instead of a real date. So I have the following two questions: Is there anything that I could do on the db level to block this? I know that I can set a…

mysql datetime date data-cleaning

asked Oct 08 '10 at 15:12

David

votes

5 answers

How to remove carriage return in a dataframe

I am having a dataframe that contains columns named id, country_name, location and total_deaths. While doing data cleaning process, I came across a value in a row that has '\r' attached. Once I complete cleaning process, I store the resulting…

python pandas replace carriage-return data-cleaning

asked May 11 '16 at 11:13

Saranya

votes

2 answers

Filter pandas data frame for col == None

I have a data frame data_df with multiple columns, one of which is c which holds country names. How do I filter out the rows where c == None. My first attempt was to do this: countries_df = data_df[data_df.c != None] However, that yielded 0 rows.…

python pandas subset data-cleaning

asked Oct 08 '14 at 05:08

kk415kk

1,227
1
14
30

votes

3 answers

cleaning data with dropna in Pyspark

I'm still relatively new to Pyspark. I use version 2.1.0. I'm trying to clean some data on a much larger data set. I've successfully used several techniques such as "dropDuplicates" along with subsets and sql functions (distinct, count etc). I then…

pyspark data-cleaning

asked Apr 08 '17 at 15:52

alortimor

votes

2 answers

R: Deleting rows based on a value in a column from a large data set in R

I am working in R on data set of 104500 observations. I want to delete rows based on a column name "state" that has values "TX" and "NY". I am using the following code customers <- customers[customers$State != "TX"] I'm getting the following…

r data-cleaning

asked Oct 04 '17 at 21:09

deadpool

votes

3 answers

Looping grepl() through data.table (R)

I have a dataset stored as a data.table DT that looks like this: print(DT) category industry 1: administration admin 2: nurse practitioner truck 3: trucking truck 4: administration admin 5: warehousing …

regex r data.table data-cleaning

asked Nov 13 '15 at 18:04

grrothman

votes

5 answers

Splitting a single column into multiple observation using R

I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference : code label 61000-61003 excision of CNS 0169T-0169T ventricular…

r data.table medical data-cleaning splitstackshape

asked Oct 13 '15 at 21:50

x1carbon

votes

5 answers

How to loop through columns, check if a particular value exists in any of the columns, mutate a new column and enter 1 if it exists, 0 if not?

I am working on a research project, and one of the tables is entered in a way that is not quite suitable for analysis yet, so I am trying to reorganize it. Currently, each row is a test-taker, and each column is a question they have answered…

r dataframe dplyr data-cleaning

asked Sep 18 '19 at 15:37

Johe Houston

votes

1 answer

'float' object has no attribute 'strip'

I want to clean one column of my df['emp_length'] [shown in the screen shot]1 but when I use df_10v['emp_length'] = df_10v['emp_length'].map(lambda x: x.lstrip('<').rstrip('+')) to remove thing i dont want. It gave me an error: 'float' object has…

python pandas dataframe strip data-cleaning

asked Aug 22 '17 at 19:19

Pumpkin C

1,452
6
21
27

votes

1 answer

Python Pandas -- Forward filling entire rows with value of one previous column

New to pandas development. How do I forward fill a DataFrame with the value contained in one previously seen column? Self-contained example: import pandas as pd import numpy as np O = [1, np.nan, 5, np.nan] H = [5, np.nan, 5, np.nan] L = [1, np.nan,…

python pandas dataframe data-science data-cleaning

asked Jul 23 '17 at 06:05

Malachai

votes

2 answers

Impute via fill-forward/LOCF a column over a range of sequential rows in SQL?

When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ). While data-analysis environments typically provide that…

sql time-series data-warehouse data-cleaning

asked Apr 13 '15 at 18:27

DavidJ

4,369
4
26
42

votes

3 answers

Clean R data frame so that in a column no row value is bigger than 2 times next row value

I have a data frame exemplified by the following dist <- c(1.1,1.0,10.0,5.0,2.1,12.2,3.3,3.4) id <- rep("A",length(dist)) df<-cbind.data.frame(id,dist) df id dist 1 A 1.1 2 A 1.0 3 A 10.0 4 A 5.0 5 A 2.1 6 A 12.2 7 A 3.3 8 A 3.4 I…

r dataframe data-manipulation data-cleaning

asked Jan 29 '15 at 17:21

Kristian

votes

6 answers

Pivot dataframe to keep column headings and sub-headings in R

I am trying to pivot a table that has headings and sub-headings, so that the headings go into a column "date", and the subheadings are two columns instead of repeating. Here is an example of my data. This was produced using dput(), so while in the…

r pivot pivot-table tidyr data-cleaning

asked Jan 05 '22 at 20:25

cgxytf

votes

3 answers

Summarizing data by name separated across multiple variables

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by (player name) is listed across three variables…

r dplyr data-cleaning

asked Jul 18 '18 at 15:10

Evan O.

1,553
2
11
20

Prev 1

…

99 100 Next