Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors, and normalizing data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged. One approach to data cleaning is the "tidy data" framework from Wickham, which means each row is an observation and each column is a variable.

Data cleaning is the process of removing or repairing errors, and normalizing data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, which means each row is an observation and each column is a variable.

3430 questions
12
votes
7 answers

How to remove a character from some rows in a dataframe column?

I have a large DataFrame that I need to clean, as a sample please look at this dataframe: import pandas as pd cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Suzuki'], 'Price':…
sam_sam
  • 449
  • 1
  • 5
  • 16
12
votes
8 answers

Blocking '0000-00-00' from MySQL Date Fields

I have a database where old code likes to insert '0000-00-00' in Date and DateTime columns instead of a real date. So I have the following two questions: Is there anything that I could do on the db level to block this? I know that I can set a…
David
  • 407
  • 3
  • 8
  • 13
12
votes
5 answers

How to remove carriage return in a dataframe

I am having a dataframe that contains columns named id, country_name, location and total_deaths. While doing data cleaning process, I came across a value in a row that has '\r' attached. Once I complete cleaning process, I store the resulting…
Saranya
  • 786
  • 1
  • 6
  • 20
12
votes
2 answers

Filter pandas data frame for col == None

I have a data frame data_df with multiple columns, one of which is c which holds country names. How do I filter out the rows where c == None. My first attempt was to do this: countries_df = data_df[data_df.c != None] However, that yielded 0 rows.…
kk415kk
  • 1,227
  • 1
  • 14
  • 30
10
votes
3 answers

cleaning data with dropna in Pyspark

I'm still relatively new to Pyspark. I use version 2.1.0. I'm trying to clean some data on a much larger data set. I've successfully used several techniques such as "dropDuplicates" along with subsets and sql functions (distinct, count etc). I then…
alortimor
  • 351
  • 2
  • 7
  • 18
9
votes
2 answers

R: Deleting rows based on a value in a column from a large data set in R

I am working in R on data set of 104500 observations. I want to delete rows based on a column name "state" that has values "TX" and "NY". I am using the following code customers <- customers[customers$State != "TX"] I'm getting the following…
deadpool
  • 139
  • 1
  • 2
  • 7
8
votes
3 answers

Looping grepl() through data.table (R)

I have a dataset stored as a data.table DT that looks like this: print(DT) category industry 1: administration admin 2: nurse practitioner truck 3: trucking truck 4: administration admin 5: warehousing …
grrothman
  • 171
  • 1
  • 11
8
votes
5 answers

Splitting a single column into multiple observation using R

I am working on HCUP data and this has range of values in one single column that needs to be split into multiple columns. Below is the HCUP data frame for reference : code label 61000-61003 excision of CNS 0169T-0169T ventricular…
x1carbon
  • 287
  • 1
  • 15
7
votes
5 answers

How to loop through columns, check if a particular value exists in any of the columns, mutate a new column and enter 1 if it exists, 0 if not?

I am working on a research project, and one of the tables is entered in a way that is not quite suitable for analysis yet, so I am trying to reorganize it. Currently, each row is a test-taker, and each column is a question they have answered…
7
votes
1 answer

'float' object has no attribute 'strip'

I want to clean one column of my df['emp_length'] [shown in the screen shot]1 but when I use df_10v['emp_length'] = df_10v['emp_length'].map(lambda x: x.lstrip('<').rstrip('+')) to remove thing i dont want. It gave me an error: 'float' object has…
Pumpkin C
  • 1,452
  • 6
  • 21
  • 27
7
votes
1 answer

Python Pandas -- Forward filling entire rows with value of one previous column

New to pandas development. How do I forward fill a DataFrame with the value contained in one previously seen column? Self-contained example: import pandas as pd import numpy as np O = [1, np.nan, 5, np.nan] H = [5, np.nan, 5, np.nan] L = [1, np.nan,…
Malachai
  • 73
  • 1
  • 7
7
votes
2 answers

Impute via fill-forward/LOCF a column over a range of sequential rows in SQL?

When formatting data for time-series analysis, a common need is to impute missing values by filling-forward values over time (also called Last-Observation-Carried-Forward / LOCF ). While data-analysis environments typically provide that…
DavidJ
  • 4,369
  • 4
  • 26
  • 42
7
votes
3 answers

Clean R data frame so that in a column no row value is bigger than 2 times next row value

I have a data frame exemplified by the following dist <- c(1.1,1.0,10.0,5.0,2.1,12.2,3.3,3.4) id <- rep("A",length(dist)) df<-cbind.data.frame(id,dist) df id dist 1 A 1.1 2 A 1.0 3 A 10.0 4 A 5.0 5 A 2.1 6 A 12.2 7 A 3.3 8 A 3.4 I…
Kristian
  • 73
  • 3
6
votes
6 answers

Pivot dataframe to keep column headings and sub-headings in R

I am trying to pivot a table that has headings and sub-headings, so that the headings go into a column "date", and the subheadings are two columns instead of repeating. Here is an example of my data. This was produced using dput(), so while in the…
cgxytf
  • 421
  • 4
  • 11
6
votes
3 answers

Summarizing data by name separated across multiple variables

I'm trying to count totals for goals, primary assists, and secondary assists for each player. My problem is that I can't get my head around the logic to do that, as the data I want to summarize by (player name) is listed across three variables…
Evan O.
  • 1,553
  • 2
  • 11
  • 20