Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used in computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
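
As a rough illustration of the four operations listed above, here is a minimal pandas sketch. The column names, the toy values, the plausible temperature range, and the synonym map are all made up for this example; real data would need its own rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every name and value here is invented for illustration.
df = pd.DataFrame({
    "temp_c": [20.0, 21.0, np.nan, 23.0, 500.0, 25.0],
    "age": [34, -1, 52, 41, 29, 38],
    "os": ["Windows", "windows 10", "Linux", "linux", "Windows", "Linux"],
})

# 1. Mark invalid values as unavailable: a negative age becomes NaN.
df["age"] = df["age"].where(df["age"] >= 0)

# 2. Remove outliers: keep rows whose temperature is within an assumed
#    plausible range, and keep missing readings so they can be filled in next.
in_range = df["temp_c"].between(-50, 60) | df["temp_c"].isna()
df = df[in_range].reset_index(drop=True)

# 3. Interpolate missing samples linearly between their neighbours.
df["temp_c"] = df["temp_c"].interpolate()

# 4. Merge synonymous values into one canonical label.
synonyms = {"windows 10": "Windows", "linux": "Linux"}
df["os"] = df["os"].replace(synonyms)

print(df)
```

The order matters in this sketch: the outlier row is dropped before interpolating so that the implausible reading does not influence the filled-in value.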

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
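
A small sketch of what that layout can look like in practice, assuming a hypothetical "wide" table in which each year has its own column. The table, its column names, and its values are invented for this example; `DataFrame.melt` is one common way to reshape such a table in pandas.

```python
import pandas as pd

# Hypothetical wide table: one row per lot, one column per year.
wide = pd.DataFrame({
    "lot": ["Lot A", "Lot B"],
    "2015": [10, 30],
    "2016": [20, 40],
})

# Reshape so that each row is one observation (a lot in a given year)
# and each column is one variable (lot, year, count).
tidy = wide.melt(id_vars="lot", var_name="year", value_name="count")

# The year labels came from column names, so they are strings; convert them.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```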

3430 questions
5
votes
4 answers

Last name, First Name to First Name Last Name

I have a set of names in "last, first" format: Name Pos Team Week.x Year.x GID.x h.a.x Oppt.x Week1Points DK.salary.x Week.y Year.y GID.y 1 Abdullah, Ameer RB det 1 2015 2995 a sdg 19.4 4000 2 2015 …
JB17
5
votes
4 answers

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements: PersonID (meaningless surrogate autonumber), FirstName, MiddleInitial, LastName, NameSuffix, DateOfBirth, AlternateID (like an SSN, Military ID, etc.). I get lots of data…
mattmc3
5
votes
4 answers

Best way to clean and normalise large amount of data relying on string matching algorithm

I am currently working on a data modelling project as part of my university summer project. Client data needs a lot of cleaning, since a number of columns rely on human input and have free text. To give an example, the column Business Name has…
4
votes
3 answers

cleaning the data-frame with conditions

I am trying to clean a dataframe by deleting the wrongly added rows. This is the dummy data: temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", …
Bella_18
4
votes
4 answers

Pandas replace string values in a column which has multiple variations

I am working with this csv file. It's a small dataset of laptop information. laptops = pd.read_csv('laptops.csv',encoding="Latin-1") laptops["Operating System"].value_counts() Windows 1125 No OS 66 Linux 62 Chrome OS …
Ravindra S
4
votes
5 answers

Break Apart a String into Separate Columns R

I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own…
bodega18
4
votes
1 answer

How to make new columns out of every second row in a pandas df

I have a data frame for NBA data that I am having a hard time manipulating. I would like to change df1 to df2 by having both teams and their scores in a game along the same row twice to resemble the game's outcome from both teams'…
NickA
4
votes
3 answers

Splitting a cell in pandas into multiple rows

It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell…
4
votes
4 answers

How can I count the number of floats or integers in a column of a Pandas dataframe?

I have a Pandas Dataframe and there is a column which is of float type. But numbers with decimals don't make sense for this column. So I want to find out how many floats are in this column and after that I want to delete the whole row where I have a…
akfin
4
votes
4 answers

Top "n" rows of each group using dplyr -- with different number per group

I'll use the built-in chickwts data as an example. Here's the data; there are 5 feed types. > head(chickwts) weight feed 1 179 horsebean 2 160 horsebean 3 136 horsebean 4 227 horsebean 5 217 horsebean 6 168 horsebean >…
max
4
votes
2 answers

python dataframe income column cleanup

This may be a simple solution, but I am finding it hard to make this function work for my dataset. I have a salary column with a variety of data in it. Example dataframe below: ID Income desired Output 1 26000…
DarkKnight
4
votes
3 answers

Remove Row if NaN in First Five Columns

I have a pandas dataframe with dimensions 89 rows by 13 columns. I want to remove an entire row if NaN appears within the first five columns. Here is an example. LotName C15 C16 C17 C18 C19 Spots15 Spots16 ... Cherry St 439 464 555 …
325
4
votes
1 answer

Pandas convert list of list to columns names and append values

I have two columns in a pandas dataframe, one with keys and the second with values, where both are lists of lists. Like this: import pandas as pd example = pd.DataFrame( {'col1': [['key1','key2','key3'],['key1','key4'],['key1', 'key3', 'key4','key5']], 'col2':…
4
votes
2 answers

Interpolation / stretching out of values in vector to a specified length

I have vectors of different lengths. For example, a1 = c(1,2,3,4,5,6,7,8,9,10) a2 = c(1,3,4,5) a3 = c(1,2,5,6,9) I want to stretch out a2 and a3 to the length of a1, so I can run some algorithms on them that require the lengths of the vectors to be the…
4
votes
2 answers

Using the dplyr mutate function to replace multiple values

In the following data, the levels for both variables are coded numerically: dat = read.csv("https://studio.edx.org/c4x/HarvardX/PH525.1x/asset/assoctest.csv") head(dat) I am replacing these codes with character strings to make for easier reading and…
Robert