Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
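As a rough illustration of the operations named in the tag description, here is a minimal Python sketch. The synonym table, the valid range, and the sample values are all made up for the example, and the range check stands in for whatever validity rule a real dataset would need.

```python
# Hypothetical synonym table: several spellings collapse to one canonical label.
SYNONYMS = {"usa": "US", "u.s.": "US", "united states": "US"}


def merge_synonyms(value):
    """Map known spelling variants to one label; other values are only trimmed."""
    cleaned = value.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def mark_invalid(values, lo, hi):
    """Replace values outside [lo, hi] with None, i.e. mark them as unavailable."""
    return [v if v is not None and lo <= v <= hi else None for v in values]


def interpolate_missing(values):
    """Fill interior None gaps linearly; leading and trailing gaps stay None."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            out[i] = out[a] + frac * (out[b] - out[a])
    return out


# Made-up readings: None is a missing sample, -999.0 is an invalid sentinel.
readings = [10.0, None, 14.0, -999.0, 18.0]
cleaned = interpolate_missing(mark_invalid(readings, 0.0, 100.0))
countries = [merge_synonyms(c) for c in ["USA", " u.s. ", "United States", "Canada"]]

print(cleaned)
print(countries)
```

Marking invalid values first and interpolating second means the sentinel is treated like any other missing sample, which is one reasonable ordering but not the only one.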

3430 questions

votes

4 answers

Last name, First Name to First Name Last Name

I have a set of names in last, first format Name Pos Team Week.x Year.x GID.x h.a.x Oppt.x Week1Points DK.salary.x Week.y Year.y GID.y 1 Abdullah, Ameer RB det 1 2015 2995 a sdg 19.4 4000 2 2015 …

r data-cleaning

asked Nov 20 '15 at 12:23

JB17
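The question above is tagged r, but the reordering itself is language-independent. A minimal Python sketch, assuming every name has at most one comma separating last name from first name (the helper name `first_last` is made up here):

```python
def first_last(name):
    """Turn 'Last, First' into 'First Last'; names without a comma are only trimmed."""
    last, sep, first = name.partition(",")
    if not sep:
        return name.strip()
    return f"{first.strip()} {last.strip()}"


print(first_last("Abdullah, Ameer"))
```

Names with suffixes or multiple commas would need extra rules that this sketch does not attempt.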

votes

4 answers

Fuzzy data matching for personal demographic information

Let's say I have a database filled with people with the following data elements: PersonID (meaningless surrogate autonumber) FirstName MiddleInitial LastName NameSuffix DateOfBirth AlternateID (like an SSN, Military ID, etc.) I get lots of data…

c# .net algorithm string-matching data-cleaning

asked Jul 16 '10 at 13:56

mattmc3

17,595
7
83
103
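The question above is tagged c#, so the following is only a Python sketch of one possible idea, not an answer from the thread: require an exact date-of-birth match, then compare names with the standard library's difflib.SequenceMatcher. The field names come from the excerpt; the sample records and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def likely_same_person(p, q, threshold=0.85):
    """Treat two records as a match if DateOfBirth is equal and the full names are similar."""
    if p["DateOfBirth"] != q["DateOfBirth"]:
        return False
    name_p = f'{p["FirstName"]} {p["LastName"]}'
    name_q = f'{q["FirstName"]} {q["LastName"]}'
    return similarity(name_p, name_q) >= threshold


# Made-up records for the example.
a = {"FirstName": "Jonathan", "LastName": "Smith", "DateOfBirth": "1980-01-02"}
b = {"FirstName": "Jonathon", "LastName": "Smith", "DateOfBirth": "1980-01-02"}
c = {"FirstName": "Jane", "LastName": "Doe", "DateOfBirth": "1980-01-02"}

print(likely_same_person(a, b))
print(likely_same_person(a, c))
```

A real matcher would probably also weigh MiddleInitial, NameSuffix, and AlternateID, and tolerate typos in the date of birth; those are left out here.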

votes

4 answers

Best way to clean and normalise large amount of data relying on string matching algorithm

I am currently working on a data modelling project as part of my university summer project. Client data needs a lot of cleaning, since a number of columns rely on human input and have free text. To give an example, the column Business Name has…

algorithm machine-learning string-matching data-cleaning

asked Jul 12 '15 at 10:51

Sharingan

votes

3 answers

cleaning the data-frame with conditions

I am trying to clean a dataframe by deleting the wrongly added rows. This is the dummy data : temp <- structure(list(Date = c("24/06/2002", "24/06/2002", "25/06/2002","25/06/2002", "26/06/2002", …

r dataframe data-cleaning

asked Oct 16 '22 at 10:45

Bella_18

votes

4 answers

Pandas replace string values in a column which has multiple variations

I am working with this csv file. It's a small dataset of laptop information. laptops = pd.read_csv('laptops.csv',encoding="Latin-1") laptops["Operating System"].value_counts() Windows 1125 No OS 66 Linux 62 Chrome OS …

python pandas dataframe numpy data-cleaning

asked Jun 02 '22 at 16:59

Ravindra S

6,302
12
70
108

votes

5 answers

Break Apart a String into Separate Columns R

I am trying to tidy up some data that is all contained in 1 column called "game_info" as a string. This data contains college basketball upcoming game data, with the Date, Time, Team IDs, Team Names, etc. Ideally each one of those would be their own…

r regex dplyr data-manipulation data-cleaning

asked Dec 16 '21 at 14:51

bodega18

votes

1 answer

How to make new columns out of every second row in a pandas df

I have a data frame for NBA data that I am having a hard time manipulating. I would like to change df1 to df2 by having both teams and their scores in a game along the same row twice to resemble the game's outcome from both teams'…

python pandas data-manipulation data-cleaning

asked May 05 '21 at 14:20

NickA

votes

3 answers

Splitting a cell in pandas into multiple rows

It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell…

python pandas dataframe data-science data-cleaning

asked Mar 30 '21 at 06:06

user14318465
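For the question above, one common pandas approach is to split the string into a list and then call DataFrame.explode. A minimal sketch, assuming a made-up frame with a single comma-delimited column named `tags`:

```python
import pandas as pd

# Made-up data: "tags" holds comma-delimited strings.
df = pd.DataFrame({"id": [1, 2], "tags": ["a,b", "c"]})

# Split each cell into a list, then give every list element its own row.
long_df = (
    df.assign(tags=df["tags"].str.split(","))
    .explode("tags")
    .reset_index(drop=True)
)

print(long_df)
```

Other columns, such as `id` here, are repeated for each new row. Splitting several columns at once would need more care than this sketch shows.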

votes

4 answers

How can I count the number of floats or integers in a column of a Pandas dataframe?

I have a Pandas Dataframe and there is a column which is of float type. But numbers with decimals don't make sense for this column. So I want to find out how many floats are in this column and after that I want to delete the whole row where I have a…

python pandas google-colaboratory data-cleaning

asked Mar 06 '21 at 15:43

akfin

votes

4 answers

Top "n" rows of each group using dplyr -- with different number per group

I'll use the built-in chickwts data as an example. Here's the data, there are 5 feed types. > head(chickwts) weight feed 1 179 horsebean 2 160 horsebean 3 136 horsebean 4 227 horsebean 5 217 horsebean 6 168 horsebean >…

r dplyr data-cleaning data-wrangling

asked Dec 02 '20 at 05:28

max

4,141
5
26
55

votes

2 answers

python dataframe income column cleanup

This may be a simple solution, but I am finding it hard to make this function work for my dataset. I have a salary column with a variety of data in it. Example dataframe below: ID Income desired Output 1 26000…

python pandas data-cleaning

asked Nov 23 '20 at 03:02

DarkKnight

votes

3 answers

Remove Row if NaN in First Five Columns

I have a pandas dataframe with dimensions 89 rows by 13 columns. I want to remove an entire row if NaN appears within the first five columns. Here is an example. LotName C15 C16 C17 C18 C19 Spots15 Spots16 ... Cherry St 439 464 555 …

python python-3.x pandas nan data-cleaning

asked Oct 25 '20 at 19:43

325
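For the question above, a minimal pandas sketch: build a boolean mask from the first five columns and keep only rows with no NaN there. The column names follow the excerpt; the lot names and values are made up.

```python
import numpy as np
import pandas as pd

# Made-up data using the column names from the excerpt.
df = pd.DataFrame({
    "LotName": ["A", "B", "C"],
    "C15": [1.0, np.nan, 3.0],
    "C16": [1.0, 2.0, 3.0],
    "C17": [1.0, 2.0, 3.0],
    "C18": [1.0, 2.0, 3.0],
    "C19": [np.nan, 2.0, 3.0],
})

# Keep a row only if none of its first five columns is NaN.
kept = df[df.iloc[:, :5].notna().all(axis=1)]

print(kept)
```

Row A survives because its only NaN is in the sixth column, while row B is dropped because of the NaN in C15.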

votes

1 answer

Pandas convert list of list to columns names and append values

I have two columns in a pandas dataframe, one with keys and a second with values, where both are lists of lists. Like this: import pandas as pd example = pd.DataFrame( {'col1': [['key1','key2','key3'],['key1','key4'],['key1', 'key3', 'key4','key5']], 'col2':…

python pandas dataframe data-cleaning

asked Aug 06 '20 at 18:12

Michał Gosk

votes

2 answers

Interpolation / stretching out of values in vector to a specified length

I have vectors of different lengths. For example, a1 = c(1,2,3,4,5,6,7,8,9,10) a2 = c(1,3,4,5) a3 = c(1,2,5,6,9) I want to stretch out a2 and a3 to the length of a1, so I can run some algorithms on them that require the lengths of the vectors to be the…

r interpolation data-manipulation data-cleaning data-processing

asked Jun 22 '20 at 05:17

mexicanseafood
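The question above is tagged r. As a language-neutral illustration, here is a minimal Python sketch that stretches a vector to a target length by linear interpolation over evenly spaced positions. The function name `stretch` is made up, and linear interpolation is only one possible reading of "stretch out".

```python
def stretch(values, n):
    """Resample values to length n by linear interpolation between neighbours."""
    if n < 1 or not values:
        raise ValueError("need a non-empty input and n >= 1")
    if len(values) == 1 or n == 1:
        return [float(values[0])] * n
    last = len(values) - 1
    out = []
    for i in range(n):
        # Position of output sample i on the input's index axis.
        pos = i * last / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(values[lo] + frac * (values[hi] - values[lo]))
    return out


a2 = [1, 3, 4, 5]
stretched = stretch(a2, 10)
print(stretched)
```

The first and last values are preserved exactly, and everything in between is interpolated.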

votes

2 answers

Using the dplyr mutate function to replace multiple values

In the following data the levels for both variables are coded numerically dat = read.csv("https://studio.edx.org/c4x/HarvardX/PH525.1x/asset/assoctest.csv") head(dat) I am replacing these codes with character strings to make for easier reading and…

r replace data-cleaning dplyr

asked Jun 18 '20 at 09:55

Robert
