Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
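As a rough illustration of the four operations named above (marking invalid values, removing outliers, interpolating missing samples, merging synonyms), here is a minimal sketch in plain Python. The sample data, the `-999` sentinel, the synonym table, and all function names are hypothetical and chosen only for this example; the outlier rule (distance from the median measured in median absolute deviations) is one common heuristic, not the only option.

```python
from statistics import median

# Hypothetical raw readings: None is a missing sample, -999 is an assumed
# "invalid" sentinel, and 500.0 is an implausible outlier.
raw = [10.0, 12.0, None, 16.0, -999, 500.0, 18.0]
labels = ["NY", "New York", "new york", "LA"]

SENTINEL = -999  # assumption for this sketch
SYNONYMS = {"ny": "New York", "new york": "New York", "la": "Los Angeles"}


def mark_invalid(values, sentinel=SENTINEL):
    """Mark sentinel values as unavailable (None)."""
    return [None if v == sentinel else v for v in values]


def remove_outliers(values, k=5.0):
    """Set values more than k median absolute deviations from the median to None."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    return [None if v is not None and abs(v - med) > k * mad else v
            for v in values]


def interpolate(values):
    """Linearly fill gaps that lie between two known values."""
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out  # leading or trailing gaps stay None


def merge_synonyms(items, table=SYNONYMS):
    """Map known spelling variants to one canonical label."""
    return [table.get(s.strip().lower(), s) for s in items]


cleaned = interpolate(remove_outliers(mark_invalid(raw)))
canonical = merge_synonyms(labels)
```

The order matters in this sketch: invalid values and outliers are turned into gaps first, so the interpolation step never uses them as anchor points.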

3430 questions

votes

2 answers

In R Convert to Date of several columns?

Can you give me a hand with the code below? I did try to find an answer to this but might have missed it; if there is one already, sorry for taking your time. I have a DataFrame like the example below. What I need to do is to convert all dt_ variables to…

r date dplyr data-cleaning

asked Jun 11 '20 at 21:51

Yuri

votes

5 answers

Replace values outside range with NA using replace_with_na function

I have the following dataset:
structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12, NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA, -6L))
  a  b  c
1 2  4 50
2 1  5 34
3 9  1 77
4 2  9 88
5 9 12 33
6 8 NA…

r dataframe replace na data-cleaning

asked Jun 02 '20 at 10:19

Maya

votes

2 answers

Extract phone number from noised string

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain: name, phone, email, HTML tags, addresses (with numbers). Examples: 1) Call back from +79005346546, Conversation…

sql sql-server regex database data-cleaning

asked Feb 21 '20 at 06:12

kseen

votes

1 answer

Need an efficient way in R to convert coloured utf-8 emoji characters to their default skin

Is there any efficient way to get rid of colored emojis in vectors and convert them to their standard form? Please see the two outputs, for instance; I may not be using the appropriate terms. Currently I am doing it like…

r unicode utf-8 emoji data-cleaning

asked Jan 15 '20 at 23:31

CaseebRamos

votes

1 answer

Data Cleaning (Addresses) Python

I'm looking to clean a dataset with 61k rows. I need to clean its street address column. Presently, the addresses are a nightmare. Sometimes full addresses are written out (e.g. 111 Frederick Douglass Blvd); other times the same address will be…

python pandas data-cleaning

asked Nov 18 '19 at 02:20

ynnad

votes

1 answer

How to remove error values in large df with 1000 columns

I have a large dataset with more than 1000 columns, the dataset is messy with mixed dtypes. There are 2 int64 columns, 119 float columns and 1266 object columns. I would like to begin data cleaning but realised there are several issues. As there are…

python pandas dataframe data-cleaning

asked Nov 01 '19 at 07:05

wjie08

votes

1 answer

Replace the # values present in a column in pandas dataframe with auto-incremental values by rows

The scenario is: the VendorID column contains '#' in all rows of a pandas dataframe. I have been trying to replace the '#' value in the VendorID column with the auto-increment row number. I was trying the str.replace() function:…

python pandas data-cleaning

asked Aug 27 '19 at 04:34

Sri2110

votes

2 answers

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when printing in UTF-8 locale

I am cleaning the monolingual corpus of Europarl for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is in a .gz file (which I downloaded using wget). I want to extract the text and see how it…

python python-3.x character-encoding data-cleaning french

asked Jul 25 '19 at 08:01

Sophil

votes

1 answer

How to use the pandas.melt() while keeping the NaN values?

I'm cleaning a messy data frame where some of the information needed appears in the column names. This information should be melted into a single new column. index name animal fruit …

python pandas dataframe data-cleaning

asked Feb 18 '19 at 18:09

Ahmad Ali

votes

5 answers

Python Summing up Rows in Dataframe with the same Key

I want to sum up rows in a dataframe which have the same row key. The purpose will be to shrink the data set size down. For example, if the data frame looks like this:
Fruit   Count
Apple   10
Pear    20
Apple   5
Banana  …

python pandas numpy statistics data-cleaning

asked Feb 05 '19 at 03:01

mrsquid

votes

2 answers

Retrieving the row values in pandas

I have a dataframe which has two columns: countries data United states of america(USA) 1 india13 2. I want to get the data from each row in this format: countries data United states…

python pandas data-cleaning

asked Feb 03 '19 at 18:42

surya

votes

1 answer

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order

This is my first StackOverflow post, so I hope that it isn't too difficult to understand. I have a large dataset (~14,000 rows) of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3…

r if-statement data-manipulation data-cleaning

asked Sep 12 '18 at 17:20

Dylan_Gomes

votes

0 answers

create sub-index of groups with pandas and groupby

I have a dataframe that has an ID column, and I would like to add a column to the dataframe that is an index for each unique ID. I was able to do this using 2 for-loops with the example below by making a list from the ID count, converting it to an array,…

python pandas dataframe data-cleaning

asked Sep 11 '18 at 07:29

zipline86

votes

1 answer

Messy date formats in data frame

I created a task for myself that I cannot solve - there is a dataframe with start dates and end dates of some projects. Some elements are wrong and show the duration of a project instead of the end date. start_date <- c("2017-05-04",…

r dataframe data-cleaning

asked Aug 12 '18 at 07:02

LuckyLuck

votes

6 answers

How do I only keep observations based on the max values after their decimal point?

I want to make this dataframe: (edited to show that it's an actual data frame with more than 1 column)
ID = c(100.00, 100.12, 100.36, 101.00, 102.00, 102.24, 103.00, 103.36, 103.90)
blood = c(55, 54, 74, 42, 54, 45, 65, 34, 44)
df = data.frame(ID,…

r data-cleaning

asked Jul 23 '18 at 20:03

StatsNTats
