Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
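
The operations listed above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`mark_invalid`, `interpolate_missing`, `merge_synonyms`), the valid range, and the sample values are all assumptions made up for the example.

```python
from typing import Dict, List, Optional, Sequence


def mark_invalid(values: Sequence[Optional[float]], lo: float, hi: float) -> List[Optional[float]]:
    """Mark values outside [lo, hi] (including NaN) as unavailable (None)."""
    return [v if v is not None and lo <= v <= hi else None for v in values]


def interpolate_missing(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate gaps between known samples.

    Leading and trailing gaps have no neighbour on one side, so they stay None.
    """
    out = list(values)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def merge_synonyms(labels: Sequence[str], canonical: Dict[str, str]) -> List[str]:
    """Normalize case and whitespace, then map known synonyms to one spelling."""
    normalized = [s.strip().lower() for s in labels]
    return [canonical.get(s, s) for s in normalized]


# Hypothetical sensor readings: -999.0 is a sentinel for "bad sample".
readings = [10.0, 12.0, -999.0, 16.0, None]
cleaned = interpolate_missing(mark_invalid(readings, lo=0.0, hi=100.0))

# Hypothetical country labels with inconsistent spellings.
countries = merge_synonyms(["USA", " usa ", "United States"], {"usa": "united states"})
```

Marking invalid values before interpolating keeps sentinel values such as `-999.0` from distorting the interpolated samples.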

One approach to data cleaning is the "tidy data" framework from Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf, in which each variable forms a column and each observation forms a row.
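
As a rough illustration of the tidy layout, the sketch below reshapes a "wide" table, where values of a variable are spread across column names, into one row per observation. The `melt` helper and the sample records are hypothetical and only mimic the idea behind functions such as `pandas.melt`; they are not taken from the paper.

```python
from typing import Any, Dict, List


def melt(rows: List[Dict[str, Any]], id_key: str,
         var_name: str = "variable", value_name: str = "value") -> List[Dict[str, Any]]:
    """Turn every non-id column of each row into its own (id, variable, value) row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            # Missing values (None) are kept so that no observation is silently dropped.
            long_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return long_rows


# Hypothetical wide table: the year is stored in the column names.
wide = [
    {"name": "Ann", "2019": 5, "2020": 7},
    {"name": "Bob", "2019": 3, "2020": None},
]

# Tidy table: one row per (name, year) observation, one column per variable.
tidy = melt(wide, id_key="name", var_name="year", value_name="count")
```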

3430 questions
4
votes
2 answers

In R Convert to Date of several columns?

Can you give me a hand with the code below? I did try to find an answer to this but might have missed it; if there is one already, sorry for your time. I have a DataFrame like the example below. What I need to do is to convert all dt_ variables to…
Yuri
  • 73
  • 6
4
votes
5 answers

Replace values outside range with NA using replace_with_na function

I have the following dataset structure(list(a = c(2, 1, 9, 2, 9, 8), b = c(4, 5, 1, 9, 12, NA), c = c(50, 34, 77, 88, 33, 60)), class = "data.frame", row.names = c(NA, -6L)) a b c 1 2 4 50 2 1 5 34 3 9 1 77 4 2 9 88 5 9 12 33 6 8 NA…
Maya
  • 579
  • 3
  • 12
4
votes
2 answers

Extract phone number from noised string

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain: name, phone, email, HTML tags, addresses (with numbers). Examples: 1) Call back from +79005346546, Conversation…
kseen
  • 359
  • 8
  • 56
  • 104
4
votes
1 answer

Need an efficient way in R to convert coloured utf-8 emoji characters to their default skin

Is there an efficient way to get rid of colored emojis in vectors and convert them to their standard form? Please see the two outputs, for instance; I may not be using the appropriate terms. Currently I am doing like…
CaseebRamos
  • 684
  • 3
  • 18
4
votes
1 answer

Data Cleaning (Addresses) Python

I'm looking to clean a dataset with 61k rows. I need to clean its street address column. Presently, the addresses are a nightmare. Sometimes full addresses are written out (e.g. 111 Frederick Douglass Blvd); other times the same address will be…
ynnad
  • 73
  • 1
  • 6
4
votes
1 answer

How to remove error values in large df with 1000 columns

I have a large dataset with more than 1000 columns. The dataset is messy, with mixed dtypes: there are 2 int64 columns, 119 float columns and 1266 object columns. I would like to begin data cleaning but realised there are several issues. As there are…
wjie08
  • 433
  • 2
  • 11
4
votes
1 answer

Replace the # values present in a column in pandas dataframe with auto-incremental values by rows

The scenario is: the VendorID column contains '#' in all rows of a pandas dataframe. I have been trying to replace the '#' in the VendorID column with the auto-incrementing row number. I was trying the str.replace() function:…
Sri2110
  • 335
  • 2
  • 19
4
votes
2 answers

UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when printing in UTF-8 locale

I am cleaning the monolingual Europarl corpus for French (http://data.statmt.org/wmt19/translation-task/fr-de/monolingual/europarl-v7.fr.gz). The original raw data is in a .gz file (I downloaded it using wget). I want to extract the text and see how it…
Sophil
  • 223
  • 1
  • 9
4
votes
1 answer

How to use the pandas.melt() while keeping the NaN values?

I'm cleaning a messy data frame where some of the information needed appears in the column names. This information should be melted into a single new column. index name animal fruit …
Ahmad Ali
  • 45
  • 9
4
votes
5 answers

Python Summing up Rows in Dataframe with the same Key

I want to sum up rows in a dataframe which have the same row key. The purpose is to shrink the data set down. For example, if the data frame looks like this: Fruit Count Apple 10 Pear 20 Apple 5 Banana …
mrsquid
  • 605
  • 2
  • 9
  • 24
4
votes
2 answers

Retrieving the row values in pandas

I have a dataframe which has two columns: countries data United states of america(USA) 1 india13 2 I want to get the data from the rows in this format: countries data United states…
surya
  • 71
  • 2
4
votes
1 answer

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order

This is my first StackOverflow post, so I hope that it isn't too difficult to understand. I have a large dataset (~14,000 rows) of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3…
Dylan_Gomes
  • 2,066
  • 14
  • 29
4
votes
0 answers

create sub-index of groups with pandas and groupby

I have a dataframe that has an ID column, and I would like to add a column to the dataframe that is an index for each unique ID. I was able to do this using 2 for-loops with the example below by making a list from the ID count, converting it to an array,…
zipline86
  • 561
  • 2
  • 7
  • 21
4
votes
1 answer

Messy date formats in data frame

I created a task for myself that I cannot solve - there is a dataframe with start dates and end dates of some projects. Some elements are wrong and show the duration of a project instead of the end date. start_date <- c("2017-05-04",…
LuckyLuck
  • 73
  • 1
  • 7
4
votes
6 answers

How do I only keep observations based on the max values after their decimal point?

I want to make this dataframe: (edited to show that it's an actual data frame with more than 1 column) ID = c(100.00, 100.12, 100.36, 101.00, 102.00, 102.24, 103.00, 103.36, 103.90) blood = c(55, 54, 74, 42, 54, 45, 65, 34, 44) df = data.frame(ID,…
StatsNTats
  • 49
  • 5