Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged. One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
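The four operations named in the tag description can be sketched in plain Python. This is only an illustration under stated assumptions: the sample records, the synonym table, the -999 sentinel, and the deviation threshold are all hypothetical and not taken from any question on this page.

```python
from statistics import median

# Hypothetical raw records: (city label, temperature reading).
raw = [
    ("NYC", 21.0),
    ("New York", 22.0),
    ("new york city", None),  # missing sample
    ("NYC", 24.0),
    ("N.Y.C.", 500.0),        # implausible outlier
    ("NYC", -999.0),          # sentinel standing in for "invalid"
    ("NYC", 23.0),
]

# Assumed cleaning rules (illustrative values only).
SYNONYMS = {
    "nyc": "New York",
    "n.y.c.": "New York",
    "new york": "New York",
    "new york city": "New York",
}
INVALID_SENTINEL = -999.0
MAX_DEVIATION = 50.0


def merge_synonyms(label):
    """Map synonymous labels onto one canonical spelling."""
    cleaned = label.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def mark_invalid(value):
    """Mark the sentinel value as unavailable (None)."""
    return None if value == INVALID_SENTINEL else value


def drop_outliers(rows):
    """Remove rows whose value lies too far from the median."""
    center = median(v for _, v in rows if v is not None)
    return [(k, v) for k, v in rows
            if v is None or abs(v - center) <= MAX_DEVIATION]


def interpolate(values):
    """Fill None entries linearly between the nearest known neighbours."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for i, v in enumerate(result):
        if v is not None or not known:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            result[i] = result[right]
        elif right is None:
            result[i] = result[left]
        else:
            weight = (i - left) / (right - left)
            result[i] = result[left] + weight * (result[right] - result[left])
    return result


rows = [(merge_synonyms(k), mark_invalid(v)) for k, v in raw]
rows = drop_outliers(rows)
labels = [k for k, _ in rows]
values = interpolate([v for _, v in rows])

# One observation per row, one variable per position in the tuple.
cleaned = list(zip(labels, values))
```

The order of the steps matters in this sketch: the sentinel is marked as unavailable before the median is computed, so it does not distort the outlier check, and interpolation runs last so that it only sees trusted values.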

3430 questions

votes

1 answer

Consolidate multiple columns in CSV into one

I have a CSV that contains 1000 versions of the same df/table (each from a different source) with columns Name and Age. Here is a sample df to illustrate what this CSV looks like: data = [['11', 'Nick', '10', 'Dave', '4', 'Greg'], ['7', 'Nick', '10',…

asked Mar 16 '23 at 00:44

sdpro19

votes

1 answer

Clean a String to convert to int in Pandas

I have a dataframe in pandas with a column named "Score". I need to rank by it; however, the column contains numbers like 100 200 300 but also numbers like 500,00 800,00. I need to clean this column before ranking. When I try to convert the string to…

python integer data-cleaning

asked Mar 15 '23 at 17:01

Jheison G Salazar Munoz

votes

1 answer

Complex survey data matching and cleaning

I'm trying to match some super complex personal identifiers across ~6000 survey responses, currently stored in a big Excel doc. In essence, respondents participated at time 1 and 2. They were each required to input their government identifier at…

r excel stata data-cleaning survey

asked Mar 13 '23 at 20:53

user18103480

votes

1 answer

How to merge rows that have multiple levels for specific columns in pandas

My data: I am working with the following data from the National Centers for Environmental Information (NCEI), obtained simply by using pandas' read_html(). df =…

python pandas data-cleaning data-wrangling

asked Mar 12 '23 at 13:26

bismo


votes

0 answers

Cleaning up text files for spark dataframe

I want to create a spark dataframe by reading some text files. However, the text files have some weird formatting. This is one example of the text file: These are the problems I am facing: In the first few lines, there are some headers which…

apache-spark formatting text-files data-cleaning

asked Mar 09 '23 at 16:27

Mash

votes

1 answer

Filter the dataframe preserving the first month/year from top to bottom by variable in R

I have the following dataframe in R: library(dplyr) library(tsibble) library(fpp3) usda <- read.csv("https://raw.githubusercontent.com/rhozon/datasets/master/usda_data_stovwflw.csv", head = TRUE, sep = ";") |> mutate( Dates = case_when( …

r dataframe dplyr tidyverse data-cleaning

asked Mar 08 '23 at 22:57

Rodrigo H. Ozon

votes

1 answer

Creating multiple dataframes using a loop in R

I am trying to subset my data but I was attempting to do it in a loop and then store those subsets into different dataframes. I have a dataframe named data with 20000 variables and I wanted to get sub-sets of that data. One way of doing it would be…

r dataframe for-loop subset data-cleaning

asked Mar 08 '23 at 13:39

thole

votes

1 answer

How to transpose/pivot one column into row flipping all the row information after it?

I have weather station data that I have imported with read_csv in RStudio. It's messy; my main concern is the column called "item", which contains variables for weather readings. This needs to be a row/header, with all the time series data also…

r dplyr tidyverse tidyr data-cleaning

asked Mar 08 '23 at 03:47

Sarah Lawhun

votes

1 answer

Collapsing multiple observations based on specific parameters in R

I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were…

r duplicates data-cleaning summarize

asked Mar 07 '23 at 14:15

ouroboro

votes

3 answers

Remove words from list but keep the ones only made up from the list

I have one dataframe containing strings and one list of words that I want to remove from the dataframe. However, I would like to also keep the strings from the df which are entirely made up of words from the list. Here is an…

python string replace substring data-cleaning

asked Mar 06 '23 at 14:20

RoyalPotatoe

votes

1 answer

F#: CSV cleaning

I'm trying to use F# to clean a CSV dataset. For example, I want to change string values in one column to lower case. I don't know whether it's better to work with the loaded data as CsvProvider Rows or whether I should create some struct and convert these Rows to…

csv f# data-cleaning f#-data

asked Mar 05 '23 at 16:34

grep

votes

1 answer

How to add the two split columns at a chosen position in the dataset

import pandas as pd df[["First Name", "Last Name"]] = df["Full Name"].str.split(' ', n=1, expand=True) As you can see in the image, this adds the split columns at the end of the dataset, and I want them to be at the same…

pandas dataset data-analysis data-cleaning

asked Mar 03 '23 at 15:33

Biny Kan

votes

2 answers

R - data cleaning and mutate errors

I have reviewed Error: Problem with mutate() column (...) must be size 15 or 1, not 17192, How to drop columns with column names that contain specific string?, Remove columns that contain a specific word, and associated error troubleshooting. I have…

r dplyr tidyverse data-cleaning mutate

asked Mar 02 '23 at 16:05

Marnee Roundtree

votes

1 answer

Python script to transpose certain column's values into separate columns

I have this data frame:

customer_id  customer_location  customer_contact_id  customer_contact_location
1            ES                 10                   DE
1            ES                 11                   DE
1            ES                 12                   FR
2            FR                 20                   GB
3            ES                 87                   ES
3            ES                 88                   ES

I need to transpose it so that there is one row per…

python pandas function data-cleaning

asked Mar 01 '23 at 17:17

Kristina

votes

0 answers

Why does my nested loop in R only run the child loop once?

I expected to get an output of cleaned data in 4 sheets, each with 9 columns. The original data looks like this: but the output contains only the first column in the first sheet: my code is as…

r for-loop nested-loops data-cleaning

asked Mar 01 '23 at 03:22

Kira
