Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used in computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
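
As a rough illustration of the four operations named above, here is a minimal standard-library Python sketch. The sample readings, the `-999.0` sentinel for invalid values, the median-absolute-deviation threshold `k`, and the `SYNONYMS` table are all assumptions made for this example, not part of any particular library or dataset.

```python
from statistics import median

# Hypothetical sensor readings: None is a missing sample, -999.0 is an
# assumed sentinel for an invalid reading, 500.0 is an obvious outlier.
readings = [10.0, 11.0, None, 13.0, -999.0, 12.0, 500.0, 11.5]

# Hypothetical mapping of synonymous labels to one canonical label.
SYNONYMS = {"usa": "US", "u.s.": "US", "united states": "US"}


def mark_invalid(values, sentinel=-999.0):
    """Mark sentinel values as unavailable (None)."""
    return [None if v == sentinel else v for v in values]


def remove_outliers(values, k=5.0):
    """Replace values more than k median absolute deviations from the median with None."""
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)
    return [None if v is not None and abs(v - med) > k * mad else v for v in values]


def interpolate(values):
    """Linearly interpolate gaps between known samples; edge gaps stay None."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


def merge_synonyms(labels):
    """Map synonymous labels to a single canonical spelling."""
    return [SYNONYMS.get(label.strip().lower(), label.strip()) for label in labels]


cleaned = interpolate(remove_outliers(mark_invalid(readings)))
countries = merge_synonyms(["USA", " u.s. ", "France"])
print(cleaned)
print(countries)
```

The order matters in this sketch: invalid values are marked first so the sentinel does not distort the median, and interpolation runs last so it fills every gap the earlier steps created.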

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
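
To make the tidy layout concrete, here is a small standard-library Python sketch that reshapes a "wide" table, where year values are spread across column names, into one row per observation. The table contents, the column names, and the `melt` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical wide table: the years are values of a variable, but they
# appear as column names, so each row mixes several observations.
wide = [
    {"country": "A", "1999": 10, "2000": 12},
    {"country": "B", "1999": 7, "2000": 9},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


# Each resulting row is one observation; each key is one variable.
tidy = melt(wide, id_key="country", var_name="year", value_name="cases")
for row in tidy:
    print(row)
```

Data-frame libraries typically offer an equivalent reshaping operation; the hand-written helper is used here only to keep the example free of dependencies.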

3430 questions
0
votes
1 answer

Consolidate multiple columns in CSV into one

I have a CSV that contains 1000 versions of the same df/table (each from different sources) with columns Name and Age. Here is a sample df to illustrate what this CSV looks like: data = [['11', 'Nick', '10', 'Dave', '4', 'Greg'], ['7', 'Nick', '10',…
sdpro19
0
votes
1 answer

Clean a String to convert to int in Pandas

I have a dataframe in pandas with a column named "Score" that I need to rank. However, the column contains numbers like 100 200 300 but also numbers like 500,00 800,00. I need to clean this column before ranking. When I try to convert the string to…
0
votes
1 answer

Complex survey data matching and cleaning

I'm trying to match some super complex personal identifiers across ~6000 survey responses, currently stored in a big Excel doc. In essence, respondents participated at time 1 and 2. They were each required to input their government identifier at…
0
votes
1 answer

How to merge rows that have multiple levels for specific columns in pandas

My data I am working with the following data from the National Centers for Environmental Information (NCEI) - obtained simply by using pandas' read_html(). df =…
bismo
0
votes
0 answers

Cleaning up text files for spark dataframe

I want to create a spark dataframe by reading some text files. However, the text files have some weird formatting. This is one example of the text file: These are the problems I am facing: In the first few lines, there are some headers which…
Mash
0
votes
1 answer

Filter the dataframe preserving the first month/year from top to bottom by variable in R

I have the following dataframe in R: library(dplyr) library(tsibble) library(fpp3) usda <- read.csv("https://raw.githubusercontent.com/rhozon/datasets/master/usda_data_stovwflw.csv", head = TRUE, sep = ";") |> mutate( Dates = case_when( …
0
votes
1 answer

Creating multiple dataframes using a loop in R

I am trying to subset my data but I was attempting to do it in a loop and then store those subsets into different dataframes. I have a dataframe named data with 20000 variables and I wanted to get sub-sets of that data. One way of doing it would be…
thole
0
votes
1 answer

How to transpose/pivot one column into row flipping all the row information after it?

I have weather station data that I have imported with read_csv in RStudio. It's messy; my main concern is the column called "item", which contains variables for weather readings. This needs to be a row/header, with all the time series data also…
0
votes
1 answer

Collapsing multiple observations based on specific parameters in R

I am quite new to R. I have a dataset with 8081 observations for 113 variables. The data was collected in 4 waves (panels), with some individuals being interviewed multiple times. They were sometimes asked the same questions, but some questions were…
ouroboro
0
votes
3 answers

Remove words from list but keep the ones only made up from the list

I have one dataframe containing strings and one list of words that I want to remove from the dataframe. However, I would like to also keep the strings from the df which are entirely made up of words from the list. Here is an…
0
votes
1 answer

F#: CSV cleaning

I'm trying to use F# to clean a CSV dataset. For example, I want to change string values in one column to lower case. I don't know whether it's better to work with the loaded data as CsvProvider Rows or to create some struct and convert these Rows to…
grep
0
votes
1 answer

How to add the two split columns at a chosen position in the dataset

import pandas as pd df[["First Name", "Last Name"]] = df["Full Name"].str.split(' ', 1, expand=True) As you can see in the image, this adds the split columns at the end of the dataset, and I want them to be at the same…
Biny Kan
0
votes
2 answers

R - data cleaning and mutate errors

I have reviewed Error: Problem with mutate() column (...) must be size 15 or 1, not 17192, How to drop columns with column names that contain specific string?, Remove columns that contain a specific word, and associated error troubleshooting. I have…
0
votes
1 answer

Python script to transpose certain column's values into separate columns

I have this data frame customer_id customer_location customer_contact_id customer_contact_location 1 ES 10 DE 1 ES 11 DE 1 ES 12 FR 2 FR 20 GB 3 ES 87 ES 3 ES 88 ES I need to transpose it in a way so there is one row per…
0
votes
0 answers

Why does my nested loop in R only run the child loop once?

I expected to get cleaned data in 4 sheets, each with 9 columns. The original data looks like this: but the output contains only the first column in the first sheet: my code is as…
Kira