Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in, and normalizing, data used in computer programs. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.
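As a rough illustration of the four operations named above, here is a minimal sketch in plain Python. The sample readings, the median-absolute-deviation outlier rule, the thresholds, and the synonym table are all assumptions chosen for the example, not part of the tag description.

```python
import statistics

# Hypothetical sensor readings; None marks a missing sample.
readings = [10.0, 11.0, None, 13.0, 250.0, 12.0, -1.0]

def mark_invalid(values, low=0.0):
    """Mark values below `low` as unavailable (None)."""
    return [None if v is not None and v < low else v for v in values]

def remove_outliers(values, k=3.0):
    """Replace values more than k median absolute deviations from the median with None."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    mad = statistics.median(abs(v - med) for v in present)
    if mad == 0:
        return list(values)  # no spread to judge outliers against
    return [None if v is not None and abs(v - med) > k * mad else v for v in values]

def interpolate(values):
    """Linearly fill interior gaps; leading and trailing gaps stay None."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result

# Hypothetical synonym table: keys are lower-cased, stripped spellings.
SYNONYMS = {"usa": "US", "u.s.": "US", "united states": "US"}

def merge_synonyms(labels, mapping=SYNONYMS):
    """Map known spellings to one canonical label; unknown labels are only stripped."""
    return [mapping.get(label.strip().lower(), label.strip()) for label in labels]

cleaned = interpolate(remove_outliers(mark_invalid(readings)))
countries = merge_synonyms(["USA", "U.S.", "Germany", "united states "])
```

The order matters in this sketch: invalid values and outliers are marked as missing first, so the interpolation step does not use them as endpoints.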

One approach to data cleaning is Wickham's "tidy data" framework (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
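To make the row/column rule concrete, here is a small sketch that reshapes a "wide" table into tidy form using plain Python. The table contents and the `melt` helper are hypothetical; for DataFrames, `pandas.melt` performs the same kind of reshaping.

```python
# Hypothetical "wide" table: one row per country, one column per year.
# The year is a variable hidden in the column names, so the table is not tidy.
wide = [
    {"country": "A", "2019": 5, "2020": 7},
    {"country": "B", "2019": 3, "2020": 4},
]

def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (variable, value) observation row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue
            tidy_rows.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy_rows

# Each resulting row is one observation: a country, a year, and a count.
tidy = melt(wide, id_key="country", var_name="year", value_name="cases")
```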

3430 questions
0
votes
1 answer

Loading and cleaning a very large JSON file

I'm working on an image classification project using the Snapshot Serengeti dataset. The dataset comes with a single very large JSON file (5GB+) that contains four top-level keys. I specifically need the values contained in the "images": [{...},…
0
votes
1 answer

How do you apply a script to all files in a folder?

I have several txt files that I have successfully converted into csv files and I now want to clean them all in the same manner, but my script is having issues reading the file names. First I converted all txt files in my folder of interest into csv…
Adriana
  • 91
  • 8
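One common way to apply the same cleaning step to every file in a folder is to iterate over `Path` objects, which carry the folder along with the file name. The sketch below assumes the cleaning step is "strip whitespace from every cell"; `clean_csv` and `clean_folder` are hypothetical names, not the asker's script.

```python
import csv
from pathlib import Path

def clean_csv(path):
    """Hypothetical cleaning step: strip whitespace from every cell, rewrite in place."""
    with path.open(newline="") as f:
        rows = [[cell.strip() for cell in row] for row in csv.reader(f)]
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)

def clean_folder(folder):
    """Apply clean_csv to every .csv file directly inside `folder`; return the names handled."""
    cleaned = []
    # glob yields full paths, so the files open correctly from any working directory.
    for path in sorted(Path(folder).glob("*.csv")):
        clean_csv(path)
        cleaned.append(path.name)
    return cleaned
```

Calling `clean_folder("my_folder")` (a placeholder path) would rewrite each CSV in that folder and leave other file types untouched.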
0
votes
2 answers

How to rename the columns inside a nested column in PySpark

I have a column product inside which there is a nested column called Color. I want to remove the {} from the column color. I don't want to flatten the column and rename it. I directly want to rename the column or drop the column. |-- product:…
0
votes
1 answer

I have to replace repeated values based on a specific condition; the replacement value depends on the repeated value and changes between rows

Hope you are doing well! I am having some trouble figuring this one out and couldn't find any question that helps me with it. The DB has multiple rows with repeated values, the sale price of real estate properties. Those rows have the total price of…
0
votes
0 answers

How can I change the type of a csv file in DataCleaner?

I'm using DataCleaner from https://datacleaner.github.io/. I have a csv file with the data:
firstname,lastname,phone,email,attendence,date
all,bundy,123,all@bundy.com,false,1-1-2000
After I load the data into DataCleaner and I want to do a boolean…
Edwin
  • 294
  • 2
  • 13
0
votes
2 answers

Python code to remove line breaks in documents is not working

I have multiple Word documents in a directory. I am using python-docx to clean them up. It's a long code, but one small part of it that you'd think would be the easiest is not working. After making some edits, I need to remove all line breaks and…
Leila
  • 182
  • 1
  • 1
  • 8
0
votes
0 answers

How to Remove Inverted Commas from a Power BI Power Query Column?

I am trying to remove inverted commas from a column in Power BI Power Query using the code dataset['Column1'] = dataset['Custom'].apply(remove_quotes), where remove_quotes is defined as def remove_quotes(s): return s.replace('"', '').replace("'",…
0
votes
0 answers

How can I compare two datasets, one before cleaning and the other after cleaning?

The code: def clean(df): df = df.rename(columns={'Pigm. 2 Name': 'pigm_2_name', 'Pigm. 2 [%]': 'pigm_2_g',}) binder_cols = ['bm_1_name', 'bm_2_name', 'bm_2_name'] binder_map = {'At': 'AT', 'EP ': 'EP'} df =…
0
votes
1 answer

How do I replace 0 values of features in a dataset with the median value corresponding to the label?

For my Exploratory Data Analysis project, the dataset looks as follows: (image of dataset for reference; link to GitHub repository for dataset). The features of my dataset…
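A minimal sketch of the idea in plain Python: compute the median of the non-zero values per label, then substitute it for each zero. The rows and the `value` and `label` column names are made up for illustration, since the real dataset is not shown here.

```python
from statistics import median

# Hypothetical rows; 0 stands for a missing measurement.
rows = [
    {"label": 0, "value": 90},
    {"label": 0, "value": 0},
    {"label": 0, "value": 110},
    {"label": 1, "value": 150},
    {"label": 1, "value": 0},
    {"label": 1, "value": 170},
]

def fill_zeros_with_group_median(rows, feature, label_key="label"):
    """Replace zeros in `feature` with the median of the non-zero values sharing the row's label."""
    by_label = {}
    for row in rows:
        if row[feature] != 0:
            by_label.setdefault(row[label_key], []).append(row[feature])
    medians = {label: median(values) for label, values in by_label.items()}
    filled = []
    for row in rows:
        new_row = dict(row)
        if row[feature] == 0:
            # If a label has no non-zero values at all, the zero is kept.
            new_row[feature] = medians.get(row[label_key], row[feature])
        filled.append(new_row)
    return filled

filled = fill_zeros_with_group_median(rows, "value")
```

Excluding the zeros before taking the median is a design choice in this sketch; including them would pull the replacement value toward zero.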
0
votes
0 answers

Disabled Jenkins job date

I want to know the API to get the date on which a particular job was disabled in Jenkins. I have a list of disabled jobs, and I will delete jobs based on when they were disabled.
0
votes
1 answer

When doing data cleaning, how do I unify different year formats?

I'm doing data cleaning and found there are different formats in the year column, e.g. 2011, 2012-2013, 2010-14. How can I correct these and show only the latest year in each cell, i.e. 2011, 2013, 2014? I tried the code below. It works for…
libraG
  • 1
  • 2
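For the three formats mentioned in the question (2011, 2012-2013, 2010-14), one possible approach is to pull out the digit groups and expand any two-digit year with the century of the first year. `latest_year` is a hypothetical helper, and the sketch assumes a range never crosses a century boundary.

```python
import re

def latest_year(text):
    """Return the latest year found in strings like '2011', '2012-2013' or '2010-14'."""
    parts = re.findall(r"\d+", text)
    if not parts:
        return None  # nothing that looks like a year
    first = parts[0]
    years = []
    for part in parts:
        if len(part) == 2 and len(first) == 4:
            # Expand '14' to '2014' using the century of the first year.
            # Assumption: the range stays within one century.
            part = first[:2] + part
        years.append(int(part))
    return max(years)
```

Applied to each cell of the year column, this would map the examples from the question to 2011, 2013, and 2014.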
0
votes
1 answer

How to remove HTML line breaks <br>?

I have a dataset of web scraped reviews and unfortunately they contain a lot of <br> tags, so after I clean the data (remove stopwords etc.), a lot of single "br" remain in the dataset. I would like to remove these line breaks as well as some…
TobiP
  • 1
  • 1
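A stray "br" token usually suggests the angle brackets were stripped before the tag itself was. One possible fix is to remove the whole tag with a regular expression before any other cleaning step. `strip_br` is a hypothetical helper, and the pattern only targets line-break tags, not HTML in general.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_br(text):
    """Replace line-break tags with a space, then collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()
```

Replacing the tag with a space rather than an empty string keeps the words on either side of the break from being glued together.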
0
votes
3 answers

How do I remove special characters found on column names in R

I have a dataframe that has a mix of numeric and character variables. I have scaled the numeric columns in my dataframe. This has now resulted in special characters being added to my numeric column names. I want to remove those special characters…
thole
  • 117
  • 6
0
votes
1 answer

Removing special characters from column names in R

I have a dataset where some of the columns have special characters. I want to clean this dataset and remove these special characters from all the columns that have them. A subset of the column names is…
thole
  • 117
  • 6
0
votes
0 answers

How do I remove the <p> tag while cleaning rss xml data?

I am cleaning rss feed data that I pulled using feedparser. I managed to remove all special characters but I am unable to remove the "p" from the tag <p>. How can I remove this? I tried this code: def clean_text(text): return…

Libra
  • 1
  • 1
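If special characters are removed first, the angle brackets disappear and the "p" is left behind, so one option is to strip the markup before any character filtering. The sketch below uses the standard library's `html.parser` to keep only text content; `TextOnly` and `strip_tags` are hypothetical names and not the asker's `clean_text` function.

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collects text content and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(markup):
    """Return the text of `markup` with tags removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Running `strip_tags` on each feed entry's summary before the special-character step should leave no tag names behind, and entities such as `&amp;` are decoded along the way.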