Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
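
As a rough illustration of the tidy layout described above, the following sketch reshapes a small "wide" table with pandas. The table, its column names, and its values are made up for the example; they do not come from Wickham's paper or from any question listed here.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "2021": [10, 20],
    "2022": [11, 21],
})

# melt() turns the year columns into (year, value) pairs, so that
# each row is one observation and each column is one variable.
tidy = wide.melt(id_vars="city", var_name="year", value_name="value")

# The year arrives as a string (it was a column label); store it as a number.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

The reshaped frame has one row per (city, year) pair, which is usually the easier shape for grouping, filtering, and plotting.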

3430 questions

1 answer

How can I remove special characters in Python from a string column with values like ('$9.99', '@10.99', '#13.99'), without moving the decimal point?

I am working on a data cleaning exercise where I need to remove special characters like '$#@' from the 'price' column, which is of object type (string). After that, I need to convert it to float type. However, the decimal point position changes when…

asked Feb 28 '23 at 02:04

Simon Levi
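
A minimal sketch of one way the task in this question could be approached with pandas, assuming the column is named 'price' as in the excerpt and that every value contains at most one decimal point. The sample values are the three from the title; the 'price_clean' column name is made up for the example, and since the question text is truncated this is not necessarily the asker's code or the accepted answer.

```python
import pandas as pd

# Sample values taken from the question title.
df = pd.DataFrame({"price": ["$9.99", "@10.99", "#13.99"]})

# Drop every character that is not a digit or a decimal point.
# Keeping "." in the allowed set is what preserves the decimal position.
# Note: this would also strip a leading minus sign, so it only suits
# non-negative prices (an assumption of this sketch).
cleaned = df["price"].str.replace(r"[^0-9.]", "", regex=True)

# Convert the cleaned strings to floats.
df["price_clean"] = cleaned.astype(float)

print(df)
```

Removing "every non-digit character" without exempting the dot is a common way for the decimal point to shift, because '9.99' then becomes '999'.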

1 answer

Recoding same variable across multiple data frames

I want to create a simplified way of recoding the same variable (the same way) across multiple data frames. For example, right now I'm re-coding an age variable from state datasets FL and GA. I'm currently coding them separately. How can I condense…

r dplyr data-cleaning recode

asked Feb 27 '23 at 17:03

Still learning

0 answers

Pandas DataFrame rows have an unusual pattern of characters in rows and within strings

I am Talha and new to the data science community. Today I worked on a data set and found something challenging to resolve: in my data set there is a pattern of strings that contain "????". I am trying to remove them by comparison and isnull…

python pandas dataframe data-cleaning

asked Feb 27 '23 at 12:13

Talha Iqbal

1 answer

Updating an old dataframe with conditional matching of different columns and adding new rows in pandas

I have an old dataframe with the following columns and a lot of rows; it looks like this: >old_df date/time Name detect_ID category ID 12/1/2023 XXX 1 B 1400 12/1/2023 XXY 1,3,7 B 1402 12/1/2023 XXY 4 …

python pandas dataframe data-analysis data-cleaning

asked Feb 27 '23 at 04:14

Jewel_R

0 answers

Bar Chart in Python Jupyter Lab Not Plotting Instead Throwing Errors

So, I am working on a data analytics project and I want to create a bar graph. On the dataset, I have already used the groupby function to group the Month column in order to find the highest sales in a particular month, using the following function…

python-3.x pandas visualization data-cleaning

asked Feb 26 '23 at 07:41

Quadrry

0 answers

Create ID variable for groups identified with any crossover between two variables

I have scraped Google Maps data of businesses with many, many duplicates of both phone numbers and URLs. I need to create a variable that identifies groups where there is any overlap in phone number or URL, going both ways. Any URLs that share a phone…

r data-cleaning

asked Feb 25 '23 at 04:33

SDYockey

2 answers

Python drop duplicated pairs only

If I have a dataframe like this: Time X Y 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 …

python pandas dataframe duplicates data-cleaning

asked Feb 24 '23 at 06:36

des224

1 answer

R dataframe/ lapply(): get rid of rows with particular values in columns containing particular strings, while keeping everything else?

I have 16 dataframes I am trying to quality check and delete poor quality rows in R. I already know of lapply() and have used it for simpler wrangling problems to apply the same thing to all my dataframes at once, but for whatever reason I'm having…

r lapply data-cleaning data-wrangling grepl

asked Feb 23 '23 at 18:35

abby23

3 answers

How can I get this ASCII text file into a usable data format?

I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string…

r ascii stata data-cleaning

asked Feb 23 '23 at 14:53

leecarvallo

1 answer

How to append the results found in a for loop and if statement

I am practicing looping over the values found in a particular column (Z_SCORE). However, I now want to append the result of each iteration to a data frame. Can I get some assistance on how to do that? The new data frame…

r dataframe for-loop if-statement data-cleaning

asked Feb 23 '23 at 09:57

thole

1 answer

Replace missing rows of csv data

I have an 80,000-row CSV file made up of four columns: ID, Date, Time and Flow. If flow data is ever missing, the missing data is skipped over until new flow data is recorded, and then recording continues. Flow measurements are taken every 15…

python csv data-cleaning

asked Feb 21 '23 at 21:23

Andrew_Weather

1 answer

How to keep blank value when appending text file line-by-line into array with line.split()?

I am working with a text file that looks like this. I am trying to append each line to an array and turn it into a clean dataframe. I used line.split() for the lines being appended to the array, but values in COL J would disappear for some rows when…

python arrays data-cleaning txt

asked Feb 21 '23 at 16:41

lenoob

1 answer

SQL: exclude certain rows in a messy dataset

I am cleaning a dataset using the duckdb package in Python. The code is as follows: dt = db.query( """ select * from dt_orig where (A != '') and (B != '' or B != 'Unknown' or B != 'Undefined') and (C != '') and …

sql data-cleaning

asked Feb 20 '23 at 04:41

Sophia
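
The question text is truncated, so the actual problem is not known. One thing worth noting about the visible part of the query is that inequalities joined with `or` do not exclude anything: any non-NULL value differs from at least one of the listed literals. The sketch below demonstrates this using Python's built-in sqlite3 module rather than duckdb (a substitution made only to keep the example self-contained), with a made-up four-row table that reuses the `dt_orig`, `A`, and `B` names from the excerpt.

```python
import sqlite3

# In-memory stand-in for the dataset; the rows are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dt_orig (A TEXT, B TEXT)")
con.executemany(
    "INSERT INTO dt_orig VALUES (?, ?)",
    [("a1", "ok"), ("a2", "Unknown"), ("a3", ""), ("a4", "Undefined")],
)

# OR-chained inequalities: every non-NULL B differs from at least one
# of the three literals, so no row is filtered out.
with_or = con.execute(
    "SELECT A FROM dt_orig "
    "WHERE B != '' OR B != 'Unknown' OR B != 'Undefined' "
    "ORDER BY A"
).fetchall()

# NOT IN (equivalent to AND-chained inequalities) excludes all three values.
with_not_in = con.execute(
    "SELECT A FROM dt_orig "
    "WHERE B NOT IN ('', 'Unknown', 'Undefined') "
    "ORDER BY A"
).fetchall()

print(with_or)      # all four rows
print(with_not_in)  # only the row whose B is a real value
con.close()
```

Rows where B is NULL are dropped by both forms, since a comparison with NULL is never true; they need an explicit `IS NULL` or `IS NOT NULL` check if they matter.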

0 answers

How to check if a value is on a list, and if it is change another value on the same row

Basically, I am trying to do some data cleaning. I am working with a data set from a bike-sharing company that lists every trip made on its system. There are columns for station name and station id for every ride, but in many rows the station_name is…

r dataframe data-cleaning

asked Feb 18 '23 at 01:10

Francisco Comparatore

2 answers

Fixing IndexingError to clean the data

I'm trying to identify outliers in each housing type category, but I am encountering an issue. Whenever I run the code, I receive the following error: "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the…

python pandas nan data-cleaning

asked Feb 17 '23 at 06:31

Omarov Alen
