Questions tagged [data-cleaning]

Data cleaning is the process of removing or repairing errors in data used by computer programs, and of normalizing that data. For example, outliers may be removed, missing samples may be interpolated, invalid values may be marked as unavailable, and synonymous values may be merged.

One approach to data cleaning is the "tidy data" framework from Wickham (http://vita.had.co.nz/papers/tidy-data.pdf), in which each row is an observation and each column is a variable.
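
As a rough illustration of the four operations named in the tag description, here is a minimal sketch in plain Python. The sample records, the `-999` sentinel, the synonym table, and the deviation threshold are all assumptions made up for this example; real data will need its own rules.

```python
from statistics import median

# Hypothetical raw records: synonymous city names, a gap (None),
# an assumed "invalid" sentinel (-999), and one implausible spike.
raw = [
    {"city": "NYC", "temp": 21.0},
    {"city": "New York", "temp": None},
    {"city": "new york", "temp": 23.0},
    {"city": "NYC", "temp": -999},
    {"city": "New York", "temp": 22.0},
    {"city": "NYC", "temp": 480.0},
]

SYNONYMS = {"nyc": "New York", "new york": "New York"}  # assumed lookup table
INVALID = -999          # assumed sentinel for an invalid reading
MAX_DEVIATION = 50.0    # assumed outlier threshold around the median


def interpolate(values):
    """Linearly fill None entries that have a known value on both sides."""
    out = list(values)
    for i, v in enumerate(out):
        if v is None:
            lo = next((j for j in range(i - 1, -1, -1) if out[j] is not None), None)
            hi = next((j for j in range(i + 1, len(out)) if out[j] is not None), None)
            if lo is not None and hi is not None:
                out[i] = out[lo] + (out[hi] - out[lo]) * (i - lo) / (hi - lo)
    return out


def clean(records):
    rows = []
    for rec in records:
        # Merge synonymous values and mark invalid values as unavailable.
        city = SYNONYMS.get(rec["city"].strip().lower(), rec["city"])
        temp = None if rec["temp"] == INVALID else rec["temp"]
        rows.append({"city": city, "temp": temp})

    # Remove outliers: drop readings far from the median of the known values.
    center = median(r["temp"] for r in rows if r["temp"] is not None)
    rows = [r for r in rows
            if r["temp"] is None or abs(r["temp"] - center) <= MAX_DEVIATION]

    # Interpolate the samples that are still missing.
    temps = interpolate([r["temp"] for r in rows])
    return [{"city": r["city"], "temp": t} for r, t in zip(rows, temps)]


cleaned = clean(raw)
```

Each element of `cleaned` is one observation and each key is one variable, which is the row and column layout the tidy data description refers to.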

3430 questions
0
votes
1 answer

How can I remove special characters from a string column in Python (e.g. '$9.99', '@10.99', '#13.99') without moving the decimal point?

I am working on a data cleaning exercise where I need to remove special characters like '$#@' from the 'price' column, which is of object type (string). After that, I need to convert it to float type. However, the decimal point position changes when…
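
The excerpt is cut off before it shows the failing code, so the cause is unknown. One common cause of a shifted decimal point is a pattern that strips the `.` along with the other symbols. A minimal sketch that keeps digits, the decimal point, and a minus sign (`parse_price` is a hypothetical helper name):

```python
import re


def parse_price(text):
    """Strip everything except digits, '.', and '-' and convert to float."""
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)


prices = ["$9.99", "@10.99", "#13.99"]
values = [parse_price(p) for p in prices]
```

On a pandas string column, the same pattern can usually be applied with `Series.str.replace(..., regex=True)` followed by `astype(float)`, assuming every cleaned value is a valid number.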
0
votes
1 answer

Recoding same variable across multiple data frames

I want to create a simplified way of recoding the same variable (the same way) across multiple data frames. For example, right now I'm re-coding an age variable from state datasets FL and GA. I'm currently coding them separately. How can I condense…
0
votes
0 answers

Pandas DataFrame rows have an unusual pattern of characters within strings

I am Talha and new to the data science community. Today I worked on a data set and found something challenging to resolve: in my data set there is a pattern of strings that contain "????". I am trying to remove them by comparison and isnull…
Talha Iqbal
  • 99
  • 1
  • 8
0
votes
1 answer

Updating an old dataframe with conditional matching of different columns and adding new rows in pandas

I have an old dataframe with the following columns and a lot of rows. It looks like this: >old_df date/time Name detect_ID category ID 12/1/2023 XXX 1 B 1400 12/1/2023 XXY 1,3,7 B 1402 12/1/2023 XXY 4 …
Jewel_R
  • 126
  • 2
  • 17
0
votes
0 answers

Bar chart in Python JupyterLab not plotting, throwing errors instead

I am working on a data analytics project and want to create a bar graph. On the dataset, I have already used the groupby function to group the Month column in order to find the highest sales in a particular month, using the following function…
0
votes
0 answers

Create ID variable for groups identified with any crossover between two variables

I have scraped Google Maps data of businesses with many, many duplicates of both phone numbers and URLs. I need to create a variable that identifies groups where there is any overlap in phone number or URL, going both ways. Any URLs that share a phone…
SDYockey
  • 23
  • 3
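
Grouping records that overlap on either of two keys, in both directions, is a connected-components problem. A minimal union-find sketch, assuming each record is a `(phone, url)` tuple and that `None` marks a missing value (the sample data and the function name `assign_group_ids` are made up for illustration):

```python
def assign_group_ids(records):
    """Return one group ID per record; records sharing a phone or URL,
    directly or through a chain of other records, get the same ID."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # (kind, value) -> index of the first record with it
    for idx, (phone, url) in enumerate(records):
        for key in (("phone", phone), ("url", url)):
            if key[1] is None:
                continue  # missing values never link records
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    # Relabel roots as consecutive IDs starting at 1, in order of appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(records))]


records = [
    ("555-0001", "a.example"),
    ("555-0001", "b.example"),   # shares a phone with record 0
    ("555-0002", "b.example"),   # shares a URL with record 1
    ("555-0003", "c.example"),
    (None, "c.example"),         # shares a URL with record 3
]
group_ids = assign_group_ids(records)
```

The first three records end up in one group even though the first and third share neither value directly, which is the "going both ways" behavior the question describes.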
0
votes
2 answers

Python drop duplicated pairs only

If I have a dataframe like this: Time X Y 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 10.1 2023-02-01T15:03:02.565333 200 …
des224
  • 119
  • 7
0
votes
1 answer

R dataframe/ lapply(): get rid of rows with particular values in columns containing particular strings, while keeping everything else?

I have 16 dataframes in R that I am trying to quality check by deleting poor-quality rows. I already know of lapply() and have used it for simpler wrangling problems to apply the same thing to all my dataframes at once, but for whatever reason I'm having…
abby23
  • 3
  • 1
0
votes
3 answers

How can I get this ASCII text file into a usable data format?

I want to use NIBRS' "master file download" for arrests in 2021. However, this data comes in an ASCII text file that I do not know how to convert into a usable dataset. It seems like, from the help file, certain positions of the long number string…
leecarvallo
  • 171
  • 4
0
votes
1 answer

How to append the results found in a for loop and if statement

I am practicing looping over the values found in a particular column (Z_SCORE). However, I now want to append the result of each iteration to a data frame. Can I get some assistance on how to do that? The new data frame…
thole
  • 117
  • 6
0
votes
1 answer

Replace missing rows of csv data

I have an 80,000-row CSV file made up of four columns: ID, Date, Time and Flow. If flow data is ever missing, the missing rows are skipped until new flow data is recorded, and then recording continues. Flow measurements are taken every 15…
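
One way to restore skipped rows is to walk the sorted timestamps and insert a placeholder wherever the expected step was missed. The sketch below is an assumption-heavy illustration: the excerpt is truncated after "15", so the 15-minute default step is a guess, and the `(timestamp, flow)` tuple layout and the name `fill_gaps` are hypothetical.

```python
from datetime import datetime, timedelta


def fill_gaps(rows, step=timedelta(minutes=15)):
    """rows: list of (timestamp, flow) sorted by timestamp.
    Returns a new list with (timestamp, None) inserted for each skipped step."""
    filled = []
    for ts, flow in rows:
        if filled:
            expected = filled[-1][0] + step
            while expected < ts:
                filled.append((expected, None))  # placeholder for a missing row
                expected += step
        filled.append((ts, flow))
    return filled


rows = [
    (datetime(2023, 1, 1, 0, 0), 1.2),
    (datetime(2023, 1, 1, 0, 15), 1.3),
    (datetime(2023, 1, 1, 1, 0), 1.1),   # 00:30 and 00:45 were skipped
]
filled = fill_gaps(rows)
```

The `None` placeholders can then be left as missing or interpolated, depending on what the analysis needs.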
0
votes
1 answer

How to keep blank value when appending text file line-by-line into array with line.split()?

I am working with a text file that looks like this. I am trying to append each line to an array and turn it into a clean dataframe. I used line.split() for the lines being appended to the array, but values in COL J would disappear for some rows when…
lenoob
  • 1
0
votes
1 answer

SQL: exclude certain rows in a messy dataset

I am cleaning a dataset using the duckdb package in Python. The code is as follows: dt = db.query( """ select * from dt_orig where (A != '') and (B != '' or B != 'Unknown' or B != 'Undefined') and (C != '') and …
Sophia
  • 377
  • 1
  • 12
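
The excerpt is truncated, so the actual problem being asked about is not visible. One thing that stands out in the quoted predicate, though, is that chaining `!=` tests with `or` excludes nothing, because any single value differs from at least two of the three literals. The plain-Python sketch below mirrors that logic; the function names are hypothetical.

```python
def keep_with_or(b):
    # Mirrors: B != '' or B != 'Unknown' or B != 'Undefined'
    return b != "" or b != "Unknown" or b != "Undefined"


def keep_with_and(b):
    # Mirrors: B != '' and B != 'Unknown' and B != 'Undefined'
    return b != "" and b != "Unknown" and b != "Undefined"


samples = ["", "Unknown", "Undefined", "Retail"]
kept_or = [b for b in samples if keep_with_or(b)]
kept_and = [b for b in samples if keep_with_and(b)]
```

In SQL, the `and` form is commonly written as `B NOT IN ('', 'Unknown', 'Undefined')`.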
0
votes
0 answers

How to check if a value is on a list, and if it is change another value on the same row

Basically, I am trying to do some data cleaning. I am working with a data set from a bike-sharing company that lists every trip made on its system. There are columns for station name and station id for every ride, but in many rows the station_name is…
0
votes
2 answers

Fixing IndexingError to clean the data

I'm trying to identify outliers in each housing type category, but I'm encountering an issue. Whenever I run the code, I receive the following error: "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the…