Questions tagged [data-scrubbing]

The process of detecting and correcting (or removing) corrupt or inaccurate records from a data set

Data cleansing, data cleaning, or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the errant data.

http://en.wikipedia.org/wiki/Data_cleansing
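As a rough illustration of the definition above, the following is a minimal sketch of a cleansing pass over a small record set. The field names (`name`, `age`), the plausibility range, and the `clean_record` helper are all hypothetical choices made for this example; real rules depend on the data set.

```python
from typing import Optional


def clean_record(record: dict) -> Optional[dict]:
    """Return a corrected copy of the record, or None if it should be removed."""
    name = (record.get("name") or "").strip()
    if not name:
        # Incomplete: no usable name, so the record is removed.
        return None

    age = record.get("age")
    try:
        age = int(age)
    except (TypeError, ValueError):
        # Incorrect: not a number, so the value is replaced with a missing marker.
        age = None

    if age is not None and not 0 <= age <= 130:
        # Inaccurate: outside the (assumed) plausible range.
        age = None

    return {"name": name.title(), "age": age}


raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "", "age": "51"},
    {"name": "bob", "age": "n/a"},
    {"name": "carol", "age": 999},
]

# Keep corrected records, drop the ones flagged for removal.
cleaned = [r for r in (clean_record(x) for x in raw) if r is not None]
print(cleaned)
```

The sketch covers the three outcomes named in the description: a value is modified (whitespace and casing), replaced (an unusable age becomes `None`), or the whole record is deleted.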

65 questions
0
votes
1 answer

SQL Server query to return percentage of null content in a table's fields

I'm looking to scrub data in a migration project from a legacy system developed on SQL Server 2005, but the first order of business is to figure out which columns aren't really in use. The general logic behind my approach is to identify columns that…
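One way to approach the idea in the title is to compare `COUNT(*)` with `COUNT(column)`, since the latter skips NULLs. The sketch below is not SQL Server code: it uses Python's built-in `sqlite3` module as a stand-in so that it runs anywhere, and the `legacy` table, its columns, and the `null_percentages` helper are invented for illustration. On SQL Server the column list would come from `INFORMATION_SCHEMA.COLUMNS` rather than the SQLite-specific `PRAGMA table_info`.

```python
import sqlite3


def null_percentages(conn: sqlite3.Connection, table: str) -> dict:
    """Return {column_name: percentage of rows where the column is NULL}."""
    # PRAGMA table_info is SQLite-specific; row[1] is the column name.
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

    result = {}
    for col in columns:
        # COUNT(col) counts only non-NULL values.
        non_null = conn.execute(
            f'SELECT COUNT("{col}") FROM "{table}"'
        ).fetchone()[0]
        result[col] = 100.0 * (total - non_null) / total if total else 0.0
    return result


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legacy (id INTEGER, phone TEXT, fax TEXT)")
conn.executemany(
    "INSERT INTO legacy VALUES (?, ?, ?)",
    [
        (1, "555-0100", None),
        (2, None, None),
        (3, "555-0101", None),
        (4, "555-0102", "555-0199"),
    ],
)

pcts = null_percentages(conn, "legacy")
print(pcts)
```

Columns with a percentage close to 100 are candidates for "not really in use", though empty strings and placeholder values would need a separate check.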
0
votes
2 answers

Open source projects for email scrubbing generating structured data from unstructured source?

I don't know where to start on this one, so hopefully you guys can clear up my question. I have a project where email will be searched for specific words/patterns and the results stored in a structured manner, similar to what TripIt does. The article…
dev.e.loper
0
votes
0 answers

How can you use R to clean an Excel (XLS) data source?

I have an Excel file that has this shape. I need to format this data so that it looks like this. In the second image, the second column is just the month written as a number, and the concatenation of month and year is the third. The last two are…
0
votes
0 answers

Data scrubbing and filtering a Dataframe down by conditions and then replacing columns with a string

I'm trying to filter down a dataframe in a large dataset and use a replace function to fill in NaN column values. I've written the code below but don't believe I'm structuring it correctly. Any help would be greatly appreciated. Thanks in…
BPD
0
votes
1 answer

Fill missing values in dataframe in R

I have the following problem: I have a dataframe with several columns. (See below) I am trying to fill in missing values. Concretely, I only want to fill in values when I have a datapoint before and one after the missing value and when they are…
Max
0
votes
2 answers

Python/ Beautiful Soup Data Displaying Issue

I am trying to pull some data from a website. When I check the data that I pulled with BeautifulSoup (using print(soup) in the code below), it does not look right. It is different from what I see when I check with view-source:URL. I am unable to find the…
marista
0
votes
1 answer

Find if integer exists within a list of ranges

Given an array N of 1,000,000 unique integers ranging from 0 to 1,999,999, what is the fastest way to filter out integers that do not exist within any range inside of M, where M is a fixed group of 10 random ranges each with integers ranging from 0…
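A minimal sketch of one possible approach to this kind of filtering, not necessarily the fastest: merge the ranges once, then binary-search each integer against the sorted range starts. It assumes that ranges are inclusive `(low, high)` pairs, which the excerpt does not state; the function names and sample data are made up for illustration.

```python
from bisect import bisect_right


def merge_ranges(ranges):
    """Merge overlapping or adjacent inclusive (low, high) ranges."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return merged


def filter_in_ranges(values, ranges):
    """Keep only the values that fall inside at least one range."""
    merged = merge_ranges(ranges)
    starts = [lo for lo, _ in merged]
    kept = []
    for v in values:
        # Index of the last merged range whose start is <= v, or -1 if none.
        i = bisect_right(starts, v) - 1
        if i >= 0 and v <= merged[i][1]:
            kept.append(v)
    return kept


values = [3, 10, 15, 22, 40, 99]
ranges = [(8, 12), (11, 20), (35, 40)]
kept = filter_in_ranges(values, ranges)
print(kept)
```

With only 10 ranges a plain linear scan per value would also be cheap; the merge-and-bisect version mainly pays off as the number of ranges grows.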
0
votes
3 answers

Fastest String Filtering Algorithm

I have 5,000,000 unordered strings formatted this way (Name.Name.Day-Month-Year 24hrTime): "John.Howard.12-11-2020 13:14" "Diane.Barry.29-07-2020 20:50" "Joseph.Ferns.08-05-2020 08:02" "Joseph.Ferns.02-03-2020 05:09" "Josephine.Fernie.01-01-2020…
0
votes
1 answer

ceph pg repair doesn't start right away

Every now and then I get a single pg inconsistency error on my cluster. As suggested by the docs, I run ceph pg repair pg.id, and the command prints "instructing pg x on osd y to repair", which seems to be working as intended. However it doesn't start right…
Nyquillus
0
votes
1 answer

Python pandas if column value is list then create new column(s) with individual list value

I'm using pandas to create a dataframe from a SaaS REST API JSON response and hitting a minor blocker in cleansing the data for visualization and analysis. I need to tweak the Python script by adding a conditional function to say if the value is in a…
0
votes
0 answers

How to scrape a specific URL from multiple URLs in a webpage with Java

I am doing data scraping for the first time. My assignment is to get a specific URL from a webpage that contains multiple links (help, click here, etc.). How can I get the specific URL and ignore the other links? In this link I only want to get The SEC adopted…
0
votes
2 answers

Clean data to be imported into Neo4J database

I am a Neo4j and data analytics noob here. I am looking for a programmatic way to format data that I collect from Active Directory so that it is ready to be imported into Neo4j. Right now, I am using PowerBI and DAX Studios to clean the data the…
POSH Geek
0
votes
1 answer

Importing data from website with subscription

The website I import data from is now subscription-based. I have a subscription, but the HTML import function doesn't pull data. =IMPORTHTML("https://www.footballoutsiders.com/premium/defense-vs-receivers?year=2018&offense_defense=offense","table",1) I…
Ricky
0
votes
1 answer

Data Scrub off Interactive Map - Cal Fire related

http://calfire-forestry.maps.arcgis.com/apps/webappviewer/index.html?id=5306cc8cf38c4252830a38d467d33728&extent=-13547810.5486%2C4824920.1673%2C-13518764.4778%2C4841526.1117%2C102100 How can I scrub the locations off this? I don't need addresses, just…
James
0
votes
0 answers

ZFS without ECC: how would checksums work?

I have read a lot about ZFS with and without ECC, and there are quite a few opinions online. I still have doubts that I could not resolve by reading the available documents. Suppose I have two mirrored disks and ZFS (no ECC in my system); let's…
Tiutto