Questions tagged [data-extraction]

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration). The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, spool files, etc. Extracting data from these sources has grown into a considerable technical challenge: whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with these unstructured data sources and with different software formats. The growing practice of extracting data from the web is referred to as web scraping.

The act of adding structure to unstructured data takes a number of forms:

  • Using text pattern matching, such as regular expressions, to identify small or large-scale structure, e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain, e.g. in emailed resumes, identifying skills, previous work experience, qualifications etc. using a standard set of commonly used headings (these differ from language to language); for example, education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information.
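
The first two approaches in the list above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the report layout, the `parse_report` and `split_sections` helpers, and the `SECTION_ALIASES` table are all assumptions made up for the example.

```python
import re

# Hypothetical report layout (an assumption for this sketch):
# one header line, fixed-order record lines, one footer line.
REPORT = """\
SALES REPORT  Region: North  Date: 2020-01-31
1001  Widget      12   3.50
1002  Gadget       4  10.00
END OF REPORT  Records: 2
"""

HEADER_RE = re.compile(r"Region:\s*(?P<region>\w+)\s+Date:\s*(?P<date>\d{4}-\d{2}-\d{2})")
RECORD_RE = re.compile(
    r"^(?P<id>\d+)\s+(?P<item>\w+)\s+(?P<qty>\d+)\s+(?P<price>\d+\.\d{2})$",
    re.MULTILINE,
)
FOOTER_RE = re.compile(r"Records:\s*(?P<count>\d+)")


def parse_report(text):
    """Pattern matching: pull records and attach header data to each one."""
    header = HEADER_RE.search(text)
    footer = FOOTER_RE.search(text)
    if header is None or footer is None:
        raise ValueError("header or footer not found")
    records = [
        {
            "id": m["id"],
            "item": m["item"],
            "qty": int(m["qty"]),
            "price": float(m["price"]),
            "region": header["region"],
            "date": header["date"],
        }
        for m in RECORD_RE.finditer(text)
    ]
    # Use the footer count as a sanity check on the extraction.
    if len(records) != int(footer["count"]):
        raise ValueError("record count does not match footer")
    return records


# Hypothetical heading table (an assumption): canonical section -> known headings.
SECTION_ALIASES = {
    "education": {"education", "qualification", "qualifications", "courses"},
    "experience": {"experience", "previous work experience", "employment"},
    "skills": {"skills"},
}


def split_sections(text):
    """Table-based approach: group lines under the canonical section of their heading."""
    lookup = {alias: name for name, aliases in SECTION_ALIASES.items() for alias in aliases}
    sections = {}
    current = None
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in lookup:
            current = lookup[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections
```

Calling `parse_report(REPORT)` yields one dictionary per record line, each carrying the region and date from the header. `split_sections` would need a separate alias table per language, as the list above notes.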
939 questions
5
votes
1 answer

Opening .gdb database files

I'm trying to open an old InterBase .gdb file. This is new to me and I don't know where to start, so any advice would be a great help. I've been searching the internet for the past few days and I still have no idea how to do so.
Leo Elvin Lee
  • 189
  • 1
  • 6
  • 16
5
votes
1 answer

imacros extraction from a range of data

Hi, here is what my page looks like
Beamer
Michal K
  • 245
  • 2
  • 9
  • 17
5
votes
1 answer

Extract p-value from Kruskal-Wallis output

Let's say I have a dataframe
> col1<-c(1,5,2,6,8,1,3,8,9,1,8)
> col2<-c(1,2,1,1,2,2,1,2,2,1,1)
> df<-data.frame(col1,col2)
> df
  col1 col2
1    1    1
2    5    2
3    2    1
4    6    1
5    8    2
6    1    2
7    3    1
8    8    2
9 …
Olli J
  • 649
  • 2
  • 7
  • 23
5
votes
0 answers

Tika 1.1 Performance Improvement

I am using Tika 1.1 and it takes a long time to extract content from files: extracting a 1 MB PDF/DOC file takes around 3 seconds. Is there any way to improve performance? Any tuning, configuration which…
Chetan Laddha
  • 993
  • 8
  • 22
5
votes
1 answer

DOMXPath var_dump: "(object value omitted)"

$store = curl_exec($ch); // Returns a page of HTML
$doc = new DOMDocument();
$doc->loadHTML($store);
$xpath = new DOMXpath($doc);
Vardump $xpath:
object(DOMXPath)#2 (1) { ["document"] => string(22) "(object value omitted)" }
What is wrong…
CodeGuru
  • 3,645
  • 14
  • 55
  • 99
4
votes
1 answer

Extract text from borderless table from an image in Python

I am new to OpenCV and need help extracting text from a borderless table in an image. I need to extract the text from the image below and put the information in a data frame. Expected output
4
votes
1 answer

Extract value from output and send to next task

I am trying to define a template in Ansible Tower, where I want to extract the id for the Active Controller in Kafka Broker and then use this value in another template / task that will perform the rolling restart but will make sure the active…
adbdkb
  • 1,897
  • 6
  • 37
  • 66
4
votes
2 answers

How to extract files from saz file?

I exported a session from Fiddler to saz files. This session includes only jpg files and I'm wondering - how can I extract the jpg files from saz quickly and easily? Thanks!
Yanirmr
  • 923
  • 8
  • 25
4
votes
1 answer

How can I extract multiple .zip files?

I'm trying to extract multiple files from some .zip archives. My code is:
import os
import zipfile
os.chdir('/home/marlon/Shift One/Projeto Philips/Consolidação de Arquivos')
for f in os.listdir("/home/marlon/Shift One/Projeto Philips/Consolidação…
4
votes
2 answers

corpus extraction with changing data type R

I have a corpus of plain-text files. I want to extract the n-grams from the texts and save each one with its original file name in matrices of 3 columns. library(tokenizer) myTokenizer <- function(x, n, n_min)…
user10107509
4
votes
1 answer

How to digitize (extract data from) a heat map image using Python?

There are several packages available to digitize line graphs, e.g. GetData Graph Digitizer. However, I could not find any packages or programs for the digitization of heat maps. I want to digitize the heat map (images in png or jpg format) using…
Neeraj Hanumante
  • 1,575
  • 2
  • 18
  • 37
4
votes
0 answers

Reading DWG file in Python and extracting edge points

I have a DWG file in which I have a rectangle with several lines in it (e.g. floor plan with interior walls). How can I use Python to extract the edges (X,Y Coordinates)? I need to extract the floor plan as a graph with nodes and edges defined. So…
USC.Trojan
  • 391
  • 5
  • 14
4
votes
2 answers

How to extract a list of items using scrapely?

I'm using scrapely to extract data from some HTML, but I'm having difficulties extracting a list of items. The scrapely github project describes only a simple example: from scrapely import Scraper s = Scraper() s.train(url,…
rkmax
  • 17,633
  • 23
  • 91
  • 176
4
votes
1 answer

How to extract data from website using AngleSharp & LINQ?

I'm trying to extract the prices from the below mentioned website. I'm using AngleSharp for the extraction. In the website, the prices are listed below (as an example): 650.00 I'm using the…
inquisitive_one
  • 1,465
  • 7
  • 32
  • 56
3
votes
1 answer

Extract a time and space variable from a moving ship from the ERA5 reanalysis

I want to extract the measured wind from a station on a moving ship, for which I have the latitude, longitude and time values and the wind value for each time step in space. I can extract a fixed point in space for all time steps but I would like to…