Questions tagged [data-extraction]

Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).

The import into the intermediate extracting system is thus usually followed by data transformation and possibly the addition of metadata prior to export to another stage in the data workflow.

Typical unstructured data sources include web pages, emails, documents, PDFs, scanned text, mainframe reports, and spool files. Extracting data from these sources has grown into a considerable technical challenge: whereas data extraction historically had to deal with changes in physical hardware formats, most current data extraction deals with unstructured data sources and with differing software formats. Data extraction from the web in particular is referred to as web scraping.

The act of adding structure to unstructured data takes a number of forms:

  • Using text pattern matching, such as regular expressions, to identify small- or large-scale structure, e.g. records in a report and their associated data from headers and footers;
  • Using a table-based approach to identify common sections within a limited domain, e.g. identifying skills, previous work experience, and qualifications in emailed resumes by means of a standard set of commonly used headings (these differ from language to language); for example, Education might be found under Education/Qualification/Courses;
  • Using text analytics to attempt to understand the text and link it to other information.
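
As a rough illustration of the first two approaches in the list above, here is a minimal Python sketch. The report layout, the invoice field names, and the heading table are all made up for the example (they are assumptions, not part of any real format), so treat this as one possible shape rather than a reference implementation.

```python
import re

# Hypothetical report text: the layout and field names are assumptions.
REPORT = """\
ACME MONTHLY REPORT        PAGE 1
INV-001  2021-03-04   120.50
INV-002  2021-03-09    75.00
END OF PAGE 1
"""

# Pattern matching: one regular expression describes a record line;
# header and footer lines simply fail to match and are skipped.
RECORD = re.compile(
    r"^(?P<invoice>INV-\d+)\s+(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<amount>\d+\.\d{2})$",
    re.MULTILINE,
)


def extract_records(text):
    """Return one dict per record line found in the report text."""
    return [m.groupdict() for m in RECORD.finditer(text)]


# Table-based approach: map commonly used headings (hypothetical list,
# English only) onto a small set of canonical section names.
HEADING_TABLE = {
    "education": "education",
    "qualification": "education",
    "qualifications": "education",
    "courses": "education",
    "skills": "skills",
    "work experience": "experience",
    "previous work experience": "experience",
}


def split_sections(text):
    """Group non-empty lines under the canonical section of the last heading seen."""
    sections = {}
    current = None
    for line in text.splitlines():
        key = line.strip().rstrip(":").lower()
        if key in HEADING_TABLE:
            current = HEADING_TABLE[key]
            sections.setdefault(current, [])
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


if __name__ == "__main__":
    print(extract_records(REPORT))
    print(split_sections("Qualifications:\nBSc Physics\n\nSkills\nPython\nSQL\n"))
```

A heading table like this would need one set of entries per language, and real resumes usually call for fuzzier matching than an exact lookup.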
939 questions
3
votes
1 answer

Extract width and height from image links via ImgApp Google Sheet Library

I want to extract the dimensions of an image from its URL in Google Sheets. I found this library that does exactly what I am after: https://github.com/tanaikech/ImgApp#getsize But I am very new to this and am wondering what exactly I should use…
3
votes
0 answers

How can I extract game resources made by Unity3D?

I followed the tutorial on the following website (this is the tutorial), but found that the contents of the file were different from those in the tutorial. As a result, when I extracted the assets folder, there was no content I wanted to…
3
votes
0 answers

SQL Server linked server to a Microsoft Dataverse environment

I would like to connect from an on-premises Microsoft SQL Server environment to a Dataverse environment in Azure. I want to be able to download data from Dataverse to SQL Server. I would like to know if I can use linked servers to do this and what…
Nige
  • 31
  • 1
3
votes
1 answer

str_extract_all with decimal numbers

I have this dataframe (DF1) structure(list(ID = 1:3, Temperature = c("temp 37.8 37.6", "37,8 was body temperature", "110 kg and 38 temp")), class = "data.frame", row.names = c(NA, -3L)) ID Temperature 1 "temp 37.8 37.6" 2 "37,8 was body…
onhalu
  • 735
  • 1
  • 5
  • 17
3
votes
1 answer

Export Google Docs comments into Google Sheets, along with highlighted text?

Would there be a way to export comments from Google Docs so that the comments show up in a Google Sheets doc in one column and the highlighted text from the Google Doc shows up in the column next to it? I understand that file comments are accessible…
3
votes
1 answer

Extract data from pdf invoices of varying formats

The objective is to extract data from invoices in PDF format. PDF data format: selectable text (not scanned images), consisting of lines of text, name-value pairs, and tables (of varying lengths). Invoice data includes: invoice_no, invoice_date,…
3
votes
2 answers

Printing top few lines of a large JSON file in Python

I have a JSON file whose size is about 5 GB. I know neither how the JSON file is structured nor the names of the roots in the file. I'm not able to load the file on my local machine because of its size, so I'll be working on high-computation servers.…
Deni Avinash
  • 31
  • 1
  • 2
3
votes
4 answers

Python: how to assign multiple values to one key

I extract data using an API and retrieve a list of servers and backups. Some servers have more than one backup. This is how I get the list of all servers with backup IDs: bkplist = requests.get('https://heee.com/1.2/storage/backup') bkplist_json =…
Brzozova
  • 382
  • 2
  • 15
3
votes
1 answer

Python extract multiple lat/lon from NETCDF files using xarray

I have an NC file (time, lat, lon) (download from here) and I am trying to extract time series for multiple stations (lat/lon points; download from here). So I tried it this way to read the coordinates and extract the nearest values from the NC file…
Seji
  • 371
  • 1
  • 10
3
votes
1 answer

How to extract tabular data from a website using R

I am trying to extract the data from the webpage https://www.geojit.com/other-market/world-indices and many others similar to this. I need to get the tabular data of the website (INDEX,NAME,COUNTRY,CLOSE,PREV.CLOSE,NET CHANGE,CHANGE (%),LAST…
Evan Strom
  • 65
  • 1
  • 7
3
votes
2 answers

Retrieve last n rows based on one numeric column in Google Sheets

My data looks like this: +---------------+-----+-----+------+-----+-----+ | Serial Number | LSL | LCL | DATA | UCL | USL | +---------------+-----+-----+------+-----+-----+ | 1 | 1 | 3 | 2.3 | 7 | 9 | | 2 | 1 | 3…
3
votes
4 answers

Read .txt data using python

I have a .txt file like this: # 经纬度 x1 = 11.21 x2 = 11.51 y1 = 27.84 y2 = 10.08 time: 201510010000 变量名: val1 [1.1,1.2,1.3] 变量名: va2 [1.0,1.01,1.02] time: 201510010100 变量名: val1 [2.1,2.2,2.3] 变量名: va2 …
user10025959
  • 55
  • 1
  • 7
3
votes
1 answer

Extracting key-value pairs from OCR text

I'm supposed to use OCR to identify text in legal documents, extract relevant keys and their values (around 40 attributes), and then store them in an Excel sheet. I've already implemented the OCR part, and have my dictionary defined something like…
A Jain
  • 31
  • 1
  • 5
3
votes
0 answers

Pdf Text wrong character extraction

I have a PDF page with a formula as: When text is extracted, a few characters are wrong. Text looks like this: /ToUnicode Object 33 0 R unfiltered stream looks like this: Encoding looks like this: Rendering instructions are below: Unicode Vulgar…
Mack
  • 149
  • 1
  • 8
3
votes
1 answer

Need to extract all the content inside the brackets in a pandas DataFrame

I need to extract only the content inside the brackets in a pandas DataFrame. I tried using str.extract(), but it's not working. I need help with the extraction. DATA: (IS IN DATA FRAME, this is sample data from one row) By:Chen TX (Chen Tianxu)[ 1…