Questions tagged [extract]

Questions related to retrieving specific information from a (typically minimally structured) data source, such as a web site, media file, source code collection or compressed archive (in which case the desired information is one or more original, uncompressed files). When using this tag, please include additional tags to clarify which specific environment/language/scenario your question refers to.

Data extraction is a term with many different but related meanings, including:

  • Parsing files (such as HTML pages) or file metadata in order to obtain certain information. This often involves

  • Retrieving single frames from audio, video or image files

  • Breaking up functionality in a single source code unit (e.g. a function) into multiple units:

  • Retrieving the original files from an (optionally compressed) archive file, such as a .zip or .tar file.

and should be added as a synonym for this tag.
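As a small illustration of the last meaning listed above (retrieving an original file from an archive), here is a minimal Python sketch using the standard-library zipfile module. The helper names `build_sample_archive` and `extract_member` and the member path `notes/readme.txt` are hypothetical and chosen only for this example; the archive is built in memory so the snippet runs without touching the file system.

```python
import io
import zipfile


def build_sample_archive() -> bytes:
    """Create a small in-memory ZIP archive (illustrative data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical member name and content, used only for this sketch.
        archive.writestr("notes/readme.txt", "hello from the archive")
    return buffer.getvalue()


def extract_member(archive_bytes: bytes, member_name: str) -> bytes:
    """Return the uncompressed bytes of one file stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member_name)


if __name__ == "__main__":
    data = build_sample_archive()
    print(extract_member(data, "notes/readme.txt").decode("utf-8"))
```

For an archive on disk, the same `zipfile.ZipFile` class accepts a file path instead of a `BytesIO` object.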

6876 questions
1
vote
2 answers

Is the code I'm using to make .zip files correct?

I'm using this code in C# to zip files. I need to open these files in an Android app (Java): String mp3Files = "E:\\"; int TrimLength = mp3Files.ToString().Length; byte[] obuffer; string outPath = mp3Files + "\\" + i + ".zip"; ZipOutputStream…
Omar
1
vote
2 answers

Extract text between matching parentheses

Consider the following example: x <- "something('pineapple', 'orange', y = c('peach', 'banana'), z = 'lemon'), something(v = c('apple', 'pear'), z = c('cherry', 'strawberry', 'grape'))" I want to extract the segments encapsulated by something( and…
Chr
1
vote
3 answers

How can I select certain periods from a 10 year data set?

I have a set of data giving information about the daily precipitation between 2013-01-01 and 2022-12-31. Date Precipitation_value 1 2013-01-01 3.7 2 2013-01-02 0.1 3 2013-01-03 0.6 4 …
Marle Mü
1
vote
2 answers

Extract value from a JSON array with no name

I have a table with a record that has JSON content and it is an array with no name. Sample data looks like…
sinDizzy
1
vote
1 answer

Recognizing drop caps in PDF in python

I'm currently using pymupdf to extract text blocks from a file in python. import fitz doc = fitz.open(filename) for page in doc: text = page.get_text("blocks") for item in text: print(item[4]) The problem is that drop caps are…
Esraa Abdelmaksoud
1
vote
2 answers

Extract x amount of characters before and after certain strings in column

I am trying to find a way to look for 10 words in every row within one column. For example: words <- c('corona','covid','infection','positive','test','negative','result','antigen','covid19','unknown') I have a df with 2 columns, 'ID' and 'comment'…
Debbie Oomen
1
vote
2 answers

How to extract a picture from a PDF file

I want to extract pictures from PDF files using C++, but I don't understand the picture format in PDF files. Can someone help me? I looked at the content of a PDF file by opening it with Notepad, then tried to unzip the content, but failed to extract the pictures.
zcnc
1
vote
1 answer

Data extraction from existing series in pandas

In my infinite quest of teaching myself to program, I have come to find that the easiest questions to ask about doing something are the HARDEST to grasp (programming-wise). Case in point: Let's say I have an output from printing out a data frame: Work…
VJ1222
1
vote
1 answer

Extract and Print only accent characters through regular expression in JAVA

I have been trying to extract only accented characters [a particular word] from multiple text files in a folder. I don't want to remove or convert accented characters to normal characters, but to print only those characters which are accented in multiple text files…
RONNY
1
vote
1 answer

How to convert a human-readable timeline to table using existing ML tools?

I have this timeline from a newspaper produced by my Native American tribe. I was trying to use AWS Textract to produce some kind of table from this. AWS Textract does not recognize any tables in this. So I don't think that will work (perhaps more…
Joey Morrow
1
vote
0 answers

Why does my code using Selenium have such a long iteration time in a for-loop in Python? (Chromedriver)

I'm a beginner in Python, so please be patient with me. I want to extract some simple data from an array of URLs. The HTML content of all the URLs has the same structure, so extracting the data by using a for-loop works out fine. I use Selenium, because I…
Do0dl3r
1
vote
1 answer

Error when trying to extract values from a raster layer to each shapefile using R

I'm trying to extract the values from a raster layer to a shapefile layer, but the process is extremely time consuming: it has run for 2 hours without giving me any result. In general considering the size of the polygons and the raster this process should not…
wesleysc352
1
vote
2 answers

extracting strings from a dataframe row containing multiple entries?

I have a messy CSV dataset that contains several (but not all) rows that unfortunately contain multiple entries. For each row, I'd like to separate each entry out so that I can create a list of the unique values (in this case, a list of specific…
1
vote
1 answer

extract upper case from the title pandas

I have a dataset in which the column looks like this: col AMPCO Impact Socket MEGGAR HARLEY Impact Socket Is there any way I can extract AMPCO, MEGGAR HARLEY? Even if I can get MEGGAR from the second sentence, that would also work. I…
1
vote
2 answers

checking the data in the master list and comparing it with the column within dataframe

Existing Dataframe: Id status countries 01 pass ['xyx','Indonesia','brazil'] 02 fail ['PQ','XT','sri lanka'] 03 pass ['spain', 'india','xtx'] Expected Dataframe : Id status …
Romi