Questions tagged [extract]

Questions related to retrieving specific information from a (typically minimally structured) data source, such as a web site, media file, source code collection or compressed archive (in which case the desired information is one or more original, uncompressed files). When using this tag, please include additional tags to clarify which specific environment/language/scenario your question refers to.

Data extraction is a term with many different but related meanings, including:

  • Parsing files (such as HTML pages) or file metadata in order to obtain certain information. This often involves

  • Retrieving single frames from audio, video or image files

  • Breaking up functionality in a single source code unit (e.g. a function) into multiple units:

  • Retrieving the original files from an (optionally compressed) archive file, such as a .zip or .tar file.

and should be added as a synonym for this tag.

6876 questions
41 votes, 3 answers

Extract string before "|"

I have a data set wherein a column looks like this: ABC|DEF|GHI, ABCD|EFG|HIJK, ABCDE|FGHI|JKL, DEF|GHIJ|KLM, GHI|JKLM|NO|PQRS, BCDE|FGHI|JKL .... and so on. I need to extract the characters that appear before the first | symbol. In…
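The excerpt does not show which language the question targets, so the following is only a Python sketch of the idea: split each value at the first `|` and keep the left part. The helper name `before_first_pipe` is hypothetical, and the sample values are taken from the excerpt.

```python
def before_first_pipe(value):
    """Return the text before the first '|' (the whole string if there is none)."""
    head, _separator, _tail = value.partition("|")
    return head


# Sample values copied from the question excerpt.
column = ["ABC|DEF|GHI", "ABCD|EFG|HIJK", "GHI|JKLM|NO|PQRS"]
prefixes = [before_first_pipe(value) for value in column]
print(prefixes)
```

`str.partition` stops at the first separator, so values with more than two `|` characters need no special handling.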
40 votes, 3 answers

Extract string from between quotations

I want to extract information from user-inputted text. Imagine I input the following: SetVariables "a" "b" "c" How would I extract information between the first set of quotations? Then the second? Then the third?
Reznor
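One possible approach, sketched in Python because the excerpt does not name a language: a regular expression that captures everything between each pair of double quotes. The function name `quoted_values` is hypothetical, and the sketch assumes the quoted values contain no escaped quotes.

```python
import re


def quoted_values(text):
    """Return the contents of every double-quoted segment, in order."""
    # [^"]* matches any run of characters that are not a double quote.
    return re.findall(r'"([^"]*)"', text)


values = quoted_values('SetVariables "a" "b" "c"')
first, second, third = values
print(first, second, third)
```

Indexing into the returned list gives the first, second and third quoted values.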
40 votes, 12 answers

Extracting information from PDFs of research papers

I need a mechanism for extracting bibliographic metadata from PDF documents, to save people entering it by hand or cut-and-pasting it. At the very least, the title and abstract. The list of authors and their affiliations would be good. Extracting…
Christopher Gutteridge
39 votes, 7 answers

How to extract data from a PDF file while keeping track of its structure?

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs. I have tried a few different things,…
Marcel
39 votes, 6 answers

Extract .xip file into a folder from command line?

Apple occasionally uses a proprietary XIP file format, particularly when distributing versions of Xcode. It is an analog to zip, but is signed, allowing it to be verified on the receiving system. When a XIP file is opened (by double-clicking), Archive…
Antony Raphel
39 votes, 3 answers

C# regex pattern to extract urls from given string - not full html urls but bare links as well

I need a regex which will do the following: Extract all strings which start with http:// Extract all strings which start with www. So I need to extract these 2. For example there is this given string text below house home go www.monstermmorpg.com…
Furkan Gözükara
38 votes, 6 answers

Extract digits from string - Google spreadsheet

In Google spreadsheets, I need a formula to extract all digits (0 to 9) contained in an arbitrary string, that might contain any possible character and put them into a single cell. Examples (Input -> Output) d32Ελληνικάfe9j.r/3-fF66 ->…
thanos.a
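The question asks for a spreadsheet formula; the Python sketch below only illustrates the underlying logic (delete every character that is not 0 to 9) and is not a Google Sheets answer. The function name `ascii_digits` and the sample string are hypothetical.

```python
import re


def ascii_digits(text):
    """Keep only the characters 0-9, in their original order."""
    # [^0-9] is used instead of \D so that non-ASCII digits are removed too.
    return re.sub(r"[^0-9]", "", text)


sample = "ab12Ελληνικάcd3-/4"  # hypothetical input, not the one from the question
print(ascii_digits(sample))  # prints 1234
```

An explicit `[^0-9]` class is chosen because, in Python, `\d` also matches digits from other scripts, which the question's "0 to 9" wording appears to exclude.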
38 votes, 3 answers

JAR - extracting specific files

I have .class and .java files in a JAR archive. Is there any way to extract only .java files from it? I've tried this command but it doesn't work: jar xf jar-file.jar *.java
user3521479
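Since a JAR file is a ZIP archive, one alternative to the `jar` command is to filter the member list programmatically. The sketch below uses Python's standard `zipfile` module; the function name `extract_by_suffix` and the output directory name are hypothetical, and this is not presented as the fix for the command in the question.

```python
import zipfile


def extract_by_suffix(archive_path, suffix, dest_dir):
    """Extract only the members whose names end with `suffix`; return their names."""
    with zipfile.ZipFile(archive_path) as archive:
        wanted = [name for name in archive.namelist() if name.endswith(suffix)]
        archive.extractall(dest_dir, members=wanted)
    return wanted


# Example call (file and directory names are placeholders):
# extract_by_suffix("jar-file.jar", ".java", "java_sources")
```

Filtering by name inside the program avoids relying on the shell to expand `*.java`, which only matches files in the current directory rather than entries inside the archive.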
36 votes, 2 answers

Extract part of a git repository?

Assume my git repository has the following structure: /.git /Project /Project/SubProject-0 /Project/SubProject-1 /Project/SubProject-2 and the repository has quite some commits. Now one of the subprojects (SubProject-0) grows pretty big, and I…
Rio
36 votes, 3 answers

Extract a ZIP file programmatically by DotNetZip library?

I have a function that gets a ZIP file and extracts it to a directory (I use the DotNetZip library). public void ExtractFileToDirectory(string zipFileName, string outputDirectory) { ZipFile zip = ZipFile.Read(zipFileName); …
Ehsan
35 votes, 3 answers

How can I untar a tar.bz file in unix?

I've found tons of pages saying how to untar tar.bz2 files, but how would one untar a tar.bz file?
Tim
34 votes, 6 answers

How do you extract a url from a string using python?

For example: string = "This is a link http://www.google.com" How could I extract 'http://www.google.com' ? (Each link will be of the same format i.e 'http://')
Sheldon
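A minimal sketch of one way to do this with the standard `re` module, assuming (as the question states) that every link starts with `http://` and ends at the next whitespace character:

```python
import re

string = "This is a link http://www.google.com"

# \S+ consumes every non-whitespace character after the scheme.
match = re.search(r"http://\S+", string)
url = match.group(0) if match else None
print(url)  # prints http://www.google.com
```

If a link can be followed directly by punctuation (for example a trailing full stop), the pattern will include it, so real input may need a stricter expression.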
33 votes, 6 answers

Extracting text from PDFs in C#

Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database. I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp),…
Duncan Tait
33 votes, 3 answers

Java library for keywords extraction from input text

I'm looking for a Java library to extract keywords from a block of text. The process should be as follows: stop word cleaning -> stemming -> searching for keywords based on English linguistics statistical information - meaning if a word appears more…
Shay
31 votes, 3 answers

Extract first word from a column and insert into new column

I have a dataframe below and want to extract the first word and insert it into a new column Dataframe1: COL1 Nick K Jones Dave G Barros Matt H Smith Convert it to this: Dataframe2: COL1 COL2 Nick K Jones Nick Dave G Barros …
Nick
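The excerpt does not say which dataframe library is in use, so this sketch models the table as a plain Python list of dictionaries. The function name `add_first_word` is hypothetical; the column names and sample rows come from the question.

```python
def add_first_word(records, source="COL1", target="COL2"):
    """Return copies of the records with the first word of `source` stored in `target`."""
    result = []
    for record in records:
        words = record[source].split()
        # An empty or whitespace-only cell yields an empty string.
        result.append({**record, target: words[0] if words else ""})
    return result


rows = [{"COL1": "Nick K Jones"}, {"COL1": "Dave G Barros"}, {"COL1": "Matt H Smith"}]
with_first_word = add_first_word(rows)
for row in with_first_word:
    print(row["COL1"], "->", row["COL2"])
```

The same split-and-take-first idea carries over to dataframe libraries, which usually offer a vectorised string split.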