Questions tagged [extract]

Questions related to retrieving specific information from a (typically minimally structured) data source, such as a web site, media file, source code collection or compressed archive (in which case the desired information is one or more original, uncompressed files). When using this tag, please include additional tags to clarify which specific environment/language/scenario your question refers to.

Data extraction is a term with many different but related meanings, including:

  • Parsing files (such as HTML pages) or file metadata in order to obtain certain information. This often involves

  • Retrieving single frames from audio, video or image files

  • Breaking up functionality in a single source code unit (e.g. a function) into multiple units:

  • Retrieving the original files from an (optionally compressed) archive file, such as a .zip or .tar file.

and should be added as a synonym for this tag.
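As a rough illustration of the archive-related meaning in the list above (retrieving the original files from an archive), here is a minimal Python sketch using the standard-library zipfile module. The archive contents, the file names and the helper name `extract_all` are made up for the example; it is one possible approach, not a reference implementation.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_all(archive_path, target_dir):
    """Extract every member of a .zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Build a tiny throwaway archive so the sketch is self-contained.
# The file name and its text are hypothetical sample data.
work_dir = Path(tempfile.mkdtemp())
archive_path = work_dir / "example.zip"
with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notes/readme.txt", "hello from inside the archive")

# Extract into a separate directory and read the restored file back.
out_dir = work_dir / "out"
names = extract_all(archive_path, out_dir)
restored_text = (out_dir / "notes" / "readme.txt").read_text()
```

`extractall` recreates the directory structure stored in the archive, so the restored file ends up under `out/notes/`. For archives from untrusted sources, inspecting the member names before extraction is a sensible precaution.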

6876 questions
93
votes
6 answers

Read Content from Files which are inside Zip file

I am trying to create a simple Java program that reads and extracts the content of the file(s) inside a zip file. The zip file contains 3 files (txt, pdf, docx). I need to read the contents of all these files, and I am using Apache Tika for this…
S Jagdeesh
92
votes
8 answers

Extract and delete all .gz in a directory - Linux

I have a directory with about 500K .gz files. How can I extract all the .gz files in that directory and then delete them?
user2247643
88
votes
9 answers

How to extract only the year from the date in SQL Server 2008?

In SQL Server 2008, how can I extract only the year from a date? In the DB I have a date column, and I need to extract the year from it. Is there any function for that?
Praveen
85
votes
5 answers

pandas extract year from datetime: df['year'] = df['date'].year is not working

I import a dataframe via read_csv, but for some reason I can't extract the year or month from the series df['date']; trying that gives AttributeError: 'Series' object has no attribute 'year': date Count 6/30/2010 525 7/30/2010 136 8/31/2010 …
MJS
84
votes
8 answers

How to parse the Manifest.mbdb file in an iOS 4.0 iTunes Backup

In iOS 4.0, Apple redesigned the backup process. iTunes used to store a list of filenames associated with backup files in the Manifest.plist file, but in iOS 4.0 it has moved this information to a Manifest.mbdb file. You can see an example of this…
Padraig
83
votes
8 answers

Extracting information from a web page by machine learning

I would like to extract a specific type of information from web pages in Python, say, a postal address. It comes in thousands of forms, but it is still somehow recognizable. As there is a large number of forms, it would probably be very difficult to…
Honza Javorek
72
votes
5 answers

Extract files from zip without keeping the structure using python ZipFile?

I am trying to extract all files from a .zip containing subfolders into a single folder. I want all the files from the subfolders extracted into one folder, without keeping the original structure. At the moment, I extract everything, move the files to a folder, then remove…
Thammas
68
votes
11 answers

Extract the text out of HTML string using JavaScript

I am trying to get the inner text of an HTML string using a JS function (the string is passed as an argument). Here is the code: function extractContent(value) { var content_holder = ""; for (var i = 0; i < value.length; i++) { if…
Toshkuuu
68
votes
8 answers

How to extract top-level domain name (TLD) from URL

How would you extract the domain name from a URL, excluding any subdomains? My initial simplistic attempt was: '.'.join(urlparse.urlparse(url).netloc.split('.')[-2:]) This works for http://www.foo.com, but not http://www.foo.com.au. Is there a way…
hoju
67
votes
8 answers

Extract MSI from EXE

I want to extract the MSI from an EXE setup to publish it over a network. For example, I could use Universal Extractor, but it doesn't work for the Java Runtime Environment.
emdadgar2
59
votes
6 answers

How to extract just plain text from .doc & .docx files?

Can anyone recommend a way to extract just the plain text from a .doc or .docx file? I've found this - wondered if there were any other suggestions?
docextract
56
votes
4 answers

Java: export to a .jar file in Eclipse

I'm trying to export a program in Eclipse to a jar file. In my project I have added some pictures and PDFs. When I export to a jar file, it seems that only the main has been compiled and exported. I want to export everything to a jar file…
Adis
54
votes
6 answers

How to extract one file with commit history from a Git repo with index-filter & co?

I have a Git repo converted from SVN to Mercurial to Git, and I wanted to extract just one source file. I also had weird characters like aÌ (an encoding mismatch corrupted Unicode ä) and spaces in the filenames. How can I extract one file from a…
peterhil
48
votes
9 answers

Get min and max value in PHP Array

I have an array like this: array (0 => array ( 'id' => '20110209172713', 'Date' => '2011-02-09', 'Weight' => '200', ), 1 => array ( 'id' => '20110209172747', 'Date' => '2011-02-09', 'Weight' => '180', ), 2 =>…
Peter
47
votes
17 answers

What is so wrong with extract()?

I was recently reading this thread on some of the worst PHP practices. In the second answer there is a mini discussion on the use of extract(), and I'm just wondering what all the huff is about. I personally use it to chop up a given array such as…
barfoon