Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

Also referred to as web scraping (web harvesting or web data extraction), this is a computer software technique for extracting information from websites. Usually, such programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP) or embedding a fully fledged web browser, such as Internet Explorer or Mozilla Firefox.
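
As a rough illustration of what extraction means here, the following is a minimal sketch in Python using only the standard library's `html.parser`. It collects visible text and skips `<script>` and `<style>` content. The names `TextExtractor`, `extract_text`, and the sample markup are made up for this example, and real article extractors use far more elaborate heuristics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample document, not taken from any real site.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>News</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

This only strips markup; deciding which text is the article and which is navigation or boilerplate is the harder problem this tag covers.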

211 questions
0
votes
2 answers

Extracting description out of an html page

I am trying to extract the title and description from web pages using DOMDocument(). I can extract the title like this: $d = new DOMDocument(); $d->loadHTML($html); $title = $d->getElementsByTagName("title")->item(0)->textContent; I can…
Sourabh
  • 1,757
  • 6
  • 21
  • 43
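
The question above uses PHP's DOMDocument. For comparison, here is a hedged sketch of the same idea in Python's standard library: read the `<title>` text and the `content` attribute of `<meta name="description">`. The class name `MetaExtractor`, the helper `extract_title_and_description`, and the sample page are assumptions for illustration only, and this is not the asker's code.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Records the <title> text and the meta description, if present."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Attribute values can be None, so fall back to an empty string.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_description(html):
    """Return (title, description); description is None when missing."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.description


# Hypothetical page used only to exercise the sketch.
page = (
    "<html><head><title>Example page</title>"
    '<meta name="description" content="A short summary.">'
    "</head><body></body></html>"
)
title, description = extract_title_and_description(page)
print(title)
print(description)
```

The description lives in an attribute rather than in element text, which is why reading it needs a different lookup from the title.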
0
votes
1 answer

how to run and get document stats from boilerpipe article extractor?

There's something I'm not quite understanding about the use of boilerpipe's ArticleExtractor class. Admittedly, I am also very new to Java, so perhaps my basic knowledge of this environment is at fault. Anyhow, I'm trying to use boilerpipe to extract…
brneuro
  • 326
  • 1
  • 5
  • 15
0
votes
1 answer

double encoded html code

I use Xinha as a WYSIWYG editor for HTML content. I send HTML articles via a POST form to PostgreSQL. So far so good; they seem OK. But when I retrieve them from PostgreSQL and output them to an HTML page, I see double-encoded, i.e. broken, HTML code like…
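
To show what double encoding looks like, here is a small Python sketch (the asker's stack is different, and the cause in their setup is not known from the excerpt). It escapes a fragment twice with `html.escape` and then decodes it with repeated `html.unescape` calls. The helper name `unescape_repeatedly` and the `max_rounds` limit are assumptions made for this example.

```python
import html


def unescape_repeatedly(text, max_rounds=3):
    """Apply html.unescape until the text stops changing or the limit is hit."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


original = "<p>Hi</p>"
once = html.escape(original)   # what a single, correct escape produces
twice = html.escape(once)      # the double-encoded form: '&' became '&amp;'

print(once)
print(twice)
print(unescape_repeatedly(twice))
```

Decoding repeatedly is a workaround at best, because it can also decode text that was meant to stay escaped. The more durable fix is usually to find the layer that escapes the content a second time.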
-1
votes
2 answers

Lstrip and Rstrip won't work, need help removing text from an output in Python 3

The output is part of a list. When I try to figure out the output's type using type() it returns : . I am trying to remove everything to the left of "href" and everything to the right of "
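
A common reason `lstrip` and `rstrip` seem not to work for this: their argument is a set of characters to strip, not a prefix or suffix. A minimal sketch of slicing between two markers instead is shown below. The function `slice_between`, the sample tag, and the `">"` end marker are hypothetical, since the excerpt does not show the asker's data or end marker.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker.

    Returns None when start_marker is absent, and everything from
    start_marker onward when end_marker is absent.
    """
    start = text.find(start_marker)
    if start == -1:
        return None
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return text[start:]
    return text[start:end]


# lstrip removes any leading characters found in the given set.
stripped = "hello world".lstrip("hel")
print(stripped)

# Hypothetical element; str() is applied in case the item is not a plain string.
tag = '<a class="link" href="/news/42">Read more</a>'
print(slice_between(str(tag), "href", ">"))
```

If the list items come from an HTML parsing library, reading the attribute through that library's own API is usually more robust than string slicing.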
-1
votes
2 answers

Unable to select html element by class name in python selenium

I am trying to select "I NEED THIS TEXT" from the last line of the following HTML code and have not had any success so far:
sudonym
  • 3,788
  • 4
  • 36
  • 61
-1
votes
1 answer

How can I remove the enclosing tags around a piece of HTML?

I am creating a custom filter for text using the asciidoc syntax for Drupal using the customfilter module. I enclose it in [asciidoc][/asciidoc] tags and when I run it through the asciidoctor command the output is enclosed in
vfclists
  • 19,193
  • 21
  • 73
  • 92
-1
votes
1 answer

How do I extract HTML content using Regex in PHP

I know, I know... regex is not the best way to extract HTML text. But I need to extract article text from a lot of pages, and I can store regexes in the database for each website. I'm not sure how XML parsers would work with multiple websites. You'd…
HyderA
  • 20,651
  • 42
  • 112
  • 180
-1
votes
2 answers

Extract news links from news website

Is there any reliable method to find the collection of links that lead to detailed news pages? In other words, after visiting the first page of a website, I want only those links that refer to a news item. Any solution?
-1
votes
1 answer

What is the best method to extract relevant info from Email?

My friend has a small business where customers order services using email. He receives several emails a day, and sorting through them is becoming cumbersome. There are about 10 different kinds of tasks the customer can request, and for each there are one…
-1
votes
1 answer

Extract table from html using Java or Javascript

I have HTML files called page1.html and page2.html. In both files I have some content inside table elements. Now I want to extract those table contents and put them in a new file called summary.html. I don't know jQuery, so how to do this…
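
The question asks for Java or JavaScript. The sketch below swaps in Python's standard library purely to illustrate the approach: walk the markup, start a new row at each `<tr>`, and collect the text of each `<td>` or `<th>`. `TableExtractor`, `extract_tables`, and the sample markup are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects each table as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return every table found in the HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical stand-in for the contents of page1.html.
page1 = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
print(extract_tables(page1))
```

Running this over each page and writing the collected rows back out as one table would give the summary file described in the question.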
-1
votes
2 answers

How to read some portion of a webpage and store its text in an excel file

I have downloaded a website using website copier software. I want to extract some information from all the pages. Suppose there are many product pages and I want to gather only the product information from all pages and store it in an Excel file. I want…
Abhinav
  • 3,322
  • 9
  • 47
  • 63
-1
votes
2 answers

Extracting numerical data from "data-" in html

In the HTML below, I want to simply divide two numbers and return the result to the page. The JavaScript accomplishes this; however, the variable A (GAL2_G2rAimTonsPerHr) is updated every 10 seconds or so from our PI server (historian). How can I…
J.C.Morris
  • 803
  • 3
  • 13
  • 26
-2
votes
3 answers

Extract a content of a html page in php

Is there any way to extract the content of an HTML page that starts from and ends with in PHP? If so, can anyone post some sample code?
bharathi
  • 6,019
  • 23
  • 90
  • 152
-2
votes
2 answers

How can I find feed or XML of a particular news source

I want to get an XML file for a particular news source, or to know if there is any project that converts HTML news to XML by parsing the page and tokenizing its various traits, such as date, author name, title, content, etc., into a single XML or similar type of file.…
puneet
  • 129
  • 7