Questions tagged [html-content-extraction]

Techniques for predicting/detecting certain article text and extracting it from a particular document.

This is also referred to as web scraping (web harvesting or web data extraction), a software technique for extracting information from websites. Such programs usually simulate human exploration of the World Wide Web either by implementing low-level Hypertext Transfer Protocol (HTTP) or by embedding a full web browser, such as Internet Explorer or Mozilla Firefox.
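
As a rough illustration of the extraction step described above, here is a minimal sketch that pulls visible text out of an HTML string using only Python's standard-library `html.parser`. The names `TextExtractor` and `extract_text` and the sample markup are hypothetical, chosen for this example. Fetching the page over HTTP is deliberately left out, and real pages usually need a more robust parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style and not pure whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample document, not taken from any real site.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

This approach only strips markup. It does not try to tell the main article apart from navigation, headers, or footers, which is the harder part of content extraction.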

211 questions
2
votes
4 answers

How can I extract HTML content efficiently with Perl?

I am writing a crawler in Perl, which has to extract contents of web pages that reside on the same server. I am currently using the HTML::Extract module to do the job, but I found the module a bit slow, so I looked into its source code and found out…
Alvin
1
vote
1 answer

iOS - Converting HTML to Normal text

In my application, I'm receiving an html file from a news server. After receiving, I want to remove the tags, images, URL anchors, etc and just show the text in text view. There's a website which functions similar to the one that I'm looking for.…
Satyam
1
vote
1 answer

How to take data/text off a website and input it in my android app via text view, list view etc?

I'm trying to create an application where a page has its text created automatically by reading it off a website. I understand an array and string formatting would be used; I am OK with Android programming but not an expert. I had tried using a set…
1
vote
5 answers

How to collect data from a website

Preface: I have broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, many web languages), so go with whichever you like. I want to make an Android app that compares numbers, but in order to do that I need a database. I'm a one…
Mr. MonoChrome
1
vote
1 answer

Non mshtml c# parsing html and javascript

I'm looking for a way to parse an HTML document with embedded JavaScript. I know that this can be done with MSHTML and code DOM, but in this case that is not an option. I need the program to also be able to run on Mono. Any suggestions?
Arsen Zahray
1
vote
1 answer

.NET 5 HttpClient cannot GET html content page - http 500

I'm trying to use HttpClient to get the HTML content of a page. To try the method, I tested it with the Google URL, and it works: I receive the content of the HTML page. But with the URL I want, it is impossible to get any content. Each time I get a return code…
Kujima
1
vote
1 answer

Extract all Images from HTML whose width or height higher than a specified value - Regex

I'm trying to make a small link-sharing function in Classic ASP, like the ones on LinkedIn or Facebook. What I need to do is get the HTML of a remote URL and extract all the images whose width is greater than, for example, 50px. I can crawl and take the HTML and…
Burak F. Kilicaslan
1
vote
2 answers

To bypass referral check

Is there any way to bypass the referral check that some sites apply to prevent their data from being extracted? For example, if you follow this link, you will get an Access Denied error. However, if you just go to this link, it takes you to the home page and…
Prashant Singh
1
vote
1 answer

Extract file from http response in Azure logic app

I have an Azure function (http triggered) which returns a CSV file in response. I am calling this function from a logic app using http request action (since I need to pass authentication details) and getting the http response with the CSV in body.…
1
vote
3 answers

Screen-scraping for PDF links to download

I'm learning C# through creating a small program, and couldn't find a similar post (apologies if this answer is posted somewhere else). How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified…
1
vote
4 answers

extract the main part of a page in java

Hello, I have a Wikipedia page about a public figure, and I want to use Java to extract the HTML source of the main part of the page. Do you have any ideas?
user651584
1
vote
3 answers

How to get the value of a row extracted using jQuery

I have a table and I'm retrieving each table row by doing this: $(function(){ $('table tr').click(function(){ var $row = $(this).html(); alert($row); }); }); This gets me the current row like…
Tsundoku
1
vote
1 answer

HTML XPath: Extracting text mixed in with multiple level and complex tags?

Related earlier questions: HTML XPath: Extracting text mixed in with multiple tags? HTML XPath: Selectively avoiding tags when extracting text. (Sorry for my poor English.) I'm a beginner at writing web crawlers, and I'm trying to extract the main content from…
1
vote
1 answer

Reading web page source code in Java differs from the original web page source code

I am trying to implement a program that reads a web page's source code, saves it in a text file, and then performs some operations on it. The problem is that when I read the web page source code, there is a difference between the original web page source code and the output of…
Oghli
1
vote
1 answer

Best visible content extractor available

So my application needs the visible content from a given URL: just the text part, with no HTML and no header or footer data. As of now I am using BeautifulSoup and boilerpipe to get it. But in some rare cases I am not getting enough data or the…