I am trying to grab certain IDs out of HTML code. Part of it works, but I need help with the rest. Here is some sample HTML for the videos:

<video id="movie1" class="show_movie-camera animation_target movieBorder hasAudio movieId_750" src="/path/to/movie" style="position: absolute; z-index: 505; top: 44.5px; left: 484px; display: none;" preload="true" autoplay="true"></video>
<video id="movie2" class="clickInfo movieId_587" src="/path/to/movie" preload="true" autoplay="true"></video>
<video id="movie300" src="/path/to/movie" preload="true" autoplay="true"></video>

To get the movie IDs, I look for movieId_[ID] or movie[ID] using this regex:

.*?<object|<video.*?movie(\\d+)|movieId_(\\d+)[^>]*>?.*?

This works, but it puts both movieId_[ID] and movie[ID] in the matches, rather than just one. What I want is to use movieId_[ID] when it is present and fall back to movie[ID] otherwise. This is the code I use:

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
int fileId = -1;
while(m.find()) {
    fileId = -1;
    if (m.group(2) != null) {
        fileId = new Integer(m.group(2));
    } else if (m.group(1) != null) {
        fileId = new Integer(m.group(1));
    }
}

This gives me 1, 750, 2, 587, 300 instead of the 750, 587, 300 that I am looking for.

Additionally, I am looking to get the matches that have the hasAudio class. Here is what I tried with no success:

.*?<object|<video.*?hasAudio.*movieId_(\\d+)|movieId_(\\d+).*hasAudio[^>]*>?.*?

Any help would be appreciated. Thanks!

fanfavorite
  • Yes, sorry that has been corrected. – fanfavorite Oct 04 '17 at 18:35
  • 6
    [You shouldn't use regex to parse html](https://stackoverflow.com/a/1732454/6073886) – OH GOD SPIDERS Oct 04 '17 at 18:35
  • Better to use something like jsoup? The HTML is contents in a database table that gets pulled and then processed. – fanfavorite Oct 04 '17 at 18:38
  • Your regex is missing some groups. You're currently scanning for `.*?]*>?.*?`. Including `.*?` at the beginning of the first part is redundant, and so is including `.*?` at the end of the third part, since `find()` will scan for matching patterns anyway. And since you're looking for movie ids, why scan for ` – Andreas Oct 04 '17 at 18:42
  • Ok thanks. Object is for old flash files. – fanfavorite Oct 04 '17 at 18:44

1 Answer

For the first issue, try the pattern below:

.*?<object|<video[^>]*((?<=movieId_)\d+|(?<=movie)\d+)

To make it work, your Java code would be (remember to double the backslashes when the pattern is written as a Java string literal):

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(content);
int fileId = -1;
while(m.find()) {
    fileId = -1;
    if (m.group(1) != null) {
        fileId = Integer.parseInt(m.group(1)); // parseInt avoids the deprecated Integer constructor
    }
}

Demo of regex here.
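
For anyone who wants to run this end to end, here is a minimal, self-contained sketch of the approach above. The class name `MovieIdDemo`, the helper `extractIds`, and the abridged sample markup are my own choices for illustration, and I left out the leading `.*?` because `find()` scans forward by itself. Treat it as a sketch under those assumptions rather than a drop-in solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MovieIdDemo {
    // Pattern from the answer as a Java string literal (backslashes doubled).
    // The leading ".*?" is omitted because find() already scans forward.
    private static final Pattern MOVIE_ID = Pattern.compile(
            "<object|<video[^>]*((?<=movieId_)\\d+|(?<=movie)\\d+)");

    // Hypothetical helper: collects one ID per matching <video> tag.
    static List<Integer> extractIds(String content) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = MOVIE_ID.matcher(content);
        while (m.find()) {
            // Group 1 is null when the match was a bare "<object".
            if (m.group(1) != null) {
                ids.add(Integer.parseInt(m.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Abridged version of the sample markup from the question.
        String content =
                "<video id=\"movie1\" class=\"movieBorder hasAudio movieId_750\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie300\" src=\"/path/to/movie\"></video>\n";
        System.out.println(extractIds(content)); // prints [750, 587, 300]
    }
}
```

One caveat: because `[^>]*` is greedy, the engine backtracks from the end of the tag, so the last `movieId_`/`movie` number in the tag is the one captured. That gives the wanted result for the sample, where `class` comes after `id`, but a tag written as `<video class="movieId_5" id="movie9">` would yield 9, not 5.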


UPDATE FOR SECOND CONDITION

.*?<object|<video[^>]*hasAudio[^>]*((?<=movieId_)\d+|(?<=movie)\d+)

Demo of regex here


Explanation

.*?<object                 //Already existing regex
|                          //OR capture the movie ID as below
<video[^>]*hasAudio[^>]*   //Part of full match include all characters except '>'
                           //This makes sure matches do not go beyond the tag
                           //Also makes sure that hasAudio is part of this string
(                          //START: Our Group1 capture as Movie ID 
(?<=movieId_)\d+           //First try getting id out of movieId_xxx
|                          //OR if first fails
(?<=movie)\d+              //Second try getting id out of moviexxx
)                          //END: Our Group1 capture as Movie ID

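Note: the leading .*? in .*?<object is not needed, because find() already scans ahead for the next match. It can even do harm: if a <video> tag sits before an <object on the same line, that first alternative swallows everything up to <object and the video ID is skipped. <object on its own is enough.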
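
As a rough illustration of the hasAudio pattern explained above, here is a small runnable sketch. The names `AudioMovieIdDemo` and `extractAudioIds` and the abridged sample markup are assumptions made for the example, and the leading `.*?` is again omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AudioMovieIdDemo {
    // Same idea as before, but "hasAudio" must appear inside the tag
    // somewhere BEFORE the ID that gets captured.
    private static final Pattern AUDIO_MOVIE_ID = Pattern.compile(
            "<object|<video[^>]*hasAudio[^>]*((?<=movieId_)\\d+|(?<=movie)\\d+)");

    // Hypothetical helper: IDs of <video> tags that carry hasAudio.
    static List<Integer> extractAudioIds(String content) {
        List<Integer> ids = new ArrayList<>();
        Matcher m = AUDIO_MOVIE_ID.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) { // null for a bare "<object" match
                ids.add(Integer.parseInt(m.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Abridged version of the sample markup from the question.
        String content =
                "<video id=\"movie1\" class=\"movieBorder hasAudio movieId_750\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie300\" src=\"/path/to/movie\"></video>\n";
        System.out.println(extractAudioIds(content)); // prints [750]
    }
}
```

The limitation is the ordering: a tag such as `<video id="movie7" class="hasAudio">`, where the only ID comes before `hasAudio`, produces no match with this pattern.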


UPDATE 2

<object|<video[^>]*\K(?:hasAudio[^>]*\K(?:(?<=movieId_)\d+|(?<=movie)\d+)|(?:(?<=movieId_)\d+|(?<=movie)\d+)(?=[^>]*hasAudio))

Here I added a branch for a hasAudio that comes after the ID, if any. Note that in this regex the full match is the movie ID itself; there are no capture groups.

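The main feature used here is the \K escape, which resets the start of the reported match to the current position, thereby dropping all previously consumed characters from the match. This gets us around the need for a variable-length look-behind. Be aware that \K is a PCRE feature: java.util.regex.Pattern does not support it, so this pattern works in a PCRE-based tester but will not compile in Java as written.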

Demo here
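
Since `\K` is not available in `java.util.regex`, one way to get the same result in Java is a different technique: match each `<video ...>` tag first, then inspect the tag text in separate steps. This is not the single-regex approach above; it is a two-step sketch, and the names `TagScanDemo` and `extractAudioIds` are made up for the example. It assumes tags contain no `>` inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagScanDemo {
    private static final Pattern VIDEO_TAG = Pattern.compile("<video\\b[^>]*>");
    private static final Pattern PREFERRED = Pattern.compile("movieId_(\\d+)");
    private static final Pattern FALLBACK = Pattern.compile("movie(\\d+)");

    // Hypothetical helper: for each <video> tag that mentions hasAudio,
    // take movieId_[ID] if present, otherwise fall back to movie[ID].
    static List<Integer> extractAudioIds(String content) {
        List<Integer> ids = new ArrayList<>();
        Matcher tags = VIDEO_TAG.matcher(content);
        while (tags.find()) {
            String tag = tags.group();
            if (!tag.contains("hasAudio")) {
                continue; // hasAudio may sit anywhere in the tag
            }
            Matcher preferred = PREFERRED.matcher(tag);
            if (preferred.find()) {
                ids.add(Integer.parseInt(preferred.group(1)));
                continue;
            }
            Matcher fallback = FALLBACK.matcher(tag);
            if (fallback.find()) {
                ids.add(Integer.parseInt(fallback.group(1)));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String content =
                "<video id=\"movie1\" class=\"movieBorder hasAudio movieId_750\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie2\" class=\"clickInfo movieId_587\" src=\"/path/to/movie\"></video>\n"
              + "<video id=\"movie7\" class=\"hasAudio\" src=\"/path/to/movie\"></video>\n";
        System.out.println(extractAudioIds(content)); // prints [750, 7]
    }
}
```

Splitting the work this way makes the movieId_ preference independent of attribute order, and dropping the `contains("hasAudio")` check gives the plain "movieId_ with movie fallback" behaviour from the first issue.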

kaza