
As part of a data migration task, I am using PHP to extract the values of the alt and title attributes of img elements from some HTML.

An example of the source html is:

<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>

As you can see, in the source HTML the values of the alt and title attributes are delimited by single apostrophes ('). But within the text itself, the apostrophe is also used in the possessive sense, to mean the vegetables belonging to Andy.

For a simple parser this is problematic: it would incorrectly treat the apostrophe within the text as the end of the value, yielding 'Andy' rather than 'Andy's garden vegetables'.

The solution I can think of is to incorporate more of the surrounding text into a regex to pin down the start and end of the attribute value, i.e. the alt=' at the start and the ' at the end. However, this would not work if there are spaces around the = or if double quotes were used. I suspect that unescaped single apostrophes inside a single-quoted value are not legal HTML, but that is the data I have to work with.

Is there a more robust solution than regex, perhaps HTML DOM based, that can handle single apostrophes within the text and distinguish them from those used as delimiters?

therobyouknow
  • The `img` tag is invalid to begin with. If the value of an attribute contains single quotes, then the value itself must be enclosed within double quotes, e.g. `"Andy's garden vegetables"`. Since the HTML is invalid to begin with, I don't think any DOM-based parsing is going to help you. You may have to write some logic to clean up the HTML or extract what you need yourself. – Susam Pal Nov 18 '13 at 08:43
  • I tried repairing your code with Tidy. It didn't work. So I'm guessing there is no "lenient DOM parser" that can ignore such fatal DOM errors, and I'm concluding this has to be handled with some kind of regex only. – Manu Manjunath Nov 18 '13 at 10:28
  • +1 Good to know and to eliminate DOM from possible solutions. Thanks Manu. – therobyouknow Nov 18 '13 at 10:40

2 Answers


This matches the alt and title values in your sample data. It uses lookarounds with alternation, plus a reluctant quantifier (.+?) so that the match stops at the first apostrophe followed by whitespace or a slash, rather than running on to the last such apostrophe in the input:

(?<=alt='|title=').+?(?='(\s|/))

See a live demo of this regex working with your sample and some edge cases.
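
For anyone unsure how to apply this from PHP, here is a minimal sketch, assuming PCRE via `preg_match_all` and the sample markup from the question. The variable names and the `~` delimiter are my own choices (the pattern contains a `/`, so another delimiter saves an escape), and the pattern sits in a double-quoted PHP string so its apostrophes need no escaping.

```php
<?php
// Sample markup from the question; a nowdoc keeps both quote styles literal.
$html = <<<'HTML'
<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables'/>
HTML;

// Double-quoted PHP string, so the apostrophes in the pattern are plain characters.
// "\\s" yields the regex token \s.
$pattern = "~(?<=alt='|title=').+?(?='(\\s|/))~";

// $matches[0] holds the full matches, one per attribute value found.
preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

On the sample this should list the value `Andy's garden vegetables` twice, once for alt and once for title. It still relies on every value being closed by an apostrophe followed by whitespace or a slash, so text containing that sequence would be cut short.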

Bohemian

I think this is what you're asking for:

(?<=alt='|title=').+(?='\s)

I used a positive lookbehind to anchor on the opening alt=' or title=' and a positive lookahead to find the closing single apostrophe.
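
A rough PHP sketch of applying it, with two caveats. First, a pattern containing apostrophes cannot sit unescaped inside a single-quoted PHP string literal, so the sketch uses a double-quoted string. Second, `.+` is greedy, so it runs to the last apostrophe followed by whitespace. The markup below is a hypothetical variation of the question's sample with a space before `/>`, and the reluctant variant (`.+?`) is included only for comparison.

```php
<?php
// Hypothetical variation of the question's sample, with a space before "/>".
$html = <<<'HTML'
<img src='myimage.jpg' alt='Andy's garden vegetables' title='Andy's garden vegetables' />
HTML;

// Double-quoted PHP strings avoid escaping the apostrophes in the pattern.
// "\\s" yields the regex token \s.
$greedy    = "/(?<=alt='|title=').+(?='\\s)/";
$reluctant = "/(?<=alt='|title=').+?(?='\\s)/";

// preg_match returns only the first match, which starts right after alt='.
preg_match($greedy, $html, $g);
preg_match($reluctant, $html, $r);

echo $g[0], PHP_EOL; // greedy: runs on through the title attribute
echo $r[0], PHP_EOL; // reluctant: stops at the end of the alt value
```

With this input the greedy form returns `Andy's garden vegetables' title='Andy's garden vegetables`, while the reluctant form returns just `Andy's garden vegetables`. Note also that with the question's original `'/>` ending (no space), a lookahead of `'\s` alone would not find the title value at all.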

Vasili Syrakis
  • Can you give a code sample I can run for this? Not sure how to apply it. Thanks. – therobyouknow Nov 18 '13 at 08:54
  • I'm not that great with PHP, but I think the following is how you would match it: `"; $pattern = '/(?<=alt='|title=').+(?='\s)/'; preg_match($pattern, $subject, $matches); print_r($matches); ?>` – Vasili Syrakis Nov 18 '13 at 08:59
  • Thanks. I'm getting an error message about one of the `=` signs, so I'm guessing it has not been escaped: http://ideone.com/hHIgYx I tried escaping it with a backslash `\\` but still get an error: http://ideone.com/tiu8WI – therobyouknow Nov 18 '13 at 09:08
  • Can't get this to work - still errors. Once I do, I'll hopefully accept the answer... – therobyouknow Nov 18 '13 at 09:23
  • I get errors about an unexpected `=`, and when I escape these I get `PHP Warning: preg_match(): Compilation failed: unrecognized character after (? or (?- at offset 25 in /home/lw7VZd/prog.php`. Once it works, I'll hopefully accept the answer... – therobyouknow Nov 18 '13 at 09:29