3

I have a table named sentEmails where the body column contains the body text of an email.

In the body text, there is a substring like:

some link: <a href="https://somelink@somesite.com/somePage.php?someVar=someVal&sentby=agent">Random link text

Using MySql, I need to extract the url from this column like https://somelink@somesite.com/somePage.php?someVar=someVal&sentby=agent

I was thinking something like the below would work by finding the starting location and returning the next 150 chars, of course it actually just returns the first 150 chars.

SELECT LEFT(body, LOCATE('some link: <a href="', body)+150)  AS link
FROM sentEmails
WHERE sent between date_sub(now(),INTERVAL 1 WEEK) and now()
AND body like '%some link:%'
AND toEmail = 'email@gmail.com'

Additional info:

  • the link will always be preceded by the text some link:
  • Random link text at the end will change
  • I can live with getting a bit more of the text than need if I have to, for example, getting https://somelink@somesite.com/somePage.php">Random link text would be acceptable
  • the text shown above is a substring of the full body column which contains much more text
  • This isnt something Im going to be doing often. Im researching an issue and I need the links from 40-50 of these rows, Im just hoping to avoid having to pull the link manually from each row.
  • I can only use MySQL Query Browser to access this DB if I could connect with php, this would be trivial
  • The url in question, can have 6-25 parameters in it
  • The url in question will always end with this parameter &sentby=agent
Wesley Smith
  • 19,401
  • 22
  • 85
  • 133

4 Answers4

0

If you had two unique delimiters around the URL, then could just use SUBSTRING() to isolate it. One approach would be to replace the two sides of the URL in the anchor tag with a delimeter:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(
    REPLACE(REPLACE(body, '<a href="', '~'), '&sentby=agent">', '&sentby=agent~'), '~', -2),
    '~', 1)
FROM sentEmails
WHERE sent BETWEEN DATE_SUB(NOW(), INTERVAL 1 WEEK) AND NOW() AND
    body LIKE '%some link:%' AND
    toEmail = 'email@gmail.com'

I replaced <a href=" and "> with ~. If ~ does not occur anywhere in the body column, and if you only have one HTML tag in the body, then this should work.

If the body column is just a big chunk of HTML, then you should consider using xpath and handling this in your app layer.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • What if there are other quotes not around links? Like classes or any other occurence? – Nico Andrade Jun 01 '16 at 04:29
  • This is really close, but as Nico points out, the content does indeed have other "s. I tried updating this to `SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(body, 'some link: – Wesley Smith Jun 01 '16 at 04:39
  • Unfortunately, the body column is indeed just a big chunk of HTML. This isnt something Im going to be doing often. Im researching an issue and I need the links from 40-50 of these rows, Im just hoping to avoid having to pull the link manually from each row. Actually, I just checked to be sure and the url will always end with `&sentby=agent` – Wesley Smith Jun 01 '16 at 05:03
  • You should be using an XML parser for this, rather than the string functions of MySQL. Nevertheless, I updated my answer again and you can give it a try. – Tim Biegeleisen Jun 01 '16 at 05:20
  • Thanks Tim, I updated the question to also state that I am only allowed to use MySQL Query Browser to access this DB. I didnt think that was relevant since the SQL is all still the same but given the points about better ways this should be done, I see it is an important detail. Im home for the night but Ill try your update tomorrow and let you know. Thanks again – Wesley Smith Jun 01 '16 at 05:27
0

if you're just trying to extract the link out, can you do instr() and mid function. something like this

select mid(body,substr(body,'="'),substr(body,'">')-substr(body,'="')) from email...

substr(body,'="') = starting position of the link =" and substr(body,'">') is the end position of the link.

MID function takes (str,pos, len) and len = end position - starting position

chungtinhlakho
  • 870
  • 10
  • 21
  • Its a bit more complicated than that since the link is buried in a large mound of html that includes numerous other links and the link in question itself can have various other parameters – Wesley Smith Jun 01 '16 at 05:37
  • give us a real example of what it looks like and what the results should look like. – chungtinhlakho Jun 01 '16 at 05:43
  • Updated the question. The only difference is that there are query strings on the url and one plus is that I confirmed the URLs will always end with `sentby=agent` – Wesley Smith Jun 01 '16 at 05:51
  • I'm sorry, a little confuse, you know what it preceded by and ended with. will some text: and sentby=agent appear more than 1? – chungtinhlakho Jun 01 '16 at 06:04
  • No, that bit will only ever occur once. If I could use a regex, I'd know how to do it, but from what I found, regex won't work in the select statement – Wesley Smith Jun 01 '16 at 07:00
0

Thanks to Tim's help, I was able to get this working with the below query:

SELECT SUBSTRING_INDEX( SUBSTRING_INDEX(body, 'some link:  <a href="', -1) , 'sentby=agent">', 1) AS link
FROM sentEmails
where sent between date_sub(now(),INTERVAL 1 WEEK) and now()
AND body like '%some link:%'
AND toEmail = 'email@gmail.com'
Wesley Smith
  • 19,401
  • 22
  • 85
  • 133
-2

Doing this kind of search is not convenient. As the table with emails grows in size, the query will be less and less performant.

If this is a new application you're building, you're better with keeping a separate table with the list of URLs used on each sent email. You'd write the URLs to the DB as you send the emails.

The reasoning of this is that the App will do more searches in the DB than sending emails. Therefore, by doing a little extra work when sending emails, you help a lot in the most-expensive usage of the feature, which is the search.

If you still decide to keep the current approach, you'll want to have an index containing the columns (toEmail, sent) in this order.

Other than that, your approach makes sense and will work. Did you actually try it? Does it work for you?

Nico Andrade
  • 880
  • 5
  • 16
  • I did try it but it just returns the first 150 chars. You make valid points. But this isnt something Im going to be doing often and wont end up in any production code. Im researching an issue and I need the links from 40-50 of these rows, Im just hoping to avoid having to pull the link manually from each row as they hold a ton of text and I can only use MySql Browser to access the DB. If I could connect with php, this would all be trivial – Wesley Smith Jun 01 '16 at 05:20
  • You'll have another issue. This query will return only 1 match per email. In case you have more than one link per email, you won't get the 2nd one. – Nico Andrade Jun 02 '16 at 03:54
  • Thanks, there is only one target link in each email – Wesley Smith Jun 02 '16 at 04:03