2

This question revolves around using CONCAT and SUBSTRINGS to UPDATE the content in mysql TABLE

If I have a link stored in a mysql database that consists of

https://this.example.com/work-a-link.php?https://that.example.com/thisisworthkeeping.php

The following code will remove the first instance and leaves the second instance of the url with a prefix of http:// url ( when in theory in this case it should be https:// - since the second URL began with an https://

DROP TABLE IF EXISTS content_temp;
CREATE TABLE content_temp AS SELECT * FROM content GROUP BY Title ORDER BY ID ASC;
UPDATE content_temp SET link = CONCAT('http://', SUBSTRING_INDEX(link, 'https://', -1));
UPDATE content_temp SET link = CONCAT('http://', SUBSTRING_INDEX(link, 'http://', -1));
UPDATE content_temp SET link=replace(link,'http://https://','http://');
UPDATE content_temp SET link=replace(link,'http://http://','http://');

And that is the problem I am seeking the assistance of someone who can help me out or point me in the right direction - I want any https:// in the second instance to be left as https:// - whereas my bothched up code is changing it to http://

What I want to accomplish is to respect and keep whatever the second instance is - whether it was http OR https.

The only other random factor in all this... is that the first instance may be http OR https

So the end result of rejigging this code would be :

( A )

http://this.example.com/work-a-link.php?http://that.example.com/thisisworthkeeping.php 

would give http://that.example.com/thisisworthkeeping.php.

( B )

http://this.example.com/work-a-link.php?https://that.example.com/thisisworthkeeping.php 

would give https://that.example.com/thisisworthkeeping.php.

( C )

https://this.example.com/work-a-link.php?http://that.example.com/thisisworthkeeping.php 

would give http://that.example.com/thisisworthkeeping.php.

and

( D )

https://this.example.com/work-a-link.php?https://that.example.com/thisisworthkeeping.php 

would give https://that.example.com/thisisworthkeeping.php.

Dharman
  • 30,962
  • 25
  • 85
  • 135
Snp
  • 21
  • 2
  • Do all your `link` values to be changed contain the string `.php?http` ? Can you split on the `?` ? – O. Jones May 02 '18 at 00:00
  • Code blocks surrounded in backticks or indented four spaces are your friends: URLs in them don't get made into links, so they don't get caught by the antispam stuff. Also, there's a reserved domain `example.com` to use for this kind of explanation. – O. Jones May 02 '18 at 00:11

2 Answers2

0

If all your links to be changed contain the substring .php? you can do this:

UPDATE content_temp SET link = SUBSTRING_INDEX(link, '.php?', -1)
      WHERE link LIKE '%.php?%'

Pro tip: It's always a good idea to put a WHERE clause on an UPDATE operation: without it SQL updates all rows in the table. That can trash the table, and it can swamp the transaction log.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
0

It looks like you have a total of four possible combinations we're looking at. The first can be either 'http' or 'https', and the second can be either 'http' or 'https'. Four possible combinations:

http:/  http:/
http:/  https:/
https:/ http:/
https:/ https:/

A couple of suggestions:

Firstly, before we execute an UPDATE statement that we're unsure that it is going to do what we intend it to do, we should write a SELECT statement to test the expressions. That let's us see the results of the expression, so we can verify that it is doing what we intend it to, under a variety of test conditions.

Secondly, it should be be possible to separate out the combinations (the four possibilities). If the changes we are applying is "removing" the first http/https leaving only one occurrence... then a modification we apply to one of the subsets won't make a row then move into a different subset. (I understood what I meant to say there, it probably came across garbled.)

If I've got row like this

http 1   https 2 

and I change that with an update to be

https 2 

Then a subsequent run through the rows, that's not going to match a check for

https  https 

because there's only one occurrence of http in the string.


Let's assume that we're only interested in rows where the link string contains two occurrences of http:// / https://

We could use a regular expression do some matching, or we can fudge it with some LIKE comparisons

SELECT t.link
     , t.link LIKE '%http://%http://%'   AS c1 
     , t.link LIKE '%http://%https://%'  AS c2 
     , t.link LIKE '%https://%http://%'  AS c3 
     , t.link LIKE '%https://%https://%' AS c4
  FROM (

SELECT 'http://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php' AS link
UNION ALL
SELECT 'http://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php'
UNION ALL
SELECT 'https://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php'
UNION ALL 
SELECT 'https://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php'

       ) t
 WHERE t.link LIKE '%http%://%http%://%'
   AND t.link NOT LIKE '%http%://%http%://%http%://%'  

returns something like this (with some string replacement in the link values to shorten them here) ...

link                                                           c1  c2  c3  c4
-------------------------------------------------------------- --  --  --  --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp     1   0   0   0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp    0   1   0   0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp    0   0   1   0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp   0   0   0   1

We've include a condition in the WHERE clause to "filter out" any URL where there are more than two occurrences of "http[s]://". (There's a little bit of fudging in there, the % wildcard could match any number of characters, so we're not exactly checking for https:// and http:// ... (again, we could implement a regular expression (REGEXP or RLIKE) comparison to be more exact.

But notice how we've managed to "categorize" the values link into c1, c2, c3, and c4.

Confirming that each link falls into a single category, We can include an additional expression in the SELECT list

     , CASE 
       WHEN t.link LIKE '%http://%http://%'   THEN 'c1'
       WHEN t.link LIKE '%http://%https://%'  THEN 'c2'
       WHEN t.link LIKE '%https://%http://%'  THEN 'c3' 
       WHEN t.link LIKE '%https://%https://%' THEN 'c4'
       ELSE NULL
       END

If we find a matching WHEN condition, we return the THEN, and we're done. (If there is an overlap, a link that falls into more than one category, we won't see it in this expression.)

link                                                            c   c1  c2  c3  c4
-------------------------------------------------------------- --  --  --  --  --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp    c1   1   0   0   0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp   c2   0   1   0   0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp   c3   0   0   1   0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp  c4   0   0   0   1

We can test with a variety of link values, various patterns, and verify that our matching "categorization" is working the way we intend.

(This technique of using a SELECT statement to test expressions, and especially functions we're not familiar with... with SUBSTRING_INDEX, what gets returned when the search string isn't found? is the REPLACE function case sensitive? and so on.)

We can adjust and tweak, experiment and find the right combinations to get it working the way we want. Once we get that,

We can include yet another expression in the SELECT list. We'll copy the one we just added, but this time, instead of returning a literal, we will include a different expression, one that exercises the SUBSTRING_INDEX and REPLACE functions.

Since we're running a SELECT we know that we're not going to mangle/munge the contents of the table. We're just testing some expressions, to see what they return.

     , CASE 
       WHEN t.link LIKE '%http://%http://%'    -- 'c1'
       THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))

       WHEN t.link LIKE '%http://%https://%'   -- 'c2'
       THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))

       WHEN t.link LIKE '%https://%http://%'   -- 'c3' 
       THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))

       WHEN t.link LIKE '%https://%https://%'  -- 'c4'
       THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))

       ELSE NULL
       END AS new_link

And pretty soon we'll be getting back a result like this...

link                                                           new_link 
-------------------------------------------------------------  ---------------------------------
http://somedomain.com/walp?http://someotherdomain.com/tiwkp    http://someotherdomain.com/tiwkp
http://somedomain.com/walp?https://someotherdomain.com/tiwkp   https://someotherdomain.com/tiwkp
https://somedomain.com/walp?http://someotherdomain.com/tiwkp   http://someotherdomain.com/tiwkp
https://somedomain.com/walp?https://someotherdomain.com/tiwkp  https://someotherdomain.com/tiwkp

Once we have an expression working that is returning a new_link value that we want to assign to the column, in place of the link value,

(and we're running this against a test copy of the table first)

We can convert our SELECT statement into an UPDATE

replace SELECT ... FROM with UPDATE

and add a SET clause to assign the new_link expression to link

(replace the NULL in the ELSE with a reference to the column, so in the event we fall through all of the conditions we're checking in the CASE , we won't be changing that row ...)

UPDATE mytesttable t
   SET t.link
     = CASE 
       WHEN t.link LIKE '%http://%http://%'    -- 'c1'
       THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))

       WHEN t.link LIKE '%http://%https://%'   -- 'c2'
       THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))

       WHEN t.link LIKE '%https://%http://%'   -- 'c3' 
       THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))

       WHEN t.link LIKE '%https://%https://%'  -- 'c4'
       THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))

       ELSE t.link
       END
 WHERE t.link LIKE '%http%://%http%://%'  
   AND t.link NOT LIKE '%http%://%http%://%http%://%'  

But before we run the UPDATE, we should test our expressions with a variety of link values, including edge and corner cases.

And again note that those LIKE comparisons I wrote have the potential to match wonky rows we might not want to match e.g. http://BLAH http DERP :// flurb http://.

Looks like maybe we should also be checking for a question mark before the second occurrences of http.

We would get more accurate pattern matching using regular expressions, REGEXP (RLIKE) comparisons.


The theme remains the same... test expressions with SELECT first, and then run the UPDATE.

spencer7593
  • 106,611
  • 15
  • 112
  • 140