It looks like you have a total of four possible combinations we're looking at.
The first can be either 'http' or 'https', and the second can be either 'http' or 'https'. Four possible combinations:
http:/ http:/
http:/ https:/
https:/ http:/
https:/ https:/
A couple of suggestions:
Firstly, before we execute an UPDATE
statement that we're unsure that it is going to do what we intend it to do, we should write a SELECT
statement to test the expressions. That let's us see the results of the expression, so we can verify that it is doing what we intend it to, under a variety of test conditions.
Secondly, it should be be possible to separate out the combinations (the four possibilities). If the changes we are applying is "removing" the first http/https leaving only one occurrence... then a modification we apply to one of the subsets won't make a row then move into a different subset. (I understood what I meant to say there, it probably came across garbled.)
If I've got row like this
http 1 https 2
and I change that with an update to be
https 2
Then a subsequent run through the rows, that's not going to match a check for
https https
because there's only one occurrence of http in the string.
Let's assume that we're only interested in rows where the link
string contains two occurrences of http://
/ https://
We could use a regular expression do some matching, or we can fudge it with some LIKE
comparisons
SELECT t.link
, t.link LIKE '%http://%http://%' AS c1
, t.link LIKE '%http://%https://%' AS c2
, t.link LIKE '%https://%http://%' AS c3
, t.link LIKE '%https://%https://%' AS c4
FROM (
SELECT 'http://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php' AS link
UNION ALL
SELECT 'http://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php'
UNION ALL
SELECT 'https://somedomain.com/work-a-link.php?http://someotherdomain.com/thisisworthkeeping.php'
UNION ALL
SELECT 'https://somedomain.com/work-a-link.php?https://someotherdomain.com/thisisworthkeeping.php'
) t
WHERE t.link LIKE '%http%://%http%://%'
AND t.link NOT LIKE '%http%://%http%://%http%://%'
returns something like this (with some string replacement in the link values to shorten them here) ...
link c1 c2 c3 c4
-------------------------------------------------------------- -- -- -- --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp 1 0 0 0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp 0 1 0 0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp 0 0 1 0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp 0 0 0 1
We've include a condition in the WHERE
clause to "filter out" any URL where there are more than two occurrences of "http[s]://". (There's a little bit of fudging in there, the %
wildcard could match any number of characters, so we're not exactly checking for https:// and http:// ... (again, we could implement a regular expression (REGEXP or RLIKE) comparison to be more exact.
But notice how we've managed to "categorize" the values link into c1, c2, c3, and c4.
Confirming that each link falls into a single category,
We can include an additional expression in the SELECT list
, CASE
WHEN t.link LIKE '%http://%http://%' THEN 'c1'
WHEN t.link LIKE '%http://%https://%' THEN 'c2'
WHEN t.link LIKE '%https://%http://%' THEN 'c3'
WHEN t.link LIKE '%https://%https://%' THEN 'c4'
ELSE NULL
END
If we find a matching WHEN condition, we return the THEN, and we're done. (If there is an overlap, a link that falls into more than one category, we won't see it in this expression.)
link c c1 c2 c3 c4
-------------------------------------------------------------- -- -- -- -- --
http://somedomain.com/walp?http://someotherdomain.com/tiwkp c1 1 0 0 0
http://somedomain.com/walp?https://someotherdomain.com/tiwkp c2 0 1 0 0
https://somedomain.com/walp?http://someotherdomain.com/tiwkp c3 0 0 1 0
https://somedomain.com/walp?https://someotherdomain.com/tiwkp c4 0 0 0 1
We can test with a variety of link
values, various patterns, and verify that our matching "categorization" is working the way we intend.
(This technique of using a SELECT statement to test expressions, and especially functions we're not familiar with... with SUBSTRING_INDEX, what gets returned when the search string isn't found? is the REPLACE function case sensitive? and so on.)
We can adjust and tweak, experiment and find the right combinations to get it working the way we want. Once we get that,
We can include yet another expression in the SELECT list. We'll copy the one we just added, but this time, instead of returning a literal, we will include a different expression, one that exercises the SUBSTRING_INDEX
and REPLACE
functions.
Since we're running a SELECT
we know that we're not going to mangle/munge the contents of the table. We're just testing some expressions, to see what they return.
, CASE
WHEN t.link LIKE '%http://%http://%' -- 'c1'
THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))
WHEN t.link LIKE '%http://%https://%' -- 'c2'
THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))
WHEN t.link LIKE '%https://%http://%' -- 'c3'
THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))
WHEN t.link LIKE '%https://%https://%' -- 'c4'
THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))
ELSE NULL
END AS new_link
And pretty soon we'll be getting back a result like this...
link new_link
------------------------------------------------------------- ---------------------------------
http://somedomain.com/walp?http://someotherdomain.com/tiwkp http://someotherdomain.com/tiwkp
http://somedomain.com/walp?https://someotherdomain.com/tiwkp https://someotherdomain.com/tiwkp
https://somedomain.com/walp?http://someotherdomain.com/tiwkp http://someotherdomain.com/tiwkp
https://somedomain.com/walp?https://someotherdomain.com/tiwkp https://someotherdomain.com/tiwkp
Once we have an expression working that is returning a new_link
value that we want to assign to the column, in place of the link
value,
(and we're running this against a test copy of the table first)
We can convert our SELECT statement into an UPDATE
replace SELECT ... FROM
with UPDATE
and add a SET
clause to assign the new_link expression to link
(replace the NULL in the ELSE with a reference to the column, so in the event we fall through all of the conditions we're checking in the CASE , we won't be changing that row ...)
UPDATE mytesttable t
SET t.link
= CASE
WHEN t.link LIKE '%http://%http://%' -- 'c1'
THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))
WHEN t.link LIKE '%http://%https://%' -- 'c2'
THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))
WHEN t.link LIKE '%https://%http://%' -- 'c3'
THEN CONCAT('http://', SUBSTRING_INDEX(t.link, 'http://', -1))
WHEN t.link LIKE '%https://%https://%' -- 'c4'
THEN CONCAT('https://', SUBSTRING_INDEX(t.link, 'https://', -1))
ELSE t.link
END
WHERE t.link LIKE '%http%://%http%://%'
AND t.link NOT LIKE '%http%://%http%://%http%://%'
But before we run the UPDATE, we should test our expressions with a variety of link values, including edge and corner cases.
And again note that those LIKE
comparisons I wrote have the potential to match wonky rows we might not want to match e.g. http://BLAH http DERP :// flurb http://
.
Looks like maybe we should also be checking for a question mark before the second occurrences of http
.
We would get more accurate pattern matching using regular expressions, REGEXP (RLIKE) comparisons.
The theme remains the same... test expressions with SELECT first, and then run the UPDATE.