1

Assuming my table is

WITH `sample_project.sample_dataset.table` AS (
  SELECT 'user1' user, 2 sequence, 'T1' ts UNION ALL
  SELECT 'user1', 2, 'T2' UNION ALL
  SELECT 'user1', 1, 'T3' UNION ALL
  SELECT 'user1', 1, 'T4' UNION ALL
  SELECT 'user1', 3, 'T5' UNION ALL
  SELECT 'user1', 2, 'T6' UNION ALL
  SELECT 'user1', 3, 'T7' UNION ALL
  SELECT 'user1', 3, 'T8' 
)

Can I find Subsequence of Integers available in sequence column without using STRING_AGG and REGEX OR JOIN operations ? This is to make query more efficient.

A subsequence is a part of String. For example consider String "banana", A sample subsequence is "anna" as each index character of "anna" from banana is strictly increasing. Characters in a subsequence need not be contiguous.

Say for the above table when order by timestamp (INCREASING), I would get STRING_AGG for sequence column as 22113233. In the String 22113233 subsequence 1 2 3 is available whereas subsequence 3 2 1 is not available. Given a subsequence 213, How can I say if this subsequence is available or not (in 22113233 which sorted by timestamp) ?

phaigeim
  • 729
  • 13
  • 34
  • 1
    what would be definition of subsequence? please explain logic. and why you want to avoid string_agg? how you would otherwise construct final subsequence list string – Mikhail Berlyant Dec 14 '17 at 17:59
  • i see update, but sorry, i am still not getting the logic applicable to the data example you provided. – Mikhail Berlyant Dec 14 '17 at 18:09
  • @MikhailBerlyant I have added the definition of subsequence. My Query is working good with STRING_AGG and REGEX, but the problem is the number of items of `sequence` column for a particular user which is very large whereas the length of subsequence of items that I want to search is very small. I am curious if I can optimise here. – phaigeim Dec 14 '17 at 18:10
  • @MikhailBerlyant Am I able to explain it more clearly ? – phaigeim Dec 14 '17 at 18:13
  • 1
    what you mean by "Can I find a given ..." - I just don't get it! if you think you explained it very clearly and nothing more to clarify - someone else might be able to help – Mikhail Berlyant Dec 14 '17 at 18:16
  • 1
    i think now it is clear :o) even though it is still confusing that you want to eliminate string_agg - how you would get `22113233` without string_agg! – Mikhail Berlyant Dec 14 '17 at 18:21
  • @MikhailBerlyant Thanks :) – phaigeim Jan 10 '18 at 07:53

1 Answers1

3

Given a subsequence 213, How can I say if this subsequence is available or not (in 22113233 ...

Below example is for BigQuery SQL

#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list 
), `subsequenses` AS (
  SELECT '123' subsequence UNION ALL
  SELECT '321' UNION ALL
  SELECT '213'
)
SELECT sequence_list, subsequence, 
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE(subsequence, '', '.*')) available
FROM `sequences` l
CROSS JOIN `subsequenses` s   

with result as below

sequence_list   subsequence     available    
22113233        321             false    
22113233        123             true     
22113233        213             true     

if you are looking for specific subsequence - this can be further simplified as

#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list UNION ALL
  SELECT '11223322'
)
SELECT sequence_list,  
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE('213', '', '.*')) available
FROM `sequences`

with result as

sequence_list   available    
22113233        true     
11223322        false    
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Can I find Subsequence of Integers available in sequence column without using STRING_AGG and REGEX OR JOIN operations ? This is to make query more efficient. I wanted to solve this without using this in order to optimise it, rather than keeping every Integer of the sequence column to construct sequence list. – phaigeim Dec 14 '17 at 19:26
  • I have used STRING_AGG and REGEX which is taking more time.. I want to optimise it – phaigeim Dec 14 '17 at 19:27
  • 1
    sorry for comparison but what you ask sounds equivalent to me if you ask to write sql without using sql. if i will get an idea how to avoid STRING_AGG and REGEXP here - i will definitelly reply again. But i don't think it will be more optimal then using those! – Mikhail Berlyant Dec 14 '17 at 19:29
  • Thanks @MikhailBerlyant . Rather than doing STRING_AGG and then doing REGEX, can we just look for subsequence while during AGGREGATING and mark it true/false ? – phaigeim Dec 14 '17 at 19:44
  • 1
    Technically, it is doable - but I doubt it will be more optimal as it still will involve JOINing and extra handling of orders. so I doubt it will be better. – Mikhail Berlyant Dec 14 '17 at 19:57
  • I am happy that we both agree it is doable. I am curious if for a user I have one billion entries in `sequence` column and if I am interested in finding subsequence `123` which I may be can find in first 1 million, so that I don't need other 99 million to store, thereby reducing shuffling and other computational costs.. – phaigeim Dec 14 '17 at 21:24
  • i don't think this is going to work this way! what I can recommend - if this will fit into your business requirements - is to introduce some sort of threshold - let's say expected pattern should be within 1000 neighboring rows - in this case you can use analytic windowed functions - and this can optimize your stuff i think – Mikhail Berlyant Dec 14 '17 at 21:27
  • Thanks for suggesting analytic windowed functions. I will have a closer look at them. – phaigeim Dec 14 '17 at 21:44
  • sure. please note - that suggestion will work if you can introduce that "neighboring" threshold :o) – Mikhail Berlyant Dec 14 '17 at 21:45