I have a simple query I want to run on Snowflake that takes records matching a condition (let's say, on start_time), using a sample of 20,000 rows. The table is pretty heavy: 1650 columns. The two queries I'm trying are:

SELECT * FROM (
    SELECT * FROM <table_name>
    WHERE start_time BETWEEN '2023-04-01' AND '2023-04-21'
) SAMPLE (20000 ROWS)

and

SELECT * FROM <table_name>
WHERE start_time BETWEEN '2023-04-01' AND '2023-04-21'
ORDER BY RANDOM()
LIMIT 20000

Both queries take about 20 minutes to run, which is way too long, I think. Am I doing something wrong? Is there a way I can improve those run times?

AvitanD
  • @nbk can you please elaborate more? Not sure I understand – AvitanD May 08 '23 at 07:26
  • Your WHERE clause filters on start_time, so the database has to scan all records to find matches; an index can improve the query because the database finds the wanted rows quicker – nbk May 08 '23 at 07:31
  • @nbk AFAIK, Snowflake doesn't support indexing.. – AvitanD May 08 '23 at 08:25
  • Why are you doing SELECT *? Returning a large number of columns is always going to affect performance. Try just selecting the columns you actually need; otherwise you’d need to use a larger warehouse to improve performance – NickW May 08 '23 at 09:06
  • Yes, you are right, but you can improve it somewhat https://stackoverflow.com/a/58492241/5193536 – nbk May 08 '23 at 09:07
  • @AvitanD Snowflake does support indexing, they just use different words for some reason. Search the docs for "Search Optimization Service". – Limonka May 08 '23 at 09:20
  • @NickW I need all the columns - I'm writing a script that creates a decision tree that should give me suggestions regarding the best column to filter by, if that makes sense. A bit hard to explain. – AvitanD May 08 '23 at 10:09
  • What does the query plan show? What is the data type of start_time, date or timestamp? What happens if you remove the ORDER BY clause? – NickW May 08 '23 at 12:03
  • @NickW starttime and endtime are timestamp_ntz, removing the ORDER BY (keeping only the LIMIT) greatly improves the time to run the query. Regarding the plan, this is the result of EXPLAIN USING JSON on the first query (with the sample). Column names redacted: {"GlobalStats":{"partitionsTotal":1055925,"partitionsAssigned":2193,"bytesAssigned":32538711552},"Operations":[[{"id":0,"operation":"Result","expressions":[ – AvitanD May 08 '23 at 12:33
  • In large part, Snowflake's scalability comes from the fact that it ["uses a columnar store file format, that lets it only read the parts of the table that contain the fields (columns) you actually use, and thus cut down on I/O on columns that you don't use in the query."](https://stackoverflow.com/a/58495901/2067753) Try running the queries with one column and no ORDER BY, then with one column and the ORDER BY included (see the sketch after these comments). The differences between those test runs and your current performance are likely down to the overhead of calling for 1650 columns. – Paul Maxwell May 11 '23 at 00:47
  • @PaulMaxwell So what I understand from this is that Snowflake is not the right database for my needs.. – AvitanD May 16 '23 at 07:33
  • I cannot say that. I have no idea what your data is or why you have so many columns. – Paul Maxwell May 16 '23 at 12:20
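
A minimal sketch of the tests suggested in the comments above. The table and column names reuse the question's placeholders, and the search-optimization statement is standard Snowflake DDL; treat this as illustrative rather than a verified fix:

-- Baseline: one column, no ORDER BY
SELECT start_time FROM <table_name>
WHERE start_time BETWEEN '2023-04-01' AND '2023-04-21'
LIMIT 20000;

-- Same single column, with the random ordering added back
SELECT start_time FROM <table_name>
WHERE start_time BETWEEN '2023-04-01' AND '2023-04-21'
ORDER BY RANDOM()
LIMIT 20000;

-- Optional: enable the Search Optimization Service on the table
ALTER TABLE <table_name> ADD SEARCH OPTIMIZATION;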

1 Answer

Updating here for general knowledge. After lots of testing and back-and-forth questions with Snowflake experts, we've managed to end up with a much faster query (still not lightning fast, but a great improvement):

SELECT {select}
FROM {table} AS main_table
INNER JOIN (
    SELECT * FROM (
        SELECT {unique_col} FROM {table} AS innertable
        WHERE starttime BETWEEN '{starttime}' AND '{endtime}' {conditions}
    ) SAMPLE (20000 ROWS)
) AS unique_table
    ON main_table.{unique_col} = unique_table.{unique_col}
WHERE starttime BETWEEN '{starttime}' AND '{endtime}' {conditions}

The explanation here is that I pick a unique column on which I do the "heavy lifting" of sampling, and then join back to the full table using that column.
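
For illustration, here is how the template might be filled in. The table name (events), the unique key (event_id), and the date range are hypothetical placeholders, not taken from the actual setup:

SELECT *
FROM events AS main_table
INNER JOIN (
    SELECT * FROM (
        SELECT event_id FROM events AS innertable
        WHERE starttime BETWEEN '2023-04-01' AND '2023-04-21'
    ) SAMPLE (20000 ROWS)
) AS unique_table
    ON main_table.event_id = unique_table.event_id
WHERE starttime BETWEEN '2023-04-01' AND '2023-04-21'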

Also, the join is between two interim tables, both limited to the start-time range. The scans on those tables run in parallel, which greatly improves run times.

To wrap it all up, running this on a larger warehouse gives the final push needed to make this a viable solution.
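
For reference, an existing warehouse can be resized with a single statement; the warehouse name below (my_wh) is a placeholder:

ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';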

Thank you to everyone who commented.

AvitanD