4

I'm wondering what is the fastest way to get all non null rows. I've thought of these :

SELECT * FROM table WHERE column IS NOT NULL

SELECT * FROM table WHERE column = column

SELECT * FROM table WHERE column LIKE '%'

(I don't know how to measure execution time in SQL and/or Hive, and from repeatedly trying on a 4M lines table in pgAdmin, I get no noticeable difference.)

leftjoin
  • 36,950
  • 8
  • 57
  • 116
François M.
  • 4,027
  • 11
  • 30
  • 81
  • 2
    Once you use `LIKE` no index will be used....Btw `column = column` doen't make any sense to me – B001ᛦ Sep 01 '17 at 10:40
  • 2
    In the unlikely event that the most straightforward option is slower than the others, file a bug report. –  Sep 01 '17 at 10:44
  • 5
    @bub `column = column` is true unless `column` is `NULL`, because `NULL = NULL` is unknown. –  Sep 01 '17 at 10:44
  • 1
    I would expect the first 2 to have the same execution time as they are essentially doing the same thing and evaluating the same column. Maybe execution plan and take a look? I can tell you that Oracle has exactly no differences on the sexec plan between these 2 – JohnHC Sep 01 '17 at 10:48
  • 1
    According to me, SQL applies the heuristics first on every query before solving it. So Query-1 and Query-2 should have same execution time as they both are doing same things. Probably third one might have something different than others. – Ankit Bajpai Sep 01 '17 at 11:14

1 Answers1

6

You will never notice any difference in performance when running those queries on Hive because these operations are quite simple and run on mappers which are running in parallel.

Initializing/starting mappers takes a lot more time than the possible difference in execution time of these queries and adds a lot of heuristics to the total execution time because mappers may be waiting resources and not running at all.

But you can try to measure time, see this answer about how to measure execution time: https://stackoverflow.com/a/44872319/2700344

SELECT * FROM table WHERE column IS NOT NULL is more straightforward (understandable/readable) though all of queries are correct.

leftjoin
  • 36,950
  • 8
  • 57
  • 116