-1

Recently I am exploring HPE Vertica a bit. Is it possible to find summary statistics (mean,sd,quartiles,max,min,counts etc) from a data table loaded in vertica? These two links; https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_STATISTICS.htm

https://my.vertica.com/docs/7.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/ANALYZE_HISTOGRAM.htm

say that we can find statistics & histogram from the data but the result is making no sense to me.

According to it, the ANALYZE_STATISTICS command will throw a 0 for successful execution. Like

NEWDB_aug17=> SELECT ANALYZE_STATISTICS ('MM_schema.capitalline'); 
 ANALYZE_STATISTICS 
--------------------
                  0
(1 row)

Here NEWDB_aug17 is the database, schema is MM_schema under which capitalline table was inserted. But where are the summary measures, i mean the numbers we are actually looking for? Only a 0 is not going to serve my purpose.

Can you please guide me in this context?

Stefan Zobel
  • 3,182
  • 7
  • 28
  • 38
Hindol Ganguly
  • 363
  • 1
  • 4
  • 16

2 Answers2

2

Vertica saves the statistics collected by ANALYZE_STATISTICS() in the catalog location.

These statistics are later used to calculate best query execution plan.

You can find the statistics details in the system table v_internal.dc_analyze_statistics

[dbadmin@vertica-1 ~]$ vsql
dbadmin=> \x
Expanded display is on.
dbadmin=> select * from v_internal.dc_analyze_statistics limit 1;
-[ RECORD 1 ]----+-----------------------------------
time             | 2017-08-21 02:07:03.287895+00
node_name        | v_test_node0001
session_id       | v_test_node0001-502811:0x834a4
user_id          | 45035996273704962
user_name        | dbadmin
transaction_id   | 45035996307673368
statement_id     | 9
request_id       | 1
table_name       | test_table
proj_column_name | test_column
proj_name        | test_table_sp_v11_b1
table_oid        | 45036013037102108
proj_column_oid  | 45036013037111264
proj_row_count   | 119878353211
disk_percent     | 10
disk_read_rows   | 11987835321
sample_rows      | 131072
sample_bytes     | 7602176
start_time       | 2017-08-21 02:07:03.657377+00
end_time         | 2017-08-21 02:07:24.799398+00

Time: First fetch (1 row): 849.467 ms. All rows formatted: 849.594 ms

Or at this path:

{your_catalog_location}/{db_name}/{node_name}_catalog/DataCollector/AnalyzeStatistics_*.log
ya24
  • 490
  • 1
  • 4
  • 16
  • Thanks for your reply. But the result is only capable of returning the counts in terms of summary. Can not I get mean, sd, quartiles, min max in a consolidated view? Just like in R, we write summary(df) or in python we write df.describe() where df is my data. – Hindol Ganguly Aug 24 '17 at 07:21
  • You cannot get it within Vertica without 3rd party tool. – ya24 Aug 25 '17 at 16:16
1

percentile_cont function of Vertica would be helpful in retrieving quartile.

create table test
(metric_value integer);

insert into test values(1);
insert into test values(2);
insert into test values(3);
insert into test values(4);
insert into test values(5);
insert into test values(6);
insert into test values(7);
insert into test values(8);
insert into test values(9);
insert into test values(10);

alter table anatest add column metric varchar(100) default 'abc';

select
 metric_value,
 percentile_cont(1) within group (order by metric_value) over (partition by metric) as max,
 percentile_cont(.75) within group (order by metric_value ) over (partition by metric) as q3,
 percentile_cont(.5) within group (order by metric_value ) over (partition by metric) as median,
 percentile_cont(.25) within group (order by metric_value ) over (partition by metric) as q1,
 percentile_cont(0) within group (order by metric_value ) over (partition by metric) as min
 from test ;