Greenplum 注意对其数据类型,否则优化器让你好看

在测试tpch时,发现有一些SQL跑超时,原因是测试SQL中有一些JOIN查询,而这些JOIN的列数据类型不一致,导致无法使用索引,或者无法使用hash join。
例子:
int和int8,都是整型,只是长度不一样。关联时,会遇到性能问题,因为不能使用HASH JOIN。
digoal=# create table t(id int);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# insert into t select generate_series(1,10000);
INSERT 0 10000
digoal=# insert into t1 select generate_series(1,10000);
INSERT 0 10000
关联字段为int和int8,不能使用hash join,只能用nestloop。
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
即使强制关掉nestloop也不行。
digoal=# set enable_nestloop=off;
SET
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
和数据量当然并没有关系。
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t ,t1 where t.id=t1.id;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.00..218902.00 rows=10000 width=12)
   ->  Nested Loop  (cost=0.00..218902.00 rows=455 width=12)
         ->  Broadcast Motion 22:22  (slice1; segments: 22)  (cost=0.00..2422.00 rows=10000 width=4)
               ->  Seq Scan on t  (cost=0.00..122.00 rows=455 width=4)
         ->  Index Scan using idx_t1 on t1  (cost=0.00..0.97 rows=1 width=8)
               Index Cond: t.id = t1.id
 Settings:  enable_nestloop=off; enable_seqscan=off
 Optimizer status: legacy query optimizer
(8 rows)
接下来,使用同样的数据类型,可以用HASH JOIN关联。性能大幅提升。
digoal=# create table t2(id int8);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t2 select * from t1;
INSERT 0 10010000
digoal=# create index idx_t2 on t2(id);
CREATE INDEX
digoal=# explain select * from t2 join t1 on( t2.id=t1.id );
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=236320.07..497677.13 rows=10013203 width=16)
   ->  Hash Join  (cost=236320.07..497677.13 rows=455146 width=16)
         Hash Cond: t1.id = t2.id
         ->  Seq Scan on t1  (cost=0.00..111158.22 rows=455161 width=8)
         ->  Hash  (cost=111155.03..111155.03 rows=455146 width=8)
               ->  Seq Scan on t2  (cost=0.00..111155.03 rows=455146 width=8)
 Optimizer status: legacy query optimizer
(7 rows)

另一个例子,当char长度不一样,不会有以上类似的问题。
digoal=# create table t(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# create table t1(id char(20));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
digoal=# insert into t1 select generate_series(1,10000000);
INSERT 0 10000000
digoal=# insert into t select generate_series(1,10000000);
INSERT 0 10000000
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# create index idx_t on t(id);
CREATE INDEX
digoal=# create index idx_t1 on t1(id);
CREATE INDEX
digoal=# explain select * from t join t1 on( t.id=t1.id );
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=240352.38..503423.59 rows=9975039 width=32)
   ->  Hash Join  (cost=240352.38..503423.59 rows=453411 width=32)
         Hash Cond: t.id = t1.id
         ->  Seq Scan on t  (cost=0.00..113396.58 rows=454303 width=11)
         ->  Hash  (cost=115664.39..115664.39 rows=453411 width=21)
               ->  Seq Scan on t1  (cost=0.00..115664.39 rows=453411 width=21)
 Optimizer status: legacy query optimizer
(7 rows)

digoal=# explain select * from t join t1 on( t.id=t1.id and t.id='1');
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..401.54 rows=4 width=32)
   ->  Nested Loop  (cost=0.00..401.54 rows=1 width=32)
         ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
               Index Cond: id = '1'::bpchar
         ->  Index Scan using idx_t1 on t1  (cost=0.00..200.83 rows=1 width=21)
               Index Cond: '1'::bpchar = id
 Optimizer status: legacy query optimizer
(7 rows)
varchar和char(n)也不存在问题。
postgres=# create table t(id varchar);
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# create table t1(id char(10));
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
postgres=# explain select * from t , t1 where t.id=t1.id;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice2; segments: 22)  (cost=0.01..0.07 rows=4 width=76)
   ->  Hash Join  (cost=0.01..0.07 rows=1 width=76)
         Hash Cond: t.id::bpchar = t1.id
         ->  Redistribute Motion 22:22  (slice1; segments: 22)  (cost=0.00..0.02 rows=1 width=32)
               Hash Key: t.id::bpchar
               ->  Seq Scan on t  (cost=0.00..0.00 rows=1 width=32)
         ->  Hash  (cost=0.00..0.00 rows=1 width=44)
               ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=44)
 Optimizer status: legacy query optimizer
(9 rows)

数据类型和查询条件的类型不一致时,不走索引的例子。
digoal=# explain select * from t where id=1;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Gather Motion 22:1  (slice1; segments: 22)  (cost=0.00..163369.87 rows=9995 width=11)
   ->  Seq Scan on t  (cost=0.00..163369.87 rows=455 width=11)
         Filter: id::text = '1'::text
 Optimizer status: legacy query optimizer
(4 rows)

digoal=# explain select * from t where id='1';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)
显示转换后,可以走索引。
digoal=# explain select * from t where id=1::bpchar;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 1:1  (slice1; segments: 1)  (cost=0.00..200.70 rows=1 width=11)
   ->  Index Scan using idx_t on t  (cost=0.00..200.70 rows=1 width=11)
         Index Cond: id = '1'::bpchar
 Optimizer status: legacy query optimizer
(4 rows)

greenplum因为用了早期的postgresql版本,所以这一方面优化器并不好,在使用greenplum时,需要注意一下。
greenplum将合并到postgresql 9.5的版本,以上问题可以消除。
postgresql 9.4的例子:
digoal=# create table t(id int);
CREATE TABLE
digoal=# create table t1(id int8);
CREATE TABLE
digoal=# insert into t select generate_series(1,100000);
INSERT 0 100000
digoal=# insert into t1 select generate_series(1,100000);
INSERT 0 100000
digoal=# explain select * from t,t1 where t.id=t1.id;
                              QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=2693.00..6136.00 rows=100000 width=12)
   Hash Cond: (t.id = t1.id)
   ->  Seq Scan on t  (cost=0.00..1443.00 rows=100000 width=4)
   ->  Hash  (cost=1443.00..1443.00 rows=100000 width=8)
         ->  Seq Scan on t1  (cost=0.00..1443.00 rows=100000 width=8)
(5 rows)
时间: 2023-12-03 15:17:57

Greenplum 注意对其数据类型,否则优化器让你好看的相关文章

Greenplum ORCA 优化器的编译安装与使用

背景 ORCA 是PostgreSQL的下一代优化器,在QUERY的优化上比自带的优化器有长足的进步. https://github.com/greenplum-db/gporca 安装ORCA cmake wget https://cmake.org/files/v3.5/cmake-3.5.2.tar.gz tar -zxvf cmake-3.5.2.tar.gz cd cmake-3.5.2 ./configure --prefix=/home/digoal/cmake make make

ORACLE优化器

oracle|优化 ORACLE的优化器共有3种: a. RULE (基于规则) b. COST (基于成本) c. CHOOSE (选择性) 设置缺省的优化器,可以通过对init.ora文件中OPTIMIZER_MODE参数的各种声明,如RULE,COST,CHOOSE,ALL_ROWS,FIRST_ROWS . 你当然也在SQL句级或是会话(session)级对其进行覆盖. 为了使用基于成本的优化器(CBO, Cost-Based Optimizer) , 你必须经常运行analyze 命令

如何对Oracle中的优化器进行评估优化

Oracle优化器在任何可能的时候都会对表达式进行评估,并且把特定的语法结构转换成等价的结构,这么做的原因是: ·要么结果表达式能够比源表达式具有更快的速度. ·要么源表达式只是结果表达式的一个等价语义结构. 不同的SQL结构有时具有同样的操作(例如:= ANY (subquery) and IN (subquery)),Oracle会把他们映射到一个单一的语义结构. 下面将讨论优化器如何评估优化如下的情况和表达式: 常量 LIKE 操作符 IN 操作符 ANY和SOME 操作符 ALL 操作符

MaxCompute理解数据、运算和用户的大脑:基于代价的优化器

MaxCompute简介 大数据计算服务(MaxCompute)是一种快速.完全托管的PB/EB级数据仓库解决方案,MaxCompute具备万台服务器扩展能力和跨地域容灾能力,是阿里巴巴内部核心大数据平台,承担了集团内部绝大多数的计算任务,支撑每日百万级作业规模.MaxCompute向用户提供了完善的数据导入方案以及多种经典的分布式计算模型,能够更快速的解决用户海量数据计算问题,有效降低企业成本,并保障数据安全. MaxCompute架构 MaxCompute基本的体系结构如上图所示,最底层就是

PostgreSQL 操作符与优化器详解

PostgreSQL 支持自定义操作符,本质上是调用函数来实现的. 语法如下: 例如创建一个求两个值的平均值的操作符: 首选要创建函数 postgres=# create function f_avg(numeric,numeric) returns numeric as $$ postgres$#   select ($1+$2)/2; postgres$# $$ language sql strict; CREATE FUNCTION 验证函数 postgres=# select f_avg

【云和恩墨大讲堂】从执行计划洞察ORACLE优化器的“小聪明”

作者简介黄浩  惠普 十年一剑,十年磨砺.3年通信行业,写就近3万条SQL:5年制造行业,遨游在ETL的浪潮:2年性能优化,厚积薄发自成一家 主题介绍: Oracle执行计划的另类解读:调皮的执行计划 | 诚实的执行计划 | 朴实的执行计划 说到执行计划,oracle的拥趸们自然而然会兴奋起来.在ORACLE的世界里,执行计划有着其特殊的地位,如果我们将SQL性能优化看成一个生物,那某种程度上,执行计划就是DNA.在某搜索网站中,"oracle 执行计划"关键字的搜索结果与"

设置MySQL中的数据类型来优化运行速度的实例_Mysql

今天看了一个优化案例觉的挺有代表性,这里记录下来做一个标记,来纪念一下随便的字段定义的问题. 回忆一下,在表的设计中很多人习惯的把表的结构设计成Varchar(64),Varchar(255)之类的,虽然大多数情况只存了5-15个字节.那么我看一下下面这个案例. 查询语句:   SELECT SQL_NO_CACHE channel, COUNT(channel) AS visitors FROM xxx_sources WHERE client_id = 1301 GROUP BY clien

ORACLE优化器RBO与CBO介绍总结

RBO和CBO的基本概念   Oracle数据库中的优化器又叫查询优化器(Query Optimizer).它是SQL分析和执行的优化工具,它负责生成.制定SQL的执行计划.Oracle的优化器有两种,基于规则的优化器(RBO)与基于代价的优化器(CBO)          RBO: Rule-Based Optimization 基于规则的优化器          CBO: Cost-Based Optimization 基于代价的优化器 RBO 自ORACLE 6以来被采用,一直沿用至ORA

为什么基于代价的优化器做出错误选择

基于代价的优化器(CBO)在进行全表扫描时偶尔会作出一些错误的选择,这种情况尤其发生在Oracle7和Oracle8之中.有几种情况会导致这个问题,分别如下所示:最高使用标记(High water mark)太高:当要在一个表中进行大量的删除时,最高使用标记可能会远远高于实际用到的数据块(block)数量.因此,如果依赖于最高使用标记,CBO常常会错误的调用全表扫描. 错误的优化模式:如果OPTIMIZER_MODE被设置为ALL_ROWS或者CHOOSE,那么SQL优化器会更乐于使用全表扫描.