not in/ not exists 有哪些可以替代操作-尊龙凯时

需求
　　例如一张a表(id, userid, info);其中id为自增长主键。存在userid重复记录；并一直有新数据增加，现需要根据userid去重并实时插入表b。

用sql语法来解释：

-- 获取表b中记录最大id
select max(id) from b 
-- 根据上面获取的id；提取最新的a表记录；根据userid去重；并插入中间表tmp_b;
insert into tmp_b
  select id, userid, info
    from (select id,
                 userid,
                 info,
                 row_number() over(partition by userid order by id) rn
            from a
           where id > ?)
   where rn = 1;
-- 把上面获取的记录；不存在则插入表b
insert into b
  select id, userid, info
    from tmp_b
   where userid not in (select userid from b);
            
-- 或者用not exists
insert into b
  select id, userid, info
    from tmp_b tb
   where not exists (select 1 from b b where b.userid = tb.userid);

其中需要改写的语句是

select id, userid, info
    from tmp_b
   where userid not in (select userid from b);  
-- 插入实验数据

-- 在表b插入百万条记录
insert into b select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 1000000;
-- 在表tmp_b插入十万条记录
insert into tmp_b select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 100000;

通过执行计划；会发现not in/ not exists 效率相差不大。

insert into b
select id, userid, info  from tmp_b tb
where not exists (select 1 from b b where b.userid = tb.userid);
100000 rows created.
elapsed: 00:00:00.52
execution plan
----------------------------------------------------------
plan hash value: 3462170537
------------------------------------------------------------------------------------------
| id  | operation         | name  | rows  | bytes |tempspc| cost (%cpu)|time     |
------------------------------------------------------------------------------------------
|   0 | insert statement     |     |   115k|  4389k|     |  2994   (1)|00:00:36 |
|   1 |  load table conventional | b     |     |     |     |          |     |
|*  2 |   hash join anti     |     |   115k|  4389k|  4960k|  2994   (1)|00:00:36 |
|   3 |    table access full     | tmp_b |   115k|  3601k|     |   137   (1)|00:00:02 |
|   4 |    table access full     | b     |  1375k|  9403k|     |  1372   (1)|00:00:17 |
-----------------------------------------------------------------------------------------

predicate information (identified by operation id):
---------------------------------------------------
   2 - access("b"."userid"="tb"."userid")
note
-----
   - dynamic sampling used for this statement (level=2)
statistics
----------------------------------------------------------
      0  recursive calls
       3221  db block gets
       6320  consistent gets
     35  physical reads
    3649980  redo size
    845  bytes sent via sql*net to client
    866  bytes received via sql*net from client
      3  sql*net roundtrips to/from client
      1  sorts (memory)
      0  sorts (disk)
     100000  rows processed

替换1：通过merge into改写语句

merge into b b
using tmp_b tb
on (tb.userid = b.userid)
when not matched then
  insert (b.id, b.userid, b.info) values (tb.id, tb.userid, tb.info);

通过执行计划；该sql会占用内存用于排序。效率会有大大的提升。特别是在b数据量很大的情况。优势更明显。

elapsed: 00:00:00.18
execution plan
----------------------------------------------------------
plan hash value: 2722554344
--------------------------------------------------------------------------------------
| id  | operation         | name  | rows  | bytes |tempspc| cost (%cpu)| time|
--------------------------------------------------------------------------------------
|   0 | merge statement      |         |     115k|    7203k|         |    5298   (1)| 00:01:04 |
|   1 |  merge             | b     |         |         |         |          ||
|   2 |   view             |         |         |         |         |          ||
|*  3 |    hash join outer   |         |     115k|    8553k|    4960k|    5298   (1)| 00:01:04 |
|   4 |     table access full| tmp_b |     115k|    3601k|         |     137   (1)| 00:00:02 |
|   5 |     table access full| b     |    1336k|      56m|         |    1373   (1)| 00:00:17 |
--------------------------------------------------------------------------------------
predicate information (identified by operation id):
---------------------------------------------------
   3 - access("tb"."userid"="b"."userid"())
note
-----
   - dynamic sampling used for this statement (level=2)
statistics
----------------------------------------------------------
      0  recursive calls
      0  db block gets
       5496  consistent gets
      0  physical reads
      0  redo size
    844  bytes sent via sql*net to client
    896  bytes received via sql*net from client
      3  sql*net roundtrips to/from client
      1  sorts (memory)
      0  sorts (disk)
      0  rows processed

替换2：通过kettle工具，设置转换；通过shell命令；用cron或者其他调度系统调用。该操作相对用户是透明的；至于效率方面有待验证。

最后该需求若在postgresql中；有更简洁的的写法.简单粗暴。

insert into b
 select *
from a on conflict (userid) do nothing;

原文链接：http://www.cnblogs.com/lottu/p/11578835.html