需求
例如一张a表(id, userid, info);其中id为自增长主键。存在userid重复记录;并一直有新数据增加,现需要根据userid去重并实时插入表b。
用sql语法来解释:
-- 获取表b中记录最大id select max(id) from b -- 根据上面获取的id;提取最新的a表记录;根据userid去重;并插入中间表tmp_b; insert into tmp_b select id, userid, info from (select id, userid, info, row_number() over(partition by userid order by id) rn from a where id > ?) where rn = 1; -- 把上面获取的记录;不存在则插入表b insert into b select id, userid, info from tmp_b where userid not in (select userid from b); -- 或者用not exists insert into b select id, userid, info from tmp_b tb where not exists (select 1 from b b where b.userid = tb.userid);
其中需要改写的语句是
select id, userid, info from tmp_b where userid not in (select userid from b);
-- 插入实验数据
-- 在表b插入百万条记录
insert into b select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 1000000;
-- 在表tmp_b插入十万条记录
insert into tmp_b select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 100000;
通过执行计划;会发现not in/ not exists 效率相差不大。
insert into b select id, userid, info from tmp_b tb where not exists (select 1 from b b where b.userid = tb.userid); 100000 rows created. elapsed: 00:00:00.52 execution plan ---------------------------------------------------------- plan hash value: 3462170537 ------------------------------------------------------------------------------------------ | id | operation | name | rows | bytes |tempspc| cost (%cpu)|time | ------------------------------------------------------------------------------------------ | 0 | insert statement | | 115k| 4389k| | 2994 (1)|00:00:36 | | 1 | load table conventional | b | | | | | | |* 2 | hash join anti | | 115k| 4389k| 4960k| 2994 (1)|00:00:36 | | 3 | table access full | tmp_b | 115k| 3601k| | 137 (1)|00:00:02 | | 4 | table access full | b | 1375k| 9403k| | 1372 (1)|00:00:17 | ----------------------------------------------------------------------------------------- predicate information (identified by operation id): --------------------------------------------------- 2 - access("b"."userid"="tb"."userid") note ----- - dynamic sampling used for this statement (level=2) statistics ---------------------------------------------------------- 0 recursive calls 3221 db block gets 6320 consistent gets 35 physical reads 3649980 redo size 845 bytes sent via sql*net to client 866 bytes received via sql*net from client 3 sql*net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 100000 rows processed
替换1: 通过merge into改写语句
merge into b b using tmp_b tb on (tb.userid = b.userid) when not matched then insert (b.id, b.userid, b.info) values (tb.id, tb.userid, tb.info);
通过执行计划;该sql会占用内存用于排序。效率会有大大的提升。 特别是在b数据量很大的情况。优势更明显。
elapsed: 00:00:00.18 execution plan ---------------------------------------------------------- plan hash value: 2722554344 -------------------------------------------------------------------------------------- | id | operation | name | rows | bytes |tempspc| cost (%cpu)| time| -------------------------------------------------------------------------------------- | 0 | merge statement | | 115k| 7203k| | 5298 (1)| 00:01:04 | | 1 | merge | b | | | | || | 2 | view | | | | | || |* 3 | hash join outer | | 115k| 8553k| 4960k| 5298 (1)| 00:01:04 | | 4 | table access full| tmp_b | 115k| 3601k| | 137 (1)| 00:00:02 | | 5 | table access full| b | 1336k| 56m| | 1373 (1)| 00:00:17 | -------------------------------------------------------------------------------------- predicate information (identified by operation id): --------------------------------------------------- 3 - access("tb"."userid"="b"."userid"()) note ----- - dynamic sampling used for this statement (level=2) statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 5496 consistent gets 0 physical reads 0 redo size 844 bytes sent via sql*net to client 896 bytes received via sql*net from client 3 sql*net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 0 rows processed
替换2: 通过kettle工具,设置转换;通过shell命令;用cron或者其他调度系统调用。该操作相对用户是透明的;至于效率方面有待验证。
最后该需求若在postgresql中;有更简洁的的写法.简单粗暴。
insert into b select * from a on conflict (userid) do nothing;
原文链接:http://www.cnblogs.com/lottu/p/11578835.html