一、pg_trgm的基本概念
- Trigram(三元组):pg_trgm插件引入了Trigram概念,即从一个字符串中取出的由三个连续字符组成的文本组。对于长度小于3的字符串,pg_trgm会通过在字符串前后添加空格来填充得到长度为3的Trigram。
- 相似度计算:将文本拆解为Trigram后,通过对两个字符串之间共享的Trigram计数来度量它们的相似度。这种简单的思想已成为很多自然语言处理中量化词相似度的有效方法。
二、pg_trgm的使用场景
- 文本相似度查询:使用pg_trgm插件中的similarity函数及%操作符,可以实现对文本相似度的查询,并根据相似度进行排序。
- 索引加速查询:pg_trgm插件提供了GIST和GIN索引操作符类,允许在文本列上创建索引以加速相似度查询。这解决了传统模糊查询(如LIKE查询)在大数据量下性能较慢的问题。
三、pg_trgm的安装与使用
- 安装:pg_trgm插件通常包含在PostgreSQL的安装包中,只需在数据库中通过CREATE EXTENSION pg_trgm;命令启用即可。
- 使用:启用插件后,可以在SQL查询中使用pg_trgm提供的函数和操作符进行文本相似度查询和索引加速。
四、pg_trgm的注意事项
- 索引类型选择:pg_trgm插件通常与GIN索引配合使用,因为GIN索引对于包含大量重复值的列(如文本列)具有更好的性能。
- 字符集和编码:pg_trgm插件的性能和效果可能受到数据库字符集和编码的影响,因此在使用时需要注意字符集和编码的选择。
- 性能考虑:虽然pg_trgm插件可以显著提高文本查询的性能,但索引的创建和维护也会消耗一定的系统资源。因此,在使用时需要根据实际情况进行权衡。
五、pg_trgm与其他插件的比较
- pg_bigm:与pg_trgm类似,pg_bigm也是用于模糊查询的插件。不同之处在于,pg_bigm将对象进行切词时,每2个字节为一组进行切割,而pg_trgm则是每3个字符为一组。因此,pg_bigm可以支持任意字数的模糊查询,但索引占用空间可能较小。
- rum:rum是另一个PostgreSQL的扩展插件,用于全文搜索和相似度查询。与pg_trgm相比,rum提供了更丰富的查询功能和更高效的查询性能,但安装和使用可能相对复杂一些。
综上所述,pg_trgm是PostgreSQL中一个非常实用的扩展插件,可以显著提高文本查询的性能和效率。在使用时,需要注意索引类型选择、字符集和编码以及性能考虑等因素。同时,也可以考虑与其他插件进行比较和选择,以满足不同的查询需求。
pg_trgm安装
[sysomm@db1 ~]$ ll -h
total 20K
drwx------ 2 sysomm sysomm 80 Aug 1 11:09 key
drwxr-xr-x. 5 sysomm sysomm 4.0K Jul 31 11:30 mogdb5.0
-rw-r--r-- 1 sysomm sysomm 14K Aug 29 13:56 pg_trgm-1.0-5.0.1-01-CentOS-x86_64.tar.gz
[sysomm@db1 ~]$ gunzip pg_trgm-1.0-5.0.1-01-CentOS-x86_64.tar.gz
[sysomm@db1 ~]$ ls
key mogdb5.0 pg_trgm-1.0-5.0.1-01-CentOS-x86_64.tar
[sysomm@db1 ~]$ tar xf pg_trgm-1.0-5.0.1-01-CentOS-x86_64.tar
[sysomm@db1 ~]$ ls
key mogdb5.0 pg_trgm pg_trgm-1.0-5.0.1-01-CentOS-x86_64.tar
[sysomm@db1 ~]$ cd pg_trgm
[sysomm@db1 pg_trgm]$ make install
cp pg_trgm.so /app/mogdb/5.0/product/lib/postgresql
cp pg_trgm.control /app/mogdb/5.0/product/share/postgresql/extension
cp pg_trgm--1.0.sql /app/mogdb/5.0/product/share/postgresql/extension
cp pg_trgm--unpackaged--1.0.sql /app/mogdb/5.0/product/share/postgresql/extension
pg_trgm使用
[sysomm@db1 pg_trgm]$ gsql -r
gsql ((MogDB 5.0.1 build ae6d2ada) compiled at 2023-08-16 09:07:43 commit 0 last mr 1804 )
Non-SSL connection (SSL connection is recommended when requiring high-security)
Type "help" for help.
MogDB=# \c auxdb
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "auxdb" as user "sysomm".
auxdb=# create extension pg_trgm;
CREATE EXTENSION
auxdb=# \dx
List of installed extensions
Name | Version | Schema | Description
-----------------+---------+------------+-------------------------------------------------------------------
dist_fdw | 1.0 | pg_catalog | foreign-data wrapper for distfs access
file_fdw | 1.0 | pg_catalog | foreign-data wrapper for flat file access
hstore | 1.1 | pg_catalog | data type for storing sets of (key, value) pairs
log_fdw | 1.0 | pg_catalog | Foreign Data Wrapper for accessing logging data
pg_bulkload | 1.0 | public | pg_bulkload is a high speed data loading utility for PostgreSQL
pg_prewarm | 1.1 | public | prewarm relation data
pg_trgm | 1.0 | public | text similarity measurement and index searching based on trigrams
plpgsql | 1.0 | pg_catalog | PL/pgSQL procedural language
security_plugin | 1.0 | pg_catalog | provides security functionality
(9 rows)
auxdb=# create table trgm_test(id int, name varchar);
CREATE TABLE
auxdb=# insert into trgm_test select generate_series(1,1000000),md5(random()::name);
INSERT 0 1000000
auxdb=# select * from trgm_test limit 10;
id | name
----+----------------------------------
1 | df3acf8714b5e30a61e53557cc0355ca
2 | 3c1cbf7ad346ced08d673cb69f13364b
3 | 1bd0d4087cb509ab8e10b3b9de950e8a
4 | 522614f5c429ac659881b1f12afebbc9
5 | 1fda9f7cce404d4ad2b34991d20cdb7d
6 | fc7343a046dac901f0ae3c7aa355feb9
7 | 6163832dd0c5a9f68d8dba5c3fa1785e
8 | 988c4ecb6e348ce6c0ff8e3792a596c0
9 | 900443a9ce48839106551968b5f03487
10 | b9072c7a37922c87004c1f3ec2f836b0
(10 rows)
auxdb=# explain analyze select * from trgm_test where name like '%69089%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on trgm_test (cost=0.00..21404.74 rows=1664 width=36) (actual time=26.300..335.861 rows=21 loops=1)
Filter: ((name)::text ~~ '%69089%'::text)
Rows Removed by Filter: 999979
Total runtime: 335.958 ms
(4 rows)
auxdb=# select * from trgm_test where name like '%69089%';
id | name
--------+----------------------------------
47269 | 80120371d04f3f0be36908945317f30a
67636 | 7cfd731d05b8630f69089485d9f8020e
70302 | e05e81327ca311d50a0a969089059a95
145994 | ebc857b0dd669089467ddfbd171eb6c7
173267 | d2b5f43821d690896b7345a7dca9e7b0
243805 | f327857d969089c48e493fc7af0698aa
286662 | 69089a37b286f01343ed7fd86e22a2af
360135 | 6693a55d069089b511f016cd87343139
421487 | c55e6fe7369089df35991e8dbc881001
596222 | a184ba369089cbbafba2ebb93cf54504
698764 | 36908959b41ccba3643c5e1b821e830f
781286 | 1455ad6b7369c8f069089cfe65ccbb98
784082 | 49a5eb7576127e69089414ab8aafae2c
816171 | a27b4690895d69e09358bbc77a0040a1
852347 | 831ef698ab63ca1d195b0345d0d69089
858790 | a0d2c5748bf6ec10f34a236908975230
872476 | f04c6922113369089ed794ac950f420e
878412 | e305997eeb2a5d2374690892410658c1
887389 | f4899ec2690897935914cf9b269dc750
954648 | 88002a369089cf65a94e8ddda98f6f1a
975085 | c8a16908931c93b408648b196b786d9c
(21 rows)
auxdb=# \timing on
Timing is on.
auxdb=# select * from trgm_test where name like '%69089%';
id | name
--------+----------------------------------
47269 | 80120371d04f3f0be36908945317f30a
67636 | 7cfd731d05b8630f69089485d9f8020e
70302 | e05e81327ca311d50a0a969089059a95
145994 | ebc857b0dd669089467ddfbd171eb6c7
173267 | d2b5f43821d690896b7345a7dca9e7b0
243805 | f327857d969089c48e493fc7af0698aa
286662 | 69089a37b286f01343ed7fd86e22a2af
360135 | 6693a55d069089b511f016cd87343139
421487 | c55e6fe7369089df35991e8dbc881001
596222 | a184ba369089cbbafba2ebb93cf54504
698764 | 36908959b41ccba3643c5e1b821e830f
781286 | 1455ad6b7369c8f069089cfe65ccbb98
784082 | 49a5eb7576127e69089414ab8aafae2c
816171 | a27b4690895d69e09358bbc77a0040a1
852347 | 831ef698ab63ca1d195b0345d0d69089
858790 | a0d2c5748bf6ec10f34a236908975230
872476 | f04c6922113369089ed794ac950f420e
878412 | e305997eeb2a5d2374690892410658c1
887389 | f4899ec2690897935914cf9b269dc750
954648 | 88002a369089cf65a94e8ddda98f6f1a
975085 | c8a16908931c93b408648b196b786d9c
(21 rows)
Time: 247.768 ms
auxdb=# create index idx_trgm_test_1 on trgm_test using gin(name gin_trgm_ops);
CREATE INDEX
Time: 18986.862 ms
auxdb=# explain analyze select * from trgm_test where name like '%ad44%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on trgm_test (cost=110.00..8832.37 rows=8000 width=36) (actual time=1.284..2.077 rows=438 loops=1)
Recheck Cond: ((name)::text ~~ '%ad44%'::text)
Rows Removed by Index Recheck: 45
Heap Blocks: exact=466
-> Bitmap Index Scan on idx_trgm_test_1 (cost=0.00..108.00 rows=8000 width=0) (actual time=1.195..1.195 rows=483 loops=1)
Index Cond: ((name)::text ~~ '%ad44%'::text)
Total runtime: 2.659 ms
(7 rows)
Time: 4.093 ms
auxdb=# select * from trgm_test where name like '%305696%';
id | name
--------+----------------------------------
55434 | a58f305696f02fa94a340007a9c4de44
140638 | 5143056960d94c7b3061ed6ca8fe3e42
163271 | d9837e8795463056966005d0cbff93d5
173623 | a03056960c89a8b830025f49ef851629
812960 | 0f126471bf930aea96e3056960b0a514
935562 | 4fe41cfc469a2c962ef14305696ad012
937187 | 30569650007b7953fcddef89d2388b91
(7 rows)
Time: 2.326 ms
auxdb=# select * from trgm_test where name like '%69089%';
id | name
--------+----------------------------------
47269 | 80120371d04f3f0be36908945317f30a
67636 | 7cfd731d05b8630f69089485d9f8020e
70302 | e05e81327ca311d50a0a969089059a95
145994 | ebc857b0dd669089467ddfbd171eb6c7
173267 | d2b5f43821d690896b7345a7dca9e7b0
243805 | f327857d969089c48e493fc7af0698aa
286662 | 69089a37b286f01343ed7fd86e22a2af
360135 | 6693a55d069089b511f016cd87343139
421487 | c55e6fe7369089df35991e8dbc881001
596222 | a184ba369089cbbafba2ebb93cf54504
698764 | 36908959b41ccba3643c5e1b821e830f
781286 | 1455ad6b7369c8f069089cfe65ccbb98
784082 | 49a5eb7576127e69089414ab8aafae2c
816171 | a27b4690895d69e09358bbc77a0040a1
852347 | 831ef698ab63ca1d195b0345d0d69089
858790 | a0d2c5748bf6ec10f34a236908975230
872476 | f04c6922113369089ed794ac950f420e
878412 | e305997eeb2a5d2374690892410658c1
887389 | f4899ec2690897935914cf9b269dc750
954648 | 88002a369089cf65a94e8ddda98f6f1a
975085 | c8a16908931c93b408648b196b786d9c
(21 rows)
Time: 2.351 ms