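Background: this article follows the book MySQL Crash Course (《MySQL必知必会》).
Contents
- 14.1 Full-text search
- 14.1.1 Enabling full-text search
- 14.1.2 Performing full-text searches
- 14.1.3 Using query expansion
- 14.1.4 Using boolean queries
- 14.1.5 Summary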
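14.1 Full-text search
To understand full-text search, start with storage engines: the ENGINE value given at the end of a CREATE TABLE statement selects the engine type. Three common engines are:
- InnoDB is a reliable transaction-safe engine. In the MySQL version the book covers it does not support full-text search; from MySQL 5.6 onward, InnoDB tables support it as well;
- MEMORY is functionally equivalent to MyISAM, but because its data is stored in memory rather than on disk it is very fast (and well suited to temporary tables);
- MyISAM is a very high-performance engine. It supports full-text search but not transactions.
As you can see, not every engine supports full-text search. Following the book, the examples here therefore specify ENGINE=MyISAM.
Note: since MySQL 5.6 full-text search also works on InnoDB tables, but this article sticks to the setup used in MySQL Crash Course.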
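To check which engines a given server actually offers, and to try FULLTEXT on InnoDB, something like the following sketch may help. It assumes MySQL 5.6 or later for the second statement; the table name notes_innodb is hypothetical and not part of the book's sample data.

```sql
# List the storage engines this server offers and which one is the default.
SHOW ENGINES;

# Assumption: MySQL 5.6 or later. Hypothetical table showing FULLTEXT on InnoDB.
CREATE TABLE notes_innodb
(
  note_id   int  NOT NULL AUTO_INCREMENT,
  note_text text NULL,
  PRIMARY KEY(note_id),
  FULLTEXT(note_text)
) ENGINE=InnoDB;
```

On a server older than 5.6 the CREATE TABLE statement is expected to be rejected, which matches the book's advice to use MyISAM.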
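Earlier chapters covered two more advanced ways to search: the LIKE keyword, which matches text using wildcards, and regular expressions, which allow more complex match patterns.
These search mechanisms have several important limitations:
- Performance: wildcard and regular expression matching usually requires MySQL to try to match every row in the table (and these searches rarely use table indexes). As the number of rows to be searched grows, these searches can become very time-consuming.
- Explicit control: with wildcard and regular expression matching it is difficult (and not always possible) to control explicitly what is and is not matched. For example: requiring that one word must match, that another must not, and that a third may or may not match only if the first word did match.
- Intelligent results: although wildcard and regular expression searches are very flexible, neither provides an intelligent way to select results. For example, a search for a specific word returns all rows that contain it, without distinguishing rows with a single match from rows with multiple matches. Likewise, a search for a specific word does not find rows that lack that word but contain related words.
Full-text search addresses these limitations and others. With full-text search, MySQL does not need to look at each row individually or analyze and process each word individually. MySQL builds an index of the words in the specified columns, and searches run against those words. This lets MySQL determine quickly and efficiently which words match, which do not, and so on.
14.1.1 Enabling full-text search
To use full-text search, the columns to be searched must be indexed, and they must be reindexed continually as the data changes. This is set up when the table is designed; MySQL then handles all indexing and reindexing automatically.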
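###########################################################
# Purpose: stores notes associated with specific products #
#          (not every product has notes)                  #
# note_id    unique note ID                               #
# prod_id    product ID (matches prod_id in products)     #
# note_date  date the note was added                      #
# note_text  note text                                    #
###########################################################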
CREATE TABLE productnotes
(
note_id int NOT NULL AUTO_INCREMENT,
prod_id char(10) NOT NULL,
note_date datetime NOT NULL,
note_text text NULL ,
PRIMARY KEY(note_id),
FULLTEXT(note_text)
) ENGINE=MyISAM;
Then insert the data:
# productnotes
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(101, 'TNT2', '2005-08-17',
'Customer complaint:
Sticks not individually wrapped, too easy to mistakenly detonate all at once.
Recommend individual wrapping.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(102, 'OL1', '2005-08-18',
'Can shipped full, refills not available.
Need to order new can if refill needed.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(103, 'SAFE', '2005-08-18',
'Safe is combination locked, combination not provided with safe.
This is rarely a problem as safes are typically blown up or dropped by customers.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(104, 'FC', '2005-08-19',
'Quantity varies, sold by the sack load.
All guaranteed to be bright and orange, and suitable for use as rabbit bait.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(105, 'TNT2', '2005-08-20',
'Included fuses are short and have been known to detonate too quickly for some customers.
Longer fuses are available (item FU1) and should be recommended.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(106, 'TNT2', '2005-08-22',
'Matches not included, recommend purchase of matches or detonator (item DTNTR).'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(107, 'SAFE', '2005-08-23',
'Please note that no returns will be accepted if safe opened using explosives.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(108, 'ANV01', '2005-08-25',
'Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(109, 'ANV03', '2005-09-01',
'Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(110, 'FC', '2005-09-01',
'Customer complaint: rabbit has been able to detect trap, food apparently less effective now.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(111, 'SLING', '2005-09-02',
'Shipped unassembled, requires common tools (including oversized hammer).'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(112, 'SAFE', '2005-09-02',
'Customer complaint:
Circular hole in safe floor can apparently be easily cut with handsaw.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(113, 'ANV01', '2005-09-05',
'Customer complaint:
Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead.'
);
INSERT INTO productnotes(note_id, prod_id, note_date, note_text)
VALUES(114, 'SAFE', '2005-09-07',
'Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.
Comment forwarded to vendor.'
);
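Note the FULLTEXT(note_text) and ENGINE=MyISAM clauses above.
- FULLTEXT(): names the column to be indexed; several columns may be listed.
- ENGINE=MyISAM: selects the MyISAM engine type.
So in the CREATE TABLE statement above, FULLTEXT(note_text) makes note_text the indexed column, that is, the column available for full-text search.
Once defined, the index is maintained by MySQL automatically: it is updated whenever rows are added, updated, or deleted.
FULLTEXT can also be added after the table has been created, using ALTER TABLE.
Note: do not use FULLTEXT while importing data, because updating the index takes extra time. If you are importing data into a new table, the FULLTEXT index should not be enabled yet. Import all the data first, then alter the table to define FULLTEXT. This makes the import faster (and the total time spent indexing is less than the total time needed to index each row as it is imported).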
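The import-first, index-later order described above can be sketched as follows. The table name productnotes_staging is hypothetical; it simply mirrors the columns of productnotes without the FULLTEXT clause.

```sql
# Hypothetical staging table: same columns as productnotes, no FULLTEXT yet.
CREATE TABLE productnotes_staging
(
  note_id   int       NOT NULL AUTO_INCREMENT,
  prod_id   char(10)  NOT NULL,
  note_date datetime  NOT NULL,
  note_text text      NULL,
  PRIMARY KEY(note_id)
) ENGINE=MyISAM;

# ... bulk INSERT or LOAD DATA statements go here ...

# Build the full-text index once, after all rows are loaded.
ALTER TABLE productnotes_staging ADD FULLTEXT(note_text);
```

With only 14 sample rows the difference is not measurable; the ordering matters for large imports.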
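14.1.2 Performing full-text searches
Once the column is indexed, full-text searches are performed with two functions: MATCH() names the columns to be searched, and AGAINST() gives the search expression to use.
For example: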
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit');
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
| Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
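The value passed to MATCH() must be the same as the one in the FULLTEXT() definition. If several columns were specified, all of them must be listed (and in the correct order).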
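As a sketch of the multi-column case, consider a hypothetical articles table (not part of the book's sample data) whose index covers two columns. MATCH() then lists the same two columns in the same order.

```sql
# Hypothetical table with a two-column full-text index.
CREATE TABLE articles
(
  article_id int          NOT NULL AUTO_INCREMENT,
  title      varchar(200) NOT NULL,
  body       text         NULL,
  PRIMARY KEY(article_id),
  FULLTEXT(title, body)
) ENGINE=MyISAM;

# MATCH() names the same columns as the FULLTEXT definition.
SELECT article_id, title
FROM articles
WHERE MATCH(title, body) AGAINST('rabbit');
```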
Searches are not case-sensitive unless the BINARY keyword is used.
The same example can also be written with a LIKE clause:
SELECT note_text
FROM productnotes
WHERE note_text LIKE '%rabbit%';
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load. All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
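Both queries find the same two rows, but they are expected to get there differently. One way to look at that is EXPLAIN; the sketch below assumes the productnotes table and FULLTEXT index defined earlier. The exact plan output varies by MySQL version, so none is shown here.

```sql
# Wildcard search with a leading %: typically a full table scan.
EXPLAIN SELECT note_text
FROM productnotes
WHERE note_text LIKE '%rabbit%';

# Full-text search: the plan should report the FULLTEXT index being used.
EXPLAIN SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit');
```

In the second plan the type column would normally read fulltext. On a table this small the timing difference is negligible.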
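Neither example includes an ORDER BY clause. The LIKE version returns rows in no particularly useful order, whereas full-text search returns rows ordered by how well they match the text. Ranking is an important part of full-text search: rows with a higher rank are returned first. (In the example above both rows contain the word rabbit, but, as the book puts it, the row with rabbit as its 3rd word ranks higher than the row with rabbit as its 20th word.)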
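The ranking for the word rabbit can be displayed directly (the alias is backtick-quoted because RANK is a reserved word from MySQL 8.0 on):
SELECT note_text,
MATCH(note_text) AGAINST('rabbit') AS `rank`
FROM productnotes;
Output: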
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| note_text | rank |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
| Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. | 0 |
| Can shipped full, refills not available.Need to order new can if refill needed. | 0 |
| Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. | 0 |
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. | 1.5905543565750122 |
| Included fuses are short and have been known to detonate too quickly for some customers.Longer fuses are available (item FU1) and should be recommended. | 0 |
| Matches not included, recommend purchase of matches or detonator (item DTNTR). | 0 |
| Please note that no returns will be accepted if safe opened using explosives. | 0 |
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. | 0 |
| Item is extremely heavy. Designed for dropping, not recommended for use with slings, ropes, pulleys, or tightropes. | 0 |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. | 1.6408053636550903 |
| Shipped unassembled, requires common tools (including oversized hammer). | 0 |
| Customer complaint:Circular hole in safe floor can apparently be easily cut with handsaw. | 0 |
| Customer complaint:Not heavy enough to generate flying stars around head of victim. If being purchased for dropping, recommend ANV02 or ANV03 instead. | 0 |
| Call from individual trapped in safe plummeting to the ground, suggests an escape hatch be added.Comment forwarded to vendor. | 0 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
14 rows in set (0.09 sec)
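Here the rank column holds the rank value computed by full-text search. MySQL calculates it from the number of words in the row, the number of unique words, the total number of words in the whole index, and the number of rows that contain the word. In the output above, rows without rabbit have a rank of 0 and the two rows with rabbit each have a nonzero rank. The book attributes the higher value to the word appearing earlier in the text; position is not among the factors just listed, so the difference may equally come from that row being shorter.
If several search terms are given, rows containing more of the matching words rank higher than rows containing fewer.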
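A minimal sketch of a multi-term search that also shows and sorts by the rank, assuming the productnotes table above. The alias score is my own choice; the exact rank values depend on the data and MySQL version, so none are quoted.

```sql
# Search for two terms; rows matching both are expected to score higher.
SELECT note_id,
       MATCH(note_text) AGAINST('rabbit bait') AS score
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit bait')
ORDER BY score DESC;
```

The explicit ORDER BY is redundant for a plain full-text WHERE clause, which already sorts by rank, but it keeps the intent visible if other conditions are added later.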
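14.1.3 Using query expansion
Query expansion widens the range of full-text search results returned. For example, suppose you want to find all notes that mention anvils. Only one note contains the word anvils, but you may also want other rows that could be related to the search, even if they do not contain anvils.
This is what query expansion does. With query expansion, MySQL makes two passes over the data and index to complete the search:
- First, it performs a basic full-text search to find all rows that match the search criteria;
- Next, MySQL examines those matched rows and selects all useful words;
- Finally, MySQL performs the full-text search again, this time using not only the original criteria but also all the useful words.
Query expansion was introduced in MySQL 4.1.1.
Here is an example. First a simple full-text search, without query expansion: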
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('anvils');
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.04 sec)
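Now with query expansion:
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('anvils' WITH QUERY EXPANSION);
The same expression can be placed in the select list to see the rank of every row:
SELECT note_text, MATCH(note_text) AGAINST('anvils' WITH QUERY EXPANSION) AS `rank`
FROM productnotes;
Output of the first query: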
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Multiple customer returns, anvils failing to drop fast enough or falling backwards on purchaser. Recommend that customer considers using heavier anvils. |
| Please note that no returns will be accepted if safe opened using explosives. |
| Customer complaint:Sticks not individually wrapped, too easy to mistakenly detonate all at once.Recommend individual wrapping. |
+----------------------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set (0.04 sec)
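Explanation: query expansion is requested with the WITH QUERY EXPANSION keywords inside AGAINST(). This time 3 rows are returned. The first row contains the word anvils and therefore ranks highest. The second row has nothing to do with anvils, but it contains two words that also appear in the first row, returns and using. The third row contains the words Customer and Recommend, but these two words are far apart, so it is ranked lower.
MySQL Crash Course returns 7 rows here, which puzzled me at first: with this data there should indeed be 7 rows, 6 of them being the related rows.
Resolution: I had set the collation to utf8_bin, which is case-sensitive, and that is why fewer rows matched.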
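14.1.4 Using boolean queries
MySQL supports another form of full-text search, called boolean mode. In boolean mode you can specify details about:
- Words to be matched;
- Words to be excluded (if a row contains such a word it is not returned, even if it contains other specified words);
- Ranking hints (marking some words as more important than others, so that they rank higher);
- Expression grouping;
On MyISAM tables, boolean mode can be used even without a FULLTEXT index, but this is a very slow operation (and performance degrades as the amount of data grows).
Boolean mode uses the following boolean operators:
Boolean operator | Description |
+ | Include: the word must be present |
- | Exclude: the word must not be present |
> | Include, and increase the ranking value |
< | Include, and decrease the ranking value |
() | Group words into subexpressions (so that a subexpression can be included, excluded, ranked, and so on as a group) |
~ | Negate a word's ranking value |
* | Wildcard at the end of a word |
"" | Define a phrase (unlike a list of individual words, the whole phrase is matched for inclusion or exclusion) |
Boolean mode also requires the IN BOOLEAN MODE keywords.
Some examples:
- Match rows that contain both the words rabbit and bait.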
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('+rabbit +bait' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.07 sec)
- With no operators specified, the search matches rows that contain at least one of the words rabbit and bait.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('rabbit bait' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.06 sec)
- Match the phrase rabbit bait rather than the two separate words rabbit and bait.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('"rabbit bait"' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
+---------------------------------------------------------------------------------------------------------------------+
1 row in set (0.08 sec)
- Match either rabbit or carrot, increasing the rank of the former and decreasing the rank of the latter.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('>rabbit <carrot' IN BOOLEAN MODE);
Output:
+---------------------------------------------------------------------------------------------------------------------+
| note_text |
+---------------------------------------------------------------------------------------------------------------------+
| Quantity varies, sold by the sack load.All guaranteed to be bright and orange, and suitable for use as rabbit bait. |
| Customer complaint: rabbit has been able to detect trap, food apparently less effective now. |
+---------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.06 sec)
- Match rows that contain both the words safe and combination, lowering the rank of the latter.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('+safe +(<combination)' IN BOOLEAN MODE);
Output:
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| note_text |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
| Safe is combination locked, combination not provided with safe.This is rarely a problem as safes are typically blown up or dropped by customers. |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.08 sec)
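The examples above do not exercise the - and * operators from the table. A sketch of both against the same productnotes table follows; the search strings are my own choice and no output is shown, since I have not run them against this exact data set.

```sql
# Rows that contain heavy but no word beginning with rope (ropes, for example).
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('heavy -rope*' IN BOOLEAN MODE);

# Rows that must contain safe and must not contain combination.
SELECT note_text
FROM productnotes
WHERE MATCH(note_text) AGAINST('+safe -combination' IN BOOLEAN MODE);
```

In the sample data the word heavy appears in two notes, one of which also mentions ropes, so the first query would be expected to drop that note.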
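14.1.5 Summary
- When full-text data is indexed, short words are ignored and excluded from the index. Short words are those with 3 or fewer characters (this number can be changed if needed).
- MySQL comes with a built-in list of stopwords, words that are always ignored when full-text data is indexed. The list can be overridden if needed (see the MySQL documentation for how).
- Many words occur so often that searching for them is useless (too many results would be returned). MySQL therefore applies a 50% rule: if a word appears in more than 50% of the rows, it is treated as a stopword and ignored. The 50% rule is not used in IN BOOLEAN MODE.
- Single quotes inside words are ignored; for example, don't is indexed as dont.
- Languages without word delimiters (including Japanese and Chinese) do not return full-text search results properly.
- Full-text search requires the MyISAM engine. From MySQL 5.6 onward it can also be used with InnoDB tables.
- Full-text indexes can be created only on columns of type char, varchar, or text.
- Define which column(s) FULLTEXT covers only after the data has been imported; otherwise the import is very slow.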
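The word-length and stopword settings mentioned in this summary are exposed as server variables. A sketch for inspecting them follows; the last statement assumes MySQL 5.6 or later, where InnoDB has its own full-text settings.

```sql
# MyISAM: minimum length of words that get indexed.
SHOW VARIABLES LIKE 'ft_min_word_len';

# All MyISAM full-text settings, including ft_stopword_file.
SHOW VARIABLES LIKE 'ft%';

# Assumption: MySQL 5.6 or later with InnoDB full-text indexes.
SHOW VARIABLES LIKE 'innodb_ft%';
```

These are startup settings rather than per-query options, and after changing them an existing FULLTEXT index generally has to be rebuilt before the new values take effect.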
Note: this article describes full-text search as of MySQL 5.0.