在sphinx中应用复杂过滤条件

原创

superhuanghai 2021-08-11 10:14:48 ©著作权

文章标签 sphinx sql 数据 mysql 配置文件 文章分类 MySQL 数据库

©著作权归作者所有：来自51CTO博客作者superhuanghai的原创作品，请联系作者获取转载授权，否则将追究法律责任

一、问题的引入

在sphinx应用中，需要对数据进行复杂的条件过滤，刷选出我们需要的数据。这个过程，等同于mysql查询中的where条件。

但sphinx本身的filter并不能支持复杂的逻辑组合运算，各个过滤条件之间只能进行AND操作，连OR操作都不支持。

为了解决这个问题，之前我们是在建立索引的时候，从源数据库取出数据建立索引的时候，利用mysql的sql语句，在数据导入的时候就要把逻辑摊开。

这样会带来很多问题：

1）sphinx配置文件复杂，变动不灵活，当业务发生变化的时候，需要修改后台的sphinx配置文件，重建索引；

2）没法做到业务表现和数据分离，前后台耦合严重；

3）从源db获取数据的sql语句复杂，不方便维护调整，同时增加建立索引的时间

4）sphinx的开发维护人员，必须非常清楚业务的特点才能实现，不利于各人的分工协作

5）当逻辑实在太复杂，数据导入的时候没法将逻辑摊开的时候，采用这种方式，业务就没法实现了

二、新的解决方案

新的解决方案，利用SetSelect()和SetFilter()这两个api，对sphinx的参数进行逻辑操作来实现。

下面根据例子来说明。

1. 准备测试环境和数据

创建表格和插入数据
CREATE TABLE `t_test_filter` (
`Fcommodity_id` bigint(20) unsigned NOT NULL DEFAULT '0',
` FUserId` int(10) unsigned NOT NULL DEFAULT '0',
` FUserName` varchar(32) NOT NULL DEFAULT '',
`Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
`Ftitle` varchar(60) NOT NULL DEFAULT '',
`FProperty` int(10) unsigned NOT NULL DEFAULT '0',
`FTime` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`Fcommodity_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

创建sphinxSE的表格

CREATE TABLE `t_test_filter_sphinxSE` (
`id` bigint(20) unsigned NOT NULL,
`weight` int(11) DEFAULT '1',
`query` varchar(3072) NOT NULL,
`total_found` int(11) DEFAULT '0',
`total_match` int(11) DEFAULT '0',
`query_time` int(11) DEFAULT '0',
` FUserId` int(10) unsigned DEFAULT '0',
`Fshopid` bigint(20) unsigned NOT NULL DEFAULT '0',
`FProperty` int(10) unsigned NOT NULL DEFAULT '0',
`FTime` int(10) unsigned NOT NULL DEFAULT '0',
KEY `query` (`query`(1536))
) ENGINE=SPHINX DEFAULT CHARSET=gbk CONNECTION='sphinx://localhost:9412'

插入测试数据
insert into t_test_filter values(4567,123,"test qq nick", 1234567,"title name",2324354,9876);
insert into t_test_filter values(2,34,"nick name", 4545,"title name",67,40);
insert into t_test_filter values(66,55,"test3 qq3 nick3", 22,"title3 name3 ",88,77);
insert into t_test_filter values(767,234,"test3 qq nick", 3433,"title3 name3 ",342342,89340);
insert into t_test_filter values(4545,123,"test qq nick", 1234567,"title name",2324354,9876);
insert into t_test_filter values(4546,778,"test qq nick", 1234567,"title name",2324354,9876);

2. 使用举例

这里用sphinxSE的接口，举2个例子来说明这种方法：

（1）查询词是nick，满足条件 FUserId=34 || Fshopid = 1234567;
通过sphinxSE测试查询
select * from t_test_filter_sphinxSE where query="nick;select=IF( FUserId=34,1,0) + IF(Fshopid=1234567,10,0) as match_qq; filter= match_qq, 1,10,11";

（2）查询词是nick，满足条件 FUserId=34 || (Fshopid = 1234567 && FUserId=778);
select * from t_test_filter_sphinxSE where query="nick;select=IF( FUserId=34,1,0) + IF(Fshopid=1234567,10,0) + IF( FUserId=778,100,0) as match_qq; filter= match_qq, 1,110,111";

为了减少组合导致的过滤条件太多，可以使用诸如下面这种逻辑运算来简化

select * from t_test_filter_sphinxSE where query="nick;select=IF( FUserId=34,1,0) OR ( IF(Fshopid=1234567,1,0) AND IF( FUserId=778,1,0)) as match_qq; filter= match_qq, 1";

这两条语句的作用相同。

其他复杂的查询，可以按照这个原理组合实现。

原理：

1. 函数IF()的说明：它接受3个参数，检查第一个参数是否为0.0，若非零则返回第二个参数，为零时则返回第三个参数；
2. 上面的语法主要是query里面的语句。
(1) 第一个词是nick，表示需要查询到关键词。
(2)select语句，这个等同于api接口中的SetFilter()。这个是从配置文件的sql_attr_xxxx中，过滤返回的数据。select计算的结果，保存在某个变量名(match_qq)中。这里的计算支持+, -, *, /, <, > <=, >=, =, <>等操作符。
(3)filter语句，这个等同于api接口中的SetFilter()。根据select的结果，用filter过滤。不同的查询条件可以得到不同的组合。这样就可以支持非常复杂的查询过滤条件了。

其他接口可以类似，比如c-api和php的接口，可以使用SetSelect()和SetFilter()这两个api实现。

3. 优点和局限

1.在SetSelect()中支持比较负责数学运算和一些逻辑组合，可以灵活实现各种逻辑；

2.业务开发人员和sphinx的运维配置人员，可以分开独立

3.版本限制

(1) sphinxSE的接口，需要在coreseek4.0.1或者sphinx0.9.9rc1以上的版本才支持。
(2) c-api和php的接口，在coreseek3.2.14应该已经支持，他已经提供了SetSelect()和SetFilter()这两个api，但我没有实试过。

三、附录

这里附上sphinx配置文件，以便大家更好的分析和测试

# file: t_test_filter.conf

source t_test_filter
{
type = mysql

        sql_host                                = localhost
        sql_user                                = xxxx
        sql_pass                                = xxx
        sql_db                                  = test
        sql_port                                = 3306

sql_query = SELECT Fcommodity_id, FUserId, FUserName,Fshopid, Ftitle, FProperty,FTime from t_test_filter

        sql_attr_uint                   = FUserId
        sql_attr_bigint                   = Fshopid
        sql_attr_uint                   = FProperty
        sql_attr_uint                   = FTime
}

index t_test_filter
{
        source                  = t_test_filter
        path                    = /usr/local/coreseek/var/data/t_test_filter
        docinfo                 = extern
        mlock                   = 0
        morphology              = none
        exceptions              = /usr/local/coreseek/var/exceptions.txt
        min_word_len            = 1
        charset_type = zh_cn.gbk
        charset_dictpath = /usr/local/coreseek/dict
        ngram_len                               = 1
        ngram_chars                     = U+3000..U+2FA1F
        html_strip                              = 0
        blend_mode              = trim_none
}

#############################################################################
## indexer settings
#############################################################################

indexer
{
        # memory limit, in bytes, kiloytes (16384K) or megabytes (256M)
        # optional, default is 32M, max is 2047M, recommended is 256M to 1024M
        mem_limit                       = 40M

        # maximum IO calls per second (for I/O throttling)
        # optional, default is 0 (unlimited)
        #
        # max_iops                      = 40

        # maximum IO call size, bytes (for I/O throttling)
        # optional, default is 0 (unlimited)
        #
        # max_iosize            = 1048576

        # maximum xmlpipe2 field length, bytes
        # optional, default is 2M
        #
        # max_xmlpipe2_field    = 4M

        # write buffer size, bytes
        # several (currently up to 4) buffers will be allocated
        # write buffers are allocated in addition to mem_limit
        # optional, default is 1M
        #
        write_buffer            = 1M
}

#############################################################################
## searchd settings
#############################################################################

searchd
{
        # hostname, port, or hostname:port, or /unix/socket/path to listen on
        # multi-value, multiple listen points are allowed
        # optional, default is 0.0.0.0:9312 (listen on all interfaces, port 9312)
        #
        # listen                                = 127.0.0.1
        # listen                                = 192.168.0.1:9312
        listen                          = 9412
        # listen                                = /var/run/searchd.sock
        listen = localhost:9406:mysql41

binlog_path = /usr/local/coreseek/var/data

        # log file, searchd run info is logged here
        # optional, default is 'searchd.log'
        log                                     = /usr/local/coreseek/var/log/searchd_9412.log

        # query log file, all search queries are logged here
        # optional, default is empty (do not log queries)
        query_log                       = /usr/local/coreseek/var/log/query_9412.log

        # client read timeout, seconds
        # optional, default is 5
        read_timeout            = 5

        # request timeout, seconds
        # optional, default is 5 minutes
        client_timeout          = 300

        # maximum amount of children to fork (concurrent searches to run)
        # optional, default is 0 (unlimited)
        max_children            = 30

        # PID file, searchd process ID file name
        # mandatory
        pid_file                        = /usr/local/coreseek/var/log/searchd_9412.pid

        # max amount of matches the daemon ever keeps in RAM, per-index
        # WARNING, THERE'S ALSO PER-QUERY LIMIT, SEE SetLimits() API CALL
        # default is 1000 (just like Google)
        max_matches                     = 1000000

        # seamless rotate, prevents rotate stalls if precaching huge datasets
        # optional, default is 1
        seamless_rotate         = 1

        # whether to forcibly preopen all indexes on startup
        # optional, default is 0 (do not preopen)
        preopen_indexes         = 0

        # whether to unlink .old index copies on succesful rotation.
        # optional, default is 1 (do unlink)
        unlink_old                      = 1

        # attribute updates periodic flush timeout, seconds
        # updates will be automatically dumped to disk this frequently
        # optional, default is 0 (disable periodic flush)
        #
        # attr_flush_period     = 900

        # instance-wide ondisk_dict defaults (per-index value take precedence)
        # optional, default is 0 (precache all dictionaries in RAM)
        #
        # ondisk_dict_default   = 1

        # MVA updates pool size
        # shared between all instances of searchd, disables attr flushes!
        # optional, default size is 1M
        mva_updates_pool        = 1M

        # max allowed network packet size
        # limits both query packets from clients, and responses from agents
        # optional, default size is 8M
        max_packet_size         = 8M

        # crash log path
        # searchd will (try to) log crashed query to 'crash_log_path.PID' file
        # optional, default is empty (do not create crash logs)
        #
        # crash_log_path                = /usr/local/coreseek/var/log/crash

        # max allowed per-query filter count
        # optional, default is 256
        max_filters                     = 256

        # max allowed per-filter values count
        # optional, default is 4096
        max_filter_values       = 4096

        # socket listen queue length
        # optional, default is 5
        #
        # listen_backlog                = 5

        # per-keyword read buffer size
        # optional, default is 256K
        #
        # read_buffer                   = 256K

        # unhinted read size (currently used when reading hits)
        # optional, default is 32K
        #
        # read_unhinted         = 32K
}

# --eof--

上一篇：删除一个存在的RabbitMQ队列

下一篇：检查Rabbitmq中队列及消息个数，还有清空的方法

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯