How the FAISS library builds hierarchical indexes: Faiss index principles

Reposted

jack 2024-06-30 22:53:28


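As explained in earlier articles, Faiss operates on indexes. Unlike an index in a traditional database, a Faiss index is a class that holds the vector set together with its training and search methods.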

1. Summary of Index classes

Method	Class name	index_factory	Main parameters	Bytes/vector	Exhaustive	Comments
Exact Search for L2	IndexFlatL2	"Flat"	d	4*d	yes	brute-force
Exact Search for Inner Product	IndexFlatIP	"Flat"	d	4*d	yes	also for cosine (normalize vectors beforehand)
Hierarchical Navigable Small World graph exploration	IndexHNSWFlat	"HNSWx,Flat"	d, M	4*d + 8*M	no
Inverted file with exact post-verification	IndexIVFFlat	"IVFx,Flat"	quantizer, d, nlists, metric	4*d	no	Take another index to assign vectors to inverted lists
Locality-Sensitive Hashing (binary flat index)	IndexLSH	-	d, nbits	nbits/8	yes	optimized by using random rotation instead of random projections
Scalar quantizer (SQ) in flat mode	IndexScalarQuantizer	"SQ8"	d	d	yes	4 bits per component is also implemented, but the impact on accuracy may be unacceptable
Product quantizer (PQ) in flat mode	IndexPQ	"PQx"	d, M, nbits	M (if nbits=8)	yes
IVF and scalar quantizer	IndexIVFScalarQuantizer	"IVFx,SQ4" "IVFx,SQ8"	quantizer, d, nlists, qtype	SQfp16: 2 * d, SQ8: d or SQ4: d/2	no	there are 2 encodings: 4 bit per dimension and 8 bit per dimension
IVFADC (coarse quantizer+PQ on residuals)	IndexIVFPQ	"IVFx,PQy"	quantizer, d, nlists, M, nbits	M+4 or M+8	no	the memory cost depends on the data type used to represent ids (int or long), currently supports only nbits <= 8
IVFADC+R (same as IVFADC with re-ranking based on codes)	IndexIVFPQR	"IVFx,PQy+z"	quantizer, d, nlists, M, nbits, M_refine, nbits_refine	M+M_refine+4 or M+M_refine+8	no

1.1 Basic index methods

Flat Index (IndexFlat*)

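A flat index simply encodes each vector into a fixed-size code and stores the codes in an array of ntotal * code_size bytes. The vectors are not compressed or otherwise reduced.

At search time, the query vector is compared against every indexed vector in turn.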
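To make the exhaustive scan concrete, here is a minimal pure-Python sketch of what a flat L2 search does conceptually. It is not Faiss's implementation; the function names (`sq_l2`, `flat_search`, `flat_index_bytes`) are made up for illustration, and the memory estimate simply applies the 4*d bytes per vector figure from the table above.

```python
def sq_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(database, query, k):
    """Compare the query with every stored vector and keep the k closest.

    Returns a list of (squared_distance, position) pairs, closest first.
    """
    scored = sorted((sq_l2(vec, query), pos) for pos, vec in enumerate(database))
    return scored[:k]


def flat_index_bytes(ntotal, d):
    """Storage of a flat float32 index: 4 bytes per component."""
    return ntotal * 4 * d


database = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
query = [0.9, 1.2]
top2 = flat_search(database, query, k=2)
print([pos for _, pos in top2])          # positions of the two nearest vectors
print(flat_index_bytes(1_000_000, 128))  # bytes for one million 128-D vectors
```

The cost of each query grows linearly with the number of stored vectors, which is why the other index types in this section trade exactness for speed.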

Cell-probe methods (IndexIVF* indexes)

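IVF indexes speed up search with partitioning techniques (k-means is the most typical), at the cost that the results are not guaranteed to be the true nearest neighbors. This approach is sometimes called the cell-probe method.

Faiss uses a partition-based multi-probe method:

The feature space is partitioned into nlist cells;
A hashing function (for k-means, assignment to the nearest centroid) assigns each database vector to one of these cells, and the vectors are stored in an inverted file structure made of nlist inverted lists;
At query time, nprobe inverted lists are selected;
The query vector is compared with the vectors stored in these nprobe lists.

With this method the query is compared with only a fraction of the database, roughly nprobe / nlist (a first approximation, since the inverted lists do not all have the same length). Because nearby vectors tend to fall into the same or neighboring cells, the search still returns good approximate neighbors despite the small number of comparisons.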
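The steps above can be sketched in a few lines of pure Python. This is a toy illustration under simplifying assumptions, not how Faiss implements IndexIVF: the centroids are picked by hand instead of being trained with k-means, and all function names are hypothetical.

```python
def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cells(centroids, vec, n=1):
    """Ids of the n centroids closest to vec, closest first."""
    order = sorted(range(len(centroids)), key=lambda c: sq_l2(centroids[c], vec))
    return order[:n]


def build_inverted_lists(centroids, database):
    """Assign every database vector to the list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for pos, vec in enumerate(database):
        lists[nearest_cells(centroids, vec)[0]].append(pos)
    return lists


def ivf_search(centroids, lists, database, query, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    candidates = [pos for c in nearest_cells(centroids, query, nprobe) for pos in lists[c]]
    return sorted(candidates, key=lambda pos: sq_l2(database[pos], query))[:k]


centroids = [[0.0, 0.0], [10.0, 10.0]]           # nlist = 2, chosen by hand
database = [[0.0, 1.0], [1.0, 0.0], [9.0, 9.0], [4.9, 4.9]]
lists = build_inverted_lists(centroids, database)
query = [5.2, 5.2]

print(ivf_search(centroids, lists, database, query, k=1, nprobe=1))
print(ivf_search(centroids, lists, database, query, k=1, nprobe=2))
```

In this example the true nearest neighbor sits just across the cell boundary, so probing one list misses it and probing both lists finds it. That is the accuracy versus speed trade-off controlled by nprobe.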

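IndexHNSW and its variants

HNSW stands for Hierarchical Navigable Small World. This kind of index builds a graph over the indexed data; at search time the graph is explored in a way that converges to the nearest neighbors as quickly as possible. IndexHNSW uses a flat index as its underlying storage so that database vectors can be accessed quickly.

IndexHNSW parameters:

M: the number of neighbors per node in the graph; a larger M gives more accurate results but uses more memory;
efConstruction: the depth of exploration at add time;
efSearch: the depth of exploration at search time.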
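The role of efSearch is easier to see on a toy graph. The sketch below is a best-first search over a single-layer neighbor graph that keeps at most `ef` results; it is a simplification for illustration (real HNSW uses several layers and builds the graph itself), and the graph, the function name `graph_search` and the 1-D data are all assumptions.

```python
import heapq


def graph_search(graph, dist_to_query, entry, ef):
    """Best-first exploration of a neighbor graph, keeping the ef closest nodes.

    graph maps a node id to the ids of its neighbors;
    dist_to_query maps a node id to its distance from the query.
    """
    d0 = dist_to_query(entry)
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap (negated): worst kept result first
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                # nothing left that can improve the results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist_to_query(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return [node for _, node in sorted((-nd, n) for nd, n in results)]


# Ten points on a line at positions 0..9, each linked to its two neighbors.
positions = [float(i) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
query = 7.2

print(graph_search(graph, lambda i: abs(positions[i] - query), entry=0, ef=1))
print(graph_search(graph, lambda i: abs(positions[i] - query), entry=0, ef=3))
```

A larger `ef` keeps more candidates alive during the walk, which costs more distance computations but makes it less likely that the search stops in a poor local neighborhood.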

IndexLSH

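The most popular cell-probe method is probably Locality-Sensitive Hashing (LSH). This family of algorithms has two drawbacks:

A large number of hash functions is needed to reach acceptable results, which means a lot of extra memory;
The hash functions are not adapted to the input data.

In Faiss, IndexLSH is a flat index that stores binary codes. Database vectors and query vectors are hashed into binary codes, and these codes are compared with the Hamming distance.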
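As a rough illustration of hashing vectors into binary codes, the sketch below uses the generic sign-of-random-projection scheme. Note that this is an assumption made for the example: according to the table above, Faiss's IndexLSH uses a random rotation rather than plain random projections, and none of the function names below come from Faiss.

```python
import random


def make_hyperplanes(d, nbits, seed=0):
    """Draw nbits random directions in d dimensions (one per output bit)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(nbits)]


def lsh_code(vec, planes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    code = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


planes = make_hyperplanes(d=4, nbits=16)
v = [0.3, -1.2, 0.5, 2.0]
same_direction = [2 * x for x in v]
opposite = [-x for x in v]

print(hamming(lsh_code(v, planes), lsh_code(same_direction, planes)))
print(hamming(lsh_code(v, planes), lsh_code(opposite, planes)))
```

Vectors pointing in the same direction get identical codes, while opposite vectors differ in every bit, so the Hamming distance between codes acts as a cheap proxy for the angle between the original vectors.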

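1.2 Binary indexes

IndexBinary is the binary index family. Every dimension of the vectors it stores takes only the values 0 or 1, and a search computes the Hamming distance between the query vector and the database vectors.

Vectors are stored byte-packed in memory, one bit per dimension, so each vector occupies dimension / 8 bytes. Only dimensions that are a multiple of 8 are supported; otherwise the vectors must be padded or truncated.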
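A small sketch of the byte packing and of the Hamming distance on packed vectors may help here. The bit order within a byte (most significant bit first) is an assumption of this example, and `pack_bits` and `hamming_bytes` are illustrative helpers, not Faiss functions.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 dimensions per byte."""
    if len(bits) % 8 != 0:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def hamming_bytes(a, b):
    """Hamming distance between two packed vectors of equal length."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


a = pack_bits([1, 0, 0, 0, 0, 0, 0, 1] * 32)   # 256 dimensions
b = pack_bits([1, 0, 0, 0, 0, 0, 0, 0] * 32)   # differs in one bit per byte

print(len(a))               # bytes used by one 256-dimensional vector
print(hamming_bytes(a, b))  # number of differing dimensions
```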

IndexBinaryFlat

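Exhaustive search: every vector in the index is compared at query time. This index is specially optimized for 256-dimensional vectors.

Example (Python)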

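import faiss

# Vector dimension, in bits
d = 256

# Vectors to index (the database): a uint8 array of shape (n, d // 8)
db = ...

# Query vectors to look up in the index, packed the same way
queries = ...

# Initialize the index
index = faiss.IndexBinaryFlat(d)

# Add the database vectors to the index
index.add(db)

# Number of nearest neighbors to retrieve per query vector
k = ...

# Search the index. D holds the k smallest Hamming distances for each query;
# I holds the positions in db of the corresponding vectors
D, I = index.search(queries, k)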

IndexBinaryIVF

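This index clusters the vectors to speed up search. Cluster-based search requires a training step first, which amounts to unsupervised learning.

Example (Python)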

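import faiss

# Vector dimension, in bits
d = 256

# Vectors to index (the database): a uint8 array of shape (n, d // 8)
db = ...

# Vectors used for training
training = ...

# Query vectors to look up in the index
queries = ...

# Initialize the quantizer
quantizer = faiss.IndexBinaryFlat(d)

# Number of clusters to partition the vector set into
nlist = ...

# Initialize the index
index = faiss.IndexBinaryIVF(quantizer, d, nlist)
# Number of clusters visited per query
index.nprobe = 4

# Train
index.train(training)

# Add the database vectors to the index
index.add(db)

# Number of nearest neighbors to retrieve per query vector
k = ...

# Search the index. D holds the k smallest Hamming distances for each query;
# I holds the positions in db of the corresponding vectors
D, I = index.search(queries, k)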

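During training, the quantizer learns to cluster the data into nlist clusters. Vectors are assigned to clusters by distance.

IndexBinaryIVF searches differently from IndexBinaryFlat: it is cluster-based, so only the nprobe selected clusters are scanned instead of the whole database.

1.3 Composite indexes

A composite index combines several of the index methods described above to improve search performance.

IndexPQ example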

import faiss

# Assumes d, k, x_train, x_base and x_query are already defined
m = 16                                   # number of subquantizers
n_bits = 8                               # bits allocated per subquantizer
pq = faiss.IndexPQ(d, m, n_bits)         # create the index
pq.train(x_train)                        # training
pq.add(x_base)                           # populate the index
D, I = pq.search(x_query, k)             # perform a search
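What the product quantizer stores for each vector can be sketched with a toy example. Everything here is simplified for illustration: the codebooks are written by hand rather than trained, there are only 2 sub-quantizers with 2 centroids each, and the helper names are hypothetical rather than Faiss API.

```python
def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Split vec into len(codebooks) segments; store the nearest centroid id of each."""
    sub = len(vec) // len(codebooks)
    code = []
    for j, centroids in enumerate(codebooks):
        part = vec[j * sub:(j + 1) * sub]
        code.append(min(range(len(centroids)), key=lambda c: sq_l2(part, centroids[c])))
    return code


def pq_decode(code, codebooks):
    """Approximate reconstruction: concatenate the selected centroids."""
    return [x for j, c in enumerate(code) for x in codebooks[j][c]]


def adc_distance(query, code, codebooks):
    """Asymmetric distance: uncompressed query against a compressed database vector."""
    sub = len(query) // len(codebooks)
    return sum(
        sq_l2(query[j * sub:(j + 1) * sub], codebooks[j][c]) for j, c in enumerate(code)
    )


codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # centroids for dimensions 0..1
    [[0.0, 0.0], [2.0, 2.0]],   # centroids for dimensions 2..3
]
code = pq_encode([0.9, 1.2, 0.1, 0.2], codebooks)
print(code)
print(pq_decode(code, codebooks))
print(adc_distance([1.0, 1.0, 0.0, 0.0], code, codebooks))
```

With m = 16 and n_bits = 8 as in the example above, each vector is reduced to 16 one-byte centroid ids, which matches the M bytes per vector listed in the summary table.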

IndexIVFPQ example

# Assumes d, ncentroids and code_size are already defined
coarse_quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(coarse_quantizer, d,
                         ncentroids, code_size, 8)
index.nprobe = 5

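PQ index as coarse quantizer

A product quantizer (PQ) can also serve as the coarse quantizer. This corresponds to the multi-index [The inverted multi-index, Babenko & Lempitsky, CVPR'12]. For a PQ with m segments, each encoded with c centroids, the number of inverted lists is c ^ m. In practice, therefore, m = 2 is the only workable choice.

In Faiss, the corresponding coarse quantizer index is MultiIndexQuantizer. No vectors are added to this index, so a specific flag (quantizer_trains_alone) must be set on the IndexIVF.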

nbits_mi = 12  # bits per segment, so c = 2 ** 12 centroids per segment
M_mi = 2       # m
coarse_quantizer_mi = faiss.MultiIndexQuantizer(d, M_mi, nbits_mi)
ncentroids_mi = 2 ** (M_mi * nbits_mi)   # c ** m inverted lists

index = faiss.IndexIVFFlat(coarse_quantizer_mi, d, ncentroids_mi)
index.nprobe = 2048
index.quantizer_trains_alone = True
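A quick calculation shows why m = 2 is the practical limit. The helper below only evaluates the c ^ m formula from the text, assuming c = 2 ** nbits centroids per segment; the function name is made up for this example.

```python
def multi_index_cells(m, nbits):
    """Number of inverted lists for a multi-index with m segments of nbits bits each."""
    centroids_per_segment = 2 ** nbits   # c
    return centroids_per_segment ** m    # c ** m


cells = multi_index_cells(m=2, nbits=12)
print(cells)                              # inverted lists for the settings above
print(multi_index_cells(m=3, nbits=12))   # one more segment
print(2048 / cells)                       # fraction of lists visited with nprobe = 2048
```

With the settings above there are already about 16.8 million inverted lists, so even nprobe = 2048 visits only a small fraction of them, and a third segment would multiply the number of lists by another 4096.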

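Pre-filtered PQ index

Comparing codes with the Hamming distance is about 6 times faster than comparing them with a product quantizer. Moreover, with a proper reordering of the quantization centroids, the Hamming distance between PQ codes becomes correlated with the true distance. Applying a threshold on the Hamming distance therefore avoids most of the expensive PQ code comparisons.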

#
# For an IndexPQ:
#
index = faiss.IndexPQ(d, 16, 8)
# before training
index.do_polysemous_training = True
index.train(...)
# before searching
index.search_type = faiss.IndexPQ.ST_polysemous
index.polysemous_ht = 54    # the Hamming threshold
index.search(...)

#
# For an IndexIVFPQ (ncentroids, the number of inverted lists, is assumed to be defined):
#
index = faiss.IndexIVFPQ(coarse_quantizer, d, ncentroids, 16, 8)
# before training
index.do_polysemous_training = True
index.train(...)

# before searching
index.polysemous_ht = 54    # the Hamming threshold
index.search(...)

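Guidelines for setting the threshold:

The threshold must lie between 0 and the number of bits per code (128 = 16 * 8 in this case), and the Hamming distances between codes follow a binomial distribution;
The threshold should be set below half the number of bits in the code.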
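To get a feel for these guidelines, one can compute how many codes would pass a given threshold under an idealized model. The sketch below assumes each bit of two unrelated codes differs independently with probability 1/2, so their Hamming distance is Binomial(nbits, 0.5); real polysemous codes are not perfectly random, so treat the numbers as a rough guide only. The function name is hypothetical.

```python
from math import comb


def fraction_below_threshold(nbits, threshold):
    """P(Hamming distance <= threshold) for two independent uniformly random codes."""
    return sum(comb(nbits, k) for k in range(threshold + 1)) / 2 ** nbits


nbits = 16 * 8   # 128 bits per code, as in the example above
for ht in (44, 54, 64):
    print(ht, fraction_below_threshold(nbits, ht))
```

Under this model a threshold at half the code length lets through slightly more than half of all unrelated codes, while a value such as 54 lets through only a small fraction, which is why thresholds below nbits / 2 are the useful range.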

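2. How to choose an index

Faiss offers many kinds of indexes. All of them can do the job, but their performance differs widely from one application to another. The following factors help pick the best index for a given scenario:

Number of searches

If only a small number of searches will be run (say 1,000 to 10,000), the time spent building an index is not amortized by the time saved searching, so direct computation (a "Flat" index) is the better option.

Result accuracy

If exact results are required, choose an index with the "Flat" keyword. Only IndexFlatL2 and IndexFlatIP guarantee exact results. They also provide the baseline against which the results of other indexes are measured.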

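Memory sensitivity

If memory is not a concern, choose "HNSWx", where x (4 to 64) is the number of links per vector; a larger x gives more accurate results and uses more memory;
If memory is a minor concern, choose "...,Flat", where "..." means the data is clustered first;
If memory is a significant concern, choose "PCARx,...,SQ8";
If memory is very tight, choose "OPQx_y,...,PQx".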

Dataset size

Below 1M vectors: choose "...,IVFx,..."
1M to 10M: choose "...,IVF65536_HNSW32,..."
10M to 100M: choose "...,IVF262144_HNSW32,..."
100M to 1B: choose "...,IVF1048576_HNSW32,..."
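The size ranges above can be written down as a small lookup. This is only a restatement of the list in code form: the function name is hypothetical, how exact boundary values are assigned is an assumption, and the returned string is just the IVF component that still has to be combined with the pre-processing and encoding parts of an index_factory string.

```python
def suggest_ivf_component(n_vectors):
    """Return the IVF part of a factory string for a dataset of n_vectors vectors."""
    if n_vectors < 1_000_000:
        return "IVFx"   # x is left open, as in the list above
    if n_vectors < 10_000_000:
        return "IVF65536_HNSW32"
    if n_vectors < 100_000_000:
        return "IVF262144_HNSW32"
    if n_vectors <= 1_000_000_000:
        return "IVF1048576_HNSW32"
    raise ValueError("sizes above 1B are not covered by the list above")


for n in (200_000, 5_000_000, 50_000_000, 500_000_000):
    print(n, suggest_ivf_component(n))
```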

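3. Vector pre- and post-processing

Pre- and post-processing in Faiss covers remapping vector IDs, applying transformations to the data, and re-ranking search results with a better index.

3.1 Vector ID mapping

By default, Faiss adds vectors to an index in order and assigns them sequential IDs. Some Index classes implement an add_with_ids method, which accepts 64-bit vector IDs alongside the vectors. At search time, such a class returns the stored IDs rather than the initial vectors.

IDs can be mapped with the IndexIDMap class, which wraps another index and translates IDs when adding and searching. It maintains a table holding the mapping.

Example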

import faiss
import numpy as np

# Assumes xb is a float32 array of shape (n, d)
index = faiss.IndexFlatL2(xb.shape[1])
ids = np.arange(xb.shape[0], dtype='int64')
# index.add_with_ids(xb, ids)  # this would crash, because IndexFlatL2 does not support add_with_ids
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids)   # works, the vectors are stored in the underlying index
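The idea of wrapping an index and translating IDs through a table can be sketched without Faiss. The two classes below are toy stand-ins written for this illustration (they are not the Faiss classes and do not reproduce their API beyond the method names used in the text).

```python
class ToyFlatIndex:
    """Stores vectors in insertion order and returns their positions."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        self.vectors.extend(vectors)

    def search(self, query, k):
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, query))
        order = sorted(range(len(self.vectors)), key=lambda i: dist(self.vectors[i]))
        return [(dist(self.vectors[i]), i) for i in order[:k]]


class ToyIDMap:
    """Wraps an index and translates positions into caller-supplied IDs."""

    def __init__(self, index):
        self.index = index
        self.id_table = []   # id_table[position] -> external ID

    def add_with_ids(self, vectors, ids):
        self.index.add(vectors)
        self.id_table.extend(ids)

    def search(self, query, k):
        return [(d, self.id_table[pos]) for d, pos in self.index.search(query, k)]


wrapped = ToyIDMap(ToyFlatIndex())
wrapped.add_with_ids([[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]], ids=[1001, 1002, 1003])
print(wrapped.search([0.9, 0.9], k=2))
```

The vectors live in the wrapped index, and the wrapper only keeps the position to ID table, which mirrors the description of IndexIDMap above.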

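3.2 Pre-transforming the data

For reasons such as a very large input volume, the data often needs to be transformed (compressed, reduced in dimension, and so on) before it is indexed. All pre-transforms inherit from the VectorTransform class, which takes input vectors of dimension d_in and outputs vectors of dimension d_out.

The available pre-transforms are listed in the table below.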

Transformation	Class name	Comments
random rotation	RandomRotationMatrix	useful to re-balance components of a vector before indexing in an IndexPQ or IndexLSH
remapping of dimensions	RemapDimensionsTransform	to reduce or increase the size of a vector because the index has a preferred dimension, or to apply a random permutation on dimensions.
PCA	PCAMatrix	for dimensionality reduction
OPQ rotation	OPQMatrix	OPQ applies a rotation to the input vectors to make them more amenable to PQ coding. See Optimized product quantization, Ge et al., CVPR'13 for more details.

PCAMatrix: dimensionality reduction with PCA

Reduce 2048-dimensional vectors to 16-byte codes

# the IndexIVFPQ will be in 256D, not 2048D (ncoarse is assumed to be defined)
coarse_quantizer = faiss.IndexFlatL2(256)
sub_index = faiss.IndexIVFPQ(coarse_quantizer, 256, ncoarse, 16, 8)
# PCA 2048 -> 256
# also does a random rotation after the reduction (the 4th argument)
pca_matrix = faiss.PCAMatrix(2048, 256, 0, True)

# the wrapping index
index = faiss.IndexPreTransform(pca_matrix, sub_index)

# will also train the PCA
index.train(...)
# PCA will be applied prior to addition
index.add(...)

RemapDimensionsTransform: increasing the dimension

Vectors can be given extra dimensions by padding them with zeros

# input is in dimension d, but we want a multiple of M
d2 = (d + M - 1) // M * M
remapper = faiss.RemapDimensionsTransform(d, d2, True)
# the index in d2 dimensions
index_pq = faiss.IndexPQ(d2, M, 8)

# the index that will be used for add and search
index = faiss.IndexPreTransform(remapper, index_pq)
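The rounding formula and the zero padding can be checked in isolation. The helpers below are a plain-Python sketch of the arithmetic only, with made-up names; they are not what RemapDimensionsTransform executes internally.

```python
def padded_dim(d, M):
    """Smallest multiple of M that is greater than or equal to d."""
    return (d + M - 1) // M * M


def pad_with_zeros(vec, d2):
    """Append zeros so that the vector has d2 components."""
    return list(vec) + [0.0] * (d2 - len(vec))


d, M = 100, 8
d2 = padded_dim(d, M)
print(d2)                                      # dimension after padding
print(len(pad_with_zeros([1.0] * d, d2)))      # padded vector length
print(padded_dim(128, 8))                      # already a multiple of M
```

Padding with zeros leaves L2 distances between vectors unchanged, so it only serves to make the dimension divisible by the number of PQ segments.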

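IndexRefineFlat: re-ranking search results

When searching, it can be useful to re-rank the results using exact distance computations. The following example searches with an IndexPQ, then re-ranks the first results by computing their exact distances: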

q = faiss.IndexPQ(d, M, nbits_per_index)
rq = faiss.IndexRefineFlat(q)
rq.train(xt)
rq.add(xb)
rq.k_factor = 4
D, I = rq.search(xq, 10)
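The two-stage idea behind re-ranking can be sketched as follows. This is a conceptual illustration under stated assumptions, not the IndexRefineFlat implementation: the "approximate" distance here simply ignores the second coordinate, and the function names are invented for the example. The shortlist size k * k_factor follows the role that k_factor plays in the snippet above.

```python
def exact_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse_dist(a, b):
    """Stand-in for a lossy index: only the first coordinate is compared."""
    return (a[0] - b[0]) ** 2


def refine_search(database, query, k, k_factor):
    """Shortlist k * k_factor candidates with the coarse distance, then re-rank exactly."""
    shortlist = sorted(
        range(len(database)), key=lambda i: coarse_dist(database[i], query)
    )[:k * k_factor]
    return sorted(shortlist, key=lambda i: exact_dist(database[i], query))[:k]


database = [(1.0, 5.0), (1.1, 0.0), (1.2, 0.1), (3.0, 0.0)]
query = (1.0, 0.0)

print(refine_search(database, query, k=1, k_factor=1))   # coarse ranking only
print(refine_search(database, query, k=1, k_factor=2))   # re-ranked shortlist
```

With k_factor = 1 the coarse ranking is returned as is and picks a vector that is actually far away. A larger shortlist gives the exact distance a chance to correct it, at the cost of a few more full-precision comparisons.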
