
1 Introduction

With the rapid development of computer technology and Internet applications, data in many areas are growing exponentially, so the demand for large-scale and scalable storage and computation is becoming urgent. More and more enterprises and individuals outsource their storage and computation to the cloud in order to access data anytime and anywhere and to save hardware and software costs [1]. However, while enjoying the benefits of cloud computing, users face the risk that sensitive outsourced data may be leaked or abused, because cloud service providers can access the data without authorization. Therefore, data owners usually encrypt data before outsourcing [2]. Although encryption preserves the security of data, it also harms data availability. To address this, searchable encryption (SE) schemes [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] that guarantee both the security and the availability of data have been proposed.

At present, most solutions are based on the vector space model (VSM) and the TF-IDF model, which map the keywords of documents to “points” in a multi-dimensional space and measure the relevance scores between documents and search keywords. The top-k documents are determined by comparing the relevance scores. However, if a scheme calculates the relevance score between every document and the search keywords, it costs a large amount of time and computing resources. To improve search efficiency, researchers have proposed a variety of schemes. Song et al. [3] proposed the first SE scheme, in which users need to traverse entire documents while searching, so the search time is proportional to the size of the data set. Goh et al. [4] proposed a search scheme based on the Bloom filter. Curtmola et al. [5] proposed an efficient SE scheme based on the inverted index, but this scheme may expose keyword privacy.

Cao et al. [6] proposed a new structure to support multi-keyword search, but its search time increases exponentially as the document set grows. Xia et al. [8] proposed a secure and dynamic multi-keyword ranked search scheme that reduces the large number of inner product calculations with a pruning function. Jiang et al. [9] proposed a secure ciphertext search scheme based on the inverted index, which avoids calculating the relevance scores of irrelevant documents. Chen et al. [10] proposed a method based on data mining, whose search time grows linearly while the document set grows exponentially. Our previous work [11] also proposed a hierarchical agglomerative clustering tree index scheme, which supports effective and verifiable ranked search.

However, the above schemes need to load the complete index into memory at one time to perform a search. Because the index size is proportional to the number of documents, memory overflows once the document collection grows beyond a certain scale. Moreover, the indexes of those schemes must be kept intact and cannot be segmented, which also limits their scalability. To overcome these limitations, we propose a parallel privacy-preserving top-k search (PPTS) scheme that can meet the requirements of large-scale data. First, we propose a fragment-based encrypted inverted index model. In the data preprocessing and outsourcing phase, indistinguishable fragment-based encrypted inverted indexes are constructed and outsourced to the cloud together with the encrypted documents. In the search phase, the Map-Reduce-based distributed computing framework is adopted and parallel multi-keyword top-k search algorithms are proposed. Finally, we analyze the security of PPTS and perform experiments to evaluate its efficiency.

The contributions of this paper are: (1) We present a fragment-based encrypted inverted index model that is made indistinguishable by adding random paddings. (2) By adopting the Map-Reduce-based distributed computing framework, we propose parallel multi-keyword top-k search algorithms. (3) We analyze the security and evaluate the search performance. The results show that the proposed scheme realizes parallel search while preserving data privacy.

2 Models and Problem Formulation

2.1 Notations and Preliminaries

  • D: The document set, \(D=\{d_1, d_2,...,d_n\}\). \(\widetilde{D}\) is the encrypted form.

  • n: The number of documents in D.

  • W: The dictionary, namely, the set of keywords, denoted as \(W=\{w_1, w_2,..., w_m\}\).

  • m: The number of keywords in W.

  • Q: The query consisting of a set of the search keywords, \(Q = \{w_1, w_2,..., w_q\}\).

  • \(V_{d_i}\): The m-dimensional document vector of \(d_i\). \(\widetilde{V}_{d_i}\) is the encrypted form.

  • V: The document vector set, \(V = \{V_{d_1}, V_{d_2},..., V_{d_n}\}\). \(\widetilde{V}\) is the encrypted form.

  • \(V_q\): The m-dimensional query vector for Q. \(\widetilde{V}_q\) is the encrypted form.

  • TD: The trapdoor for the search request.

  • RS: The result of the search.

  • \( P_{i,j}\): The posting corresponding to the document \(d_j\) containing keyword \(w_i\), \(P_{i,j}={<}id(d_j), V_{d_j}{>}\). \(\widetilde{P}_{i,j}\) is the encrypted form.

  • \(PL_i\): The posting list of keyword \(w_i\), \(PL_i=\{P_{i,1}, P_{i,2}, ..., P_{i,\delta }\}\). \(\widetilde{PL}_i\) is the encrypted form.

  • \(\delta \): The number of postings in \(PL_i\).

  • \(\varepsilon \): The fragmentation parameter.

  • F: The set of fragments of D generated according to \(\varepsilon \), \(F=\{F_1, F_2,..., F_t\}\).

  • t: The number of fragments of F.

  • \(\beta \): The number of posting lists in the index of fragment \(F_i\).

Vector Space Model (VSM) and TF-IDF Model. The VSM and TF-IDF model are widely used in multi-keyword privacy-preserving top-k search [6,7,8,9,10,11,12,13]. The term frequency (TF) is the number of times a given keyword appears in a document, while the inverse document frequency (IDF) is the total number of documents in the set divided by the number of documents containing the keyword. VSM is used to convert a given document \(d_i\) and the search keywords Q into vectors \(V_{d_i}\) and \(V_q\). The calculation of these vectors can be found in [6,7,8,9,10,11,12,13].
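
To make the vector construction concrete, the following is a minimal Java sketch of building one document vector under the above definitions; the dictionary layout, tokenization, and the plain n/df ratio for IDF (logarithmic variants are also common) are simplifying assumptions rather than the exact weighting of [6,7,8,9,10,11,12,13].

```java
import java.util.*;

// Minimal TF-IDF sketch: the dictionary maps each keyword w_j to a dimension j,
// df[j] is the number of documents containing w_j, and n is the total number
// of documents. IDF here follows the plain ratio n/df[j] described above.
public final class TfIdfSketch {
    public static double[] documentVector(List<String> docTerms,
                                          Map<String, Integer> dictionary,
                                          int[] df, int n) {
        double[] v = new double[dictionary.size()];
        Map<String, Integer> tf = new HashMap<>();
        for (String term : docTerms) {                 // term frequency within this document
            tf.merge(term, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer j = dictionary.get(e.getKey());
            if (j == null) continue;                   // term not in the dictionary W
            double idf = (double) n / df[j];           // IDF = n / (#docs containing w_j)
            v[j] = e.getValue() * idf;                 // TF-IDF weight on dimension j
        }
        return v;
    }
}
```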

Secure Inner Product Operation. This scheme uses the secure inner product operation to calculate the inner product of two encrypted vectors without knowing their plaintext values. The basic idea is as follows. Assume that p and q are two n-dimensional row vectors and M is a random \(n\times n\) invertible matrix, which is treated as the secret key. The encrypted forms of p and q are denoted as \(\widetilde{p}\) and \(\widetilde{q}\) respectively, where \(\widetilde{p}=pM^{-1} \) and \(\widetilde{q}=qM^T\). Then we have \(\widetilde{p}\cdot \widetilde{q} = (pM^{-1})\cdot (qM^T)=pM^{-1}(qM^T)^T = pM^{-1}Mq^T = pq^T = p\cdot q\), i.e. \(\widetilde{p}\cdot \widetilde{q}=p\cdot q\). Therefore, the inner product of two encrypted vectors equals the inner product of the corresponding plaintext vectors.
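
A toy numeric check of this property is sketched below; the 2-dimensional vectors and the hand-picked invertible matrix are illustrative assumptions only, whereas the scheme uses random \(m\times m\) matrices.

```java
// Toy check of the secure inner product with a hand-picked 2x2 invertible
// matrix M (det = 1): p~ = p * M^{-1}, q~ = q * M^T, and p~ . q~ == p . q.
public final class SecureInnerProductDemo {
    static double[] rowTimesMatrix(double[] v, double[][] a) {  // row vector times matrix
        double[] r = new double[a[0].length];
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < v.length; i++)
                r[j] += v[i] * a[i][j];
        return r;
    }
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }
    public static void main(String[] args) {
        double[][] m    = {{1, 2}, {0, 1}};   // secret key M
        double[][] mInv = {{1, -2}, {0, 1}};  // M^{-1}
        double[][] mT   = {{1, 0}, {2, 1}};   // M^T
        double[] p = {3, 5}, q = {2, 7};
        double[] pEnc = rowTimesMatrix(p, mInv);   // p~ = p M^{-1}
        double[] qEnc = rowTimesMatrix(q, mT);     // q~ = q M^T
        System.out.println(dot(p, q) + " == " + dot(pEnc, qEnc));  // prints 41.0 == 41.0
    }
}
```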

Inverted Index. An inverted index can quickly find the documents containing a given keyword through a single lookup, which improves search efficiency. It consists of a dictionary and posting lists. The dictionary is the collection of all keywords that appear in D. Each index entry records a keyword and a pointer to its posting list. The posting list records all documents that contain the keyword, and each record in it is a posting describing the information of one document.
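
In code, such a (plaintext) index is essentially a map from keywords to posting lists; the class and field names below are illustrative assumptions that match the notation of Sect. 2.1.

```java
import java.util.*;

// Minimal plaintext inverted index: each keyword w_i maps to its posting list
// PL_i, and each posting P_{i,j} records a document id and its document vector.
final class Posting {
    final String docId;
    final double[] docVector;
    Posting(String docId, double[] docVector) { this.docId = docId; this.docVector = docVector; }
}

final class InvertedIndex {
    private final Map<String, List<Posting>> postingLists = new HashMap<>();

    void add(String keyword, Posting p) {
        postingLists.computeIfAbsent(keyword, k -> new ArrayList<>()).add(p);
    }

    // All documents containing the keyword are found with a single lookup.
    List<Posting> postingList(String keyword) {
        return postingLists.getOrDefault(keyword, Collections.emptyList());
    }
}
```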

2.2 System Model

The system model is shown in Fig. 1, which is the same as in [6,7,8,9,10,11,14,15,16]. It includes three entities. Data owners (DO) are responsible for constructing the fragment-based encrypted inverted indexes (\(\widetilde{I}\)) and outsourcing the encrypted indexes and documents to the cloud server (CS). CS provides the search service in parallel according to the search requests submitted by data users (DU). DU construct a search trapdoor based on their needs, send it to CS, and wait for CS to return the search results.

Fig. 1. The system model

2.3 Problem Description

We adopt the “Honest-but-Curious” threat model. In this model, CS honestly and correctly executes instructions in the designated protocol. However, CS can analyze stored data and try to snoop on sensitive information.

The search result of PPTS is represented as RS. \(V_q\) is the query vector of Q. \(V_{d_i}\) and \(V_{d_j}\) represent the document vectors of \(d_i\) and \(d_j\), respectively. Then RS meets the requirement:

$$|RS|=k \wedge \forall d_i,d_j(d_i \in RS \wedge d_j \in (D - RS)) \rightarrow V_{d_i} \cdot V_q > V_{d_j} \cdot V_q.$$

PPTS should satisfy three goals. First, the contents directly visible to CS include only encrypted documents, indexes, and trapdoors; that is, the confidentiality of documents, indexes, and trapdoors must not be compromised. Second, PPTS should handle search over large document sets in parallel with the Map-Reduce search framework. Third, PPTS should fully guarantee the accuracy of search, that is, improve efficiency without reducing accuracy.

2.4 Search Framework

To clearly describe the scheme proposed in this paper, we define a framework for the PPTS scheme. As shown in Fig. 2, the search model is composed of five modules: GenKey, Setup, BuildIndex, GenTrapdoor, and Search.

  • GenKey: DO generate the key for encryption and share it with DU.

  • Setup: DO preprocess the document set D, generate a document vector for each document, and encrypt D.

  • BuildIndex: DO fragment D and then construct indistinguishable inverted indexes, which are provided to CS to perform the search service.

  • GenTrapdoor: DU generate a trapdoor based on the search keywords.

  • Search: CS performs the top-k search in parallel according to TD and \(\widetilde{I}\), and returns the qualifying RS to DU.

Fig. 2. PPTS search framework

3 Parallel Privacy-Preserving Top-k Search Scheme

3.1 Fragment-Based Encrypted Inverted Indexes Model

Definition 1

Document Fragmentation. The document set D is divided into fragments of equal length according to the parameter \(\varepsilon \). The generated fragments are denoted as \(F=\{F_1, F_2,..., F_t\}\), satisfying Formulas 1 and 2.

$$\begin{aligned} |F_1| = |F_2| = ... = |F_{t-1}| = \varepsilon , 1 \le |F_t| \le \varepsilon \end{aligned}$$
(1)
$$\begin{aligned} D = F_1 \cup F_2 \cup ... \cup F_t \end{aligned}$$
(2)

where |X| represents the number of elements contained in the list or set X.
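
A minimal sketch of this fragmentation, assuming the document set is given as a list:

```java
import java.util.*;

// Definition 1: D is split into fragments of epsilon documents each;
// only the last fragment F_t may be shorter (Formulas 1 and 2).
final class Fragmenter {
    static <T> List<List<T>> fragment(List<T> documents, int epsilon) {
        List<List<T>> fragments = new ArrayList<>();
        for (int from = 0; from < documents.size(); from += epsilon) {
            int to = Math.min(from + epsilon, documents.size());
            fragments.add(new ArrayList<>(documents.subList(from, to)));
        }
        return fragments;
    }
}
```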

Definition 2

Parameter-(\(\delta , \beta \)). \(\delta \) is the maximum number of documents containing a given keyword \(w_j\) in any fragment of F, and \(\beta \) is the maximum number of keywords contained in any fragment of F.

$$\begin{aligned} {\left\{ \begin{array}{ll} \delta = max \{ |F_{v,j}|\ |\ 1 \le v \le t,\ 1 \le j \le m \}\\ \beta =max \{ |W_v|\ |\ 1 \le v \le t \} \end{array}\right. } \end{aligned}$$
(3)

where \(F_{v,j} \subset F_v\) is the set of documents containing keyword \(w_j\) in fragment \(F_v\), and \(W_v \subset W\) is the set of keywords contained in fragment \(F_v\).

Definition 3

Fragment-based Encrypted Inverted Indexes. \(\widetilde{I}=\{\widetilde{I}_1, \widetilde{I}_2, ..., \widetilde{I}_t\}\). Here, \(\widetilde{I}_v \in \widetilde{I} \) is an encrypted inverted index corresponding to fragment \(F_v\). Each row in \(\widetilde{I}_v\) is \({<} tag_j, \widetilde{PL}_j {>}\), corresponding to a keyword \(w_j\). \(tag_j\) is a hash-based message authentication code of \(w_j\) generated with key c, \(tag_j = hash(c, w_j)\). \(\widetilde{PL}_j\) is the encrypted posting list of \(w_j\). To protect the private information of keyword frequencies, we make every index \(\widetilde{I}_v\) corresponding to a fragment \(F_v \in F\) have the same number of rows, and every posting list in each row the same number of postings. Thus, the generated indexes \(\widetilde{I}\) are indistinguishable. The index \(\widetilde{I}_v\) is constructed as follows:

  1. For any \(d_i\in F_{v,j}\), the corresponding posting \(\widetilde{P}_{j,i}= {<}id(d_i),\widetilde{V}_{d_i}{>}\) is generated and added to \(\widetilde{PL}_j\). Here, \(id(d_i)\) is the id information of document \(d_i\). If \(|F_{v,j}| < \delta \), \(\delta - |F_{v,j}|\) different artificial paddings \(\widetilde{P}_{j,s}= {<}id(d_s), \widetilde{V}_{d_s}{'}{>}\) are constructed and added to \(\widetilde{PL}_j\). \(d_s\) represents a randomly selected document satisfying \(d_s \in F_v - F_{v,j}\), and \(V_{d_s}{'}\) is a randomly generated m-dimensional vector whose values are set as follows:

    $$\begin{aligned} V_{d_s}{'}[k] = {\left\{ \begin{array}{ll} 0, &{}k \ne j \\ rand(min\{V_{d_i}[k]\}), &{} k=j \wedge d_i \in F_{v,j} \\ \end{array}\right. } \end{aligned}$$
    (4)
  2. For any \(w_j \in W_v\), the corresponding row \({<} tag_j, \widetilde{PL}_j {>}\) is generated according to the above step and added to \(\widetilde{I}_v\). If \(|W_v| < \beta \), \(\beta - |W_v|\) different artificial rows \({<} tag_s, \widetilde{PL}_s {>}\) are constructed and added to \(\widetilde{I}_v\). \(tag_s\) is generated from a randomly selected keyword \(w_s\), where \(w_s \in W - W_v\), and \(\widetilde{PL}_s\) is composed of \(\delta \) artificial paddings generated as in step 1.

Fig. 3. Example of \(\widetilde{I}\)

We take an example to explain the above definitions. Assume \(D=\{d_i|i=1, ..., 10\}\) and \(W=\{w_1,w_2,w_3\}\). With \(\varepsilon = 4\), D is divided into \(F_1=\{d_1, d_2, d_3, d_4\}\), \(F_2=\{d_5, d_6, d_7, d_8\}\), and \(F_3=\{d_9, d_{10}\}\). Then \(\delta = 3\) and \(\beta =3 \) are calculated. Finally, the indexes \(\widetilde{I}=\{\widetilde{I_1}, \widetilde{I_2}, \widetilde{I_3}\}\) are generated as shown in Fig. 3.

3.2 Data Preprocessing and Outsourcing

The data preprocessing and outsourcing phase of PPTS is mainly performed by DO and includes three algorithms: GenKey, Setup, and BuildIndex.

\(\varvec{K \leftarrow GenKey(1^ \lambda )}\): On input a security parameter \(\lambda \), the key generation algorithm outputs the key K. DO randomly generate the keys sk and \(c\in \{0,1\}^ \lambda \), an m-dimensional vector S, and two \(m\times m\) invertible matrices \(M_1\) and \(M_2\). Finally, the key \(K=(sk, c, S, M_1, M_2)\) is formed. K is shared between DO and DU but is kept private from CS.
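
A minimal sketch of key generation under these definitions; all names are illustrative, and computing the inverses and transposes of \(M_1\), \(M_2\) offline is omitted (a random real matrix is invertible with overwhelming probability).

```java
import java.security.SecureRandom;

// GenKey sketch: sk and c are random lambda-bit strings, S is a random bit
// vector, and M1, M2 are random m x m matrices used for vector encryption.
final class GenKeySketch {
    static final SecureRandom RNG = new SecureRandom();

    static byte[] randomKey(int lambdaBits) {      // sk or c
        byte[] k = new byte[lambdaBits / 8];
        RNG.nextBytes(k);
        return k;
    }
    static int[] randomBitVector(int m) {          // the splitting indicator S
        int[] s = new int[m];
        for (int j = 0; j < m; j++) s[j] = RNG.nextInt(2);
        return s;
    }
    static double[][] randomMatrix(int m) {        // M1 or M2
        double[][] a = new double[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) a[i][j] = RNG.nextDouble();
        return a;
    }
}
```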

\(\varvec{(\widetilde{V}, \widetilde{D}) \leftarrow Setup(D)}\): On input the document set D, this algorithm outputs the encrypted document vector set \(\widetilde{V}\) and the encrypted document set \(\widetilde{D}\). DO encrypt each document \(d_i\) into \(\widetilde{d_i}\) using sk. Then DO generate the document vector \(V_{d_i}\) according to the VSM and TF-IDF models. The key S is used to split the document vector \(V_{d_i}\) into \(V_{d_i}^{'}\) and \(V_{d_i}^{''}\) according to the following formula, and then the invertible matrices \(M_1\) and \(M_2\) are used to encrypt \(V_{d_i}\) into \(\widetilde{V_{d_i}}=(M_1^{T} V_{d_i}^{'}, M_2^{T} V_{d_i}^{''})\). Finally, the generated \(\widetilde{d_i}\) and \(\widetilde{V_{d_i}}\) are added to \(\widetilde{D}\) and \(\widetilde{V}\) respectively.

$$\begin{aligned} {\left\{ \begin{array}{ll} V_{d_i}^{'}[j]=V_{d_i}^{''}[j]=V_{d_i}[j], \quad &{}S[j]=0\\ V_{d_i}^{'}[j]+V_{d_i}^{''}[j]=V_{d_i}[j], \quad &{}S[j]=1 \end{array}\right. } \end{aligned}$$
(5)
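
A sketch of the vector splitting and encryption in Setup, treating vectors as columns; the helper names are assumptions, and \(M_1^{T}\), \(M_2^{T}\) come from the key K generated above.

```java
import java.util.Random;

// Split V_d into V'_d and V''_d with the secret bit-vector S (Formula 5),
// then encrypt the two shares as (M1^T V'_d, M2^T V''_d).
final class SetupSketch {
    private static final Random RNG = new Random();

    static double[][] split(double[] v, int[] s) {
        double[] v1 = new double[v.length], v2 = new double[v.length];
        for (int j = 0; j < v.length; j++) {
            if (s[j] == 0) {                 // S[j] = 0: both shares equal V[j]
                v1[j] = v2[j] = v[j];
            } else {                         // S[j] = 1: random shares summing to V[j]
                v1[j] = RNG.nextDouble();
                v2[j] = v[j] - v1[j];
            }
        }
        return new double[][] {v1, v2};
    }

    static double[] matTimes(double[][] a, double[] x) {   // column-vector product A x
        double[] r = new double[a.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < x.length; j++)
                r[i] += a[i][j] * x[j];
        return r;
    }

    static double[][] encryptDocVector(double[] vd, int[] s,
                                       double[][] m1T, double[][] m2T) {
        double[][] shares = split(vd, s);
        return new double[][] { matTimes(m1T, shares[0]), matTimes(m2T, shares[1]) };
    }
}
```
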
Algorithm 1. BuildIndex

\(\varvec{\widetilde{I} \leftarrow BuildIndex(K, D, \widetilde{V}, \varepsilon )}\): This algorithm is run by DO to generate the encrypted indexes. Its inputs are the key K, the document set D, the encrypted document vector set \(\widetilde{V}\), and the fragmentation parameter \(\varepsilon \); its output is \(\widetilde{I}\). The procedure is shown in Algorithm 1: DO first divide D into fragments and then build an encrypted index for each fragment. Since the operations on each fragment are exactly the same after fragmentation, the data preprocessing stage can be executed in parallel. Finally, DO outsource \(\widetilde{D}\) and \(\widetilde{I}\) to CS.
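
As a complement to Algorithm 1, the sketch below mirrors the flow described above (one padded, keyed-hash-tagged index per fragment); it is not the paper's exact pseudocode. Helper names such as hmacTag and paddingPosting are assumptions, the Posting class is the one from the inverted-index sketch in Sect. 2.1, and vector encryption follows the Setup sketch.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Simplified BuildIndex flow: one encrypted inverted index per fragment,
// every posting list padded to delta postings and every index to beta rows.
final class BuildIndexSketch {

    static String hmacTag(byte[] keyC, String keyword) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");               // tag_j = hash(c, w_j)
        mac.init(new SecretKeySpec(keyC, "HmacSHA256"));
        return Base64.getEncoder()
                     .encodeToString(mac.doFinal(keyword.getBytes(StandardCharsets.UTF_8)));
    }

    // fragmentPostings: for one fragment F_v, keyword w_j -> real postings of F_{v,j}.
    static Map<String, List<Posting>> buildFragmentIndex(byte[] keyC,
                                                         Map<String, List<Posting>> fragmentPostings,
                                                         Set<String> dictionary,
                                                         int delta, int beta) throws Exception {
        Map<String, List<Posting>> index = new HashMap<>();
        for (Map.Entry<String, List<Posting>> e : fragmentPostings.entrySet()) {
            List<Posting> pl = new ArrayList<>(e.getValue());
            while (pl.size() < delta) pl.add(paddingPosting());    // artificial postings
            index.put(hmacTag(keyC, e.getKey()), pl);
        }
        // Artificial rows for keywords absent from this fragment, until beta rows exist
        // (Definition 3 selects these keywords randomly from W - W_v).
        for (String w : dictionary) {
            if (index.size() >= beta) break;
            if (fragmentPostings.containsKey(w)) continue;
            List<Posting> pl = new ArrayList<>();
            while (pl.size() < delta) pl.add(paddingPosting());
            index.put(hmacTag(keyC, w), pl);
        }
        return index;
    }

    // Placeholder for the artificial padding of Definition 3 (a randomly chosen
    // document id and a deliberately low-scoring vector); details omitted here.
    static Posting paddingPosting() {
        return new Posting("padding-" + UUID.randomUUID(), new double[0]);
    }
}
```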

3.3 Map-Reduce-Based Top-k Search

The Map-Reduce-based top-k search phase of PPTS is performed by DU and CS. DU generate a search trapdoor TD and submit it to CS. CS performs the Map operation according to TD to obtain the k documents most relevant within each fragment, and then performs the Reduce operation to merge and rank the previously acquired documents into the final top-k results. This phase mainly contains two polynomial-time algorithms: GenTrapdoor and Search.

\(\varvec{TD \leftarrow GenTrapdoor (K, Q, k)}\): This algorithm takes as input the key K, the search keyword set Q, and the number k of documents to be returned, and outputs the encrypted query as a trapdoor TD. Its goal is to hide the keyword information in the query from CS. TD is constructed in the following steps:

  1. The query vector \(V_q\) is constructed according to Q. If \(w_i\in Q\), the IDF of \(w_i\) is stored in \(V_q[i]\); otherwise, \(V_q[i]\) is 0. Then, according to the following formula, \(V_q\) is split into two vectors \(V_q^{'}\) and \(V_q^{''}\). Finally, \(V_q^{'}\) and \(V_q^{''}\) are encrypted with the invertible matrices \(M_1\) and \(M_2\) to obtain the encrypted query vector \(\widetilde{V_q}=(M_1^{-1}V_q{'}, M_2^{-1}V_q{''})\) (see the sketch after these steps).

    $$\begin{aligned} {\left\{ \begin{array}{ll} V_q^{'}[j]+V_q^{''}[j]=V_q[j], \quad S[j]=0\\ V_q^{'}[j]=V_q^{''}[j]=V_q[j], \quad S[j]=1 \end{array}\right. } \end{aligned}$$
    (6)
  2. The hash-based message authentication code \(tag_i\) of each \(w_i \in Q\) is calculated, forming the set \(T = \{tag_i\ |\ tag_i= hash(c,w_i) \wedge w_i\in Q\}\).

  3. Output \(TD=(T, \widetilde{V}_q, k)\).
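
As referenced in step 1, a minimal sketch of the query-vector splitting (the mirror of Formula 5); the two shares are then multiplied by \(M_1^{-1}\) and \(M_2^{-1}\), e.g. with the matTimes helper from the Setup sketch, and the tags reuse hmacTag from the BuildIndex sketch.

```java
import java.util.Random;

// Trapdoor sketch: V_q is split with the rule of Formula 6, which mirrors
// Formula 5 (random split when S[j]=0, equal shares when S[j]=1).
final class TrapdoorSketch {
    private static final Random RNG = new Random();

    static double[][] splitQuery(double[] vq, int[] s) {
        double[] v1 = new double[vq.length], v2 = new double[vq.length];
        for (int j = 0; j < vq.length; j++) {
            if (s[j] == 1) {                   // S[j] = 1: equal shares
                v1[j] = v2[j] = vq[j];
            } else {                           // S[j] = 0: random shares summing to V_q[j]
                v1[j] = RNG.nextDouble();
                v2[j] = vq[j] - v1[j];
            }
        }
        return new double[][] {v1, v2};
    }
}
```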

Algorithm 2

\(\varvec{RS \leftarrow Search (\widetilde{I}, TD)}\): When CS receives the trapdoor TD, it performs the top-k search in parallel on the basis of the indexes \(\widetilde{I}\), and then returns the resulting encrypted documents. The standard Map-Reduce model is adopted to find the top-k relevant documents. In the Map stage, a local top-k result is obtained in each fragment. In the Reduce stage, all local top-k results are merged to obtain the global top-k result. Detailed procedures are shown in Algorithms 2 and 3.
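
As a complement to Algorithms 2 and 3, the plain-Java sketch below only mirrors the described logic (local top-k per fragment in the Map stage, merging in the Reduce stage); it is not the paper's exact pseudocode. In an actual deployment these functions would be wrapped in Hadoop Mapper/Reducer classes, and the relevance computation over encrypted shares is abbreviated to a stub. The Posting class is the one from Sect. 2.1.

```java
import java.util.*;

// Map-Reduce logic sketch: each "map" call scores one fragment's index against
// the trapdoor and keeps a local top-k; "reduce" merges all local results.
final class SearchSketch {
    record Scored(String docId, double score) {}

    // Map stage: local top-k within one fragment index (tag -> posting list).
    static List<Scored> map(Map<String, List<Posting>> fragmentIndex,
                            Set<String> trapdoorTags, double[][] encQuery, int k) {
        PriorityQueue<Scored> heap =
                new PriorityQueue<>(Comparator.comparingDouble(Scored::score));
        Set<String> seen = new HashSet<>();
        for (String tag : trapdoorTags) {
            for (Posting p : fragmentIndex.getOrDefault(tag, List.of())) {
                if (!seen.add(p.docId)) continue;          // a document may appear in several lists
                heap.offer(new Scored(p.docId, relevance(p, encQuery)));
                if (heap.size() > k) heap.poll();          // keep only the best k
            }
        }
        return new ArrayList<>(heap);
    }

    // Reduce stage: merge the local top-k lists into the global top-k.
    static List<Scored> reduce(List<List<Scored>> localResults, int k) {
        return localResults.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(k)
                .toList();
    }

    // Relevance score: inner products of the encrypted share pairs, which equal
    // V_d . V_q by the secure inner product property; layout details omitted.
    static double relevance(Posting p, double[][] encQuery) {
        return 0.0;  // stub in this sketch
    }
}
```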

Algorithm 3

According to the structure of the index and the top-k search algorithms, each encrypted inverted index for a fragment is independent and the top-k search follows the Map-Reduce model. Thus, the proposed search scheme is scalable: when the volume of outsourced data grows, the search efficiency can be preserved by adding servers.

4 Security Analysis

This section analyzes PPTS from the two aspects of security and efficacy. Security concerns the confidentiality of documents, indexes, and trapdoors. Efficacy concerns the scalability of PPTS, showing that it supports parallel search and can handle large document sets.

Theorem 1

PPTS satisfies privacy requirements.

Proof

First, a symmetric encryption algorithm is used to encrypt documents in PPTS, which protects the privacy of documents as long as the key is not leaked. Second, the security of \(\widetilde{I}\) is guaranteed by the keyed hashing of keywords, the filling of redundant values, and matrix encryption. Because of the properties of the hash-based message authentication code, attackers cannot recover keyword information from the codes. The document vectors and query vectors are encrypted by matrix encryption. Therefore, the indexes and trapdoors of PPTS remain confidential. In addition, the problem of correlation between similar trapdoors can be mitigated by randomly adding redundant values to the search trapdoors.

Theorem 2

PPTS has parallel execution capability.

Proof

First, the indexes \(\widetilde{I}\) are designed for the parallel computing framework Map-Reduce, that is, both the Map and Reduce phases can be executed in parallel. Second, HDFS, as a distributed file system, can store large volumes of data. On a Hadoop cluster, DO do not need to pay attention to the details of data storage and transmission; they only need to submit \(\widetilde{D}\) and \(\widetilde{I}\) to Hadoop to provide a secure, stable, and effective search service for DU, which is highly practical. Also, the execution capability can be improved linearly by adding servers, providing almost unlimited processing power.

Theorem 3

The accuracy and privacy of search are not affected by artificial paddings and rows.

Proof

We assume that \(\widetilde{P}_{j,s} = \, < id(d_s), \widetilde{V}_{d_s}{'}>\) is an artificial padding added to the posting list \(\widetilde{PL}_j\) of keyword \(w_j\). Every \(d_i \in F_{v,j}\) satisfies the following inequality:

$$\begin{aligned} Score(V_{d_s}{'}, V_q) = V_{d_s}{'} \cdot V_q = rand(min\{V_{d_i}[j]\}) \times V_q[j] < Score(V_{d_i}, V_q) \end{aligned}$$
(7)

Therefore, the relevance score of a padding must be lower than that of any real document in \(F_{v,j}\). When searching, DU only care about the documents with the highest relevance scores, so the added paddings do not affect the accuracy of the search results. By adding paddings and rows, the posting lists corresponding to all keywords have equal length, and the indexes of all fragments have equal width. Therefore, it is impossible to judge whether a posting or a row is artificial based on the length of the posting list. Because \(d_s \notin F_{v,j} \wedge d_s \in F_v\), \(d_s\) is unique and indistinguishable within the posting list \(PL_j\). As a result, the added paddings do not affect the privacy of the search results.

5 Performance Evaluation

To evaluate the performance of PPTS, we implement it on the Hadoop platform and compare its time cost with SPTS. Here, SPTS is a sequential privacy-preserving top-k search scheme running on a single server; in other words, SPTS uses one server to search each \(\widetilde{I}_v \in \widetilde{I}\) sequentially. We extend the New York Times dataset [19] to generate our experimental dataset, which contains 3,600,000 documents and 228,623 extracted keywords. We implement the schemes in Java on a Hadoop platform with three servers. Each server has a 3.2 GHz 8-core CPU, 16 GB memory, and a 1 TB hard disk. The default parameters are \(n=3{,}600{,}000\), \(|Q|=15\), \(k=5\), and \(\varepsilon =30{,}000\), which are the number of documents, the number of search keywords, the number of returned documents, and the fragmentation parameter, respectively.

In the following experiments, we evaluate the time cost of searches when one of the parameters n, k, and |Q| changes while the other parameters take the default values. The results are shown in Figs. 4, 5 and 6.

Figures 4, 5 and 6 all show that the proposed PPTS outperforms SPTS in the time cost of ranked searches, and the former saves at least 80% of the time cost compared with the latter. The reason is that both PPTS and SPTS are based on the inverted index. In the inverted index, the number of candidate postings corresponding to the search keywords is positively correlated with the numbers of documents and search keywords, but is not affected by the number of returned documents k. As the number of documents or search keywords increases, more resources are needed to calculate the relevance scores. SPTS only uses a single server with limited processing power, which can reach its processing bottleneck and make the search slower and slower. PPTS, however, uses multiple servers to perform the search at the same time, so the task pressure is shared among multiple servers and the time cost does not increase too much.

Fig. 4. Number of documents n \((\times 10^6)\)

Fig. 5. Number of search documents k

Fig. 6. Number of search keywords |Q|

6 Conclusion

In this paper, we propose a parallel privacy-preserving top-k search scheme over encrypted cloud data. In this scheme, the fragment-based encrypted inverted index is designed, which is indistinguishable and can be used for parallel searching. On the basis of such indexes, the Map-Reduce-based distributed computing framework is adopted and the parallel multi-keyword top-k search algorithms are proposed. Security analysis and experiment evaluation show that the proposed scheme is privacy-preserving, efficient and scalable.