
1 Introduction

Passwords are a widely used form of authentication online. However, one major weakness is that human-chosen passwords can often be guessed by attackers. In fact, with the regular occurrence of leaks of password datasets [9], attackers are provided with an increasing amount of data to inform password guesses. It is important for security advocates and researchers to understand the capabilities of attackers who have access to this data, so that we can create countermeasures to protect the security of users.

Guessing passwords involves either formulating new words to try as guesses or using existing wordlists that include common password choices, words from language dictionaries and datasets of previous password leaks. A human attacker guessing passwords will look for clues such as language, nationality and composition policies that might indicate a good wordlist to use against a password set, i.e. a set of unknown passwords. In this paper we investigate whether we can automate this learning and use it to inform wordlist choice. To our knowledge, this learning problem has not been studied before.

The order in which guesses are made can be important for a password guesser, who may only be able to make a small number of guesses before being locked out. A strategy that learns quickly to maximise rewards is therefore valuable. In this paper, we will show the speed of our learning strategy and compare it to the optimal rate of password compromise, a term we discuss in more detail later.

Users choosing a password are known to be influenced by common factors. For example, users from similar demographics will often choose similar passwords [1, 6, 13]. In addition, users have been observed choosing passwords that reflect the nature of the website they are choosing the password for [17, 27]. In this paper, we investigate whether an automated learning algorithm can identify these idiosyncrasies within a password set and if it can leverage this knowledge in order to improve the success of password guessing rates.

In this paper, our contributions are as follows:

  • Our algorithm suggests guesses to be made against a population of users in an online or offline attack. After each guess, it uses the relative success of all previous guesses to identify how well each wordlist matches the passwords (see Sect. 3.1). Importantly, it requires no a-priori training.

  • In many previous wordlist approaches, a single ordered wordlist is created. In our method, wordlists are separated based on their source or characteristics. This allows for effective guessing from the promising wordlists, which is tailored to the characteristics of the password set.

  • Its adaptive nature allows it to react to new information and tailor the guessing strategy accordingly. We show that within 1–10 guesses the model can determine useful information about the password set. We also see that the guesses are relatively close to an optimal strategy.

  • Given a password set formed by users predominantly of a single nationality, our model can accurately recognise this characteristic and tailor the guessing to use an appropriate wordlist. We also show that this improves guessing, revealing that choosing a wordlist made up of other users from that same nationality can improve guessing over using a general wordlist (even when language differences are not a factor).

In this paper, we describe our full multi-armed bandit model and demonstrate its effectiveness through simulations and real-world guessing. We begin in Sect. 2 with an overview of related work. Then, in Sect. 3, the multi-armed bandit (MAB) problem is placed in the context of password guessing. In this section, we also describe the set-up of the Maximum Likelihood Estimation design and validation. In Sect. 4, we investigate the general guessing performance of the multi-armed bandit model. Section 5 shows that the MAB can identify demographic information and leverage this for guessing. We discuss our overall results in Sect. 6 and conclude in Sect. 7. We also include all our code for implementing the multi-armed bandit in the following repository [16].

2 Related Work and Background

Researchers have long been interested in modelling and improving password guessing. The first strategic methods involved dictionary attacks (that is, using a set wordlist to make guesses). These were proposed by Morris and Thompson in 1979 [15] and are still widely used today in the form of John the Ripper [19] and HashCat [25].

Building on the simple dictionary method, in 2005, Narayanan and Shmatikov employed Markov models to enable faster guessing [18]. A Markov model can predict the next character in a sequence based on the current character. In 2009, Weir et al. used probabilistic context-free grammars (PCFG) to guess passwords [28]. PCFGs characterise a password according to its “structures”. Structures can be password guesses or word mangling templates that can take dictionary words as input. In 2013, Dürmuth et al. proposed an updated password guessing model based on Markov models, called OMEN [4]. As part of their initial paper they demonstrated an OMEN-specific method for merging personal information with a wordlist of guesses [2].

In 2016, Wang et al. developed a targeted password guessing model which seeds guesses using users’ personally identifiable information [26]. Wang et al. leverage existing probabilistic techniques including Markov models and PCFG as well as Bayesian theory. They create tags for specific personally identifying information (PII) associated with a user. In their most successful version, TarGuess-I, they use training with these type-based PII tags to create a semantic aware PCFG. Independently, Li et al. also created a method for seeding password guesses with personal information. Their guessing also extended the probabilistic context free grammar method [11].

Also in 2016, the use of artificial neural networks for password guessing was proposed by Melicher et al. [14]. Artificial neural networks are computational models inspired by biological neural networks; they are a machine learning technique particularly useful for fuzzy classification problems and for generating novel sequences (such as a password not in the training data). Melicher et al. show that their neural network method can be more effective than both Markov and PCFG methods. In addition, because neural networks can be highly compressed, they show that networks can be used to efficiently carry out client-side password strength checking. In 2017, Houshmand and Aggarwal created a method for merging multiple grammars for wordlist-based PCFG models [8].

In 2019, Hitaj et al. proposed using deep generative adversarial networks (GAN) to create password guesses [7]. A generative adversarial network pits one neural network against another in a zero-sum game. PassGAN is able to autonomously learn the distribution of real passwords from leaked passwords and can leverage these to generate guesses. In contrast to Markov and PCFG models, PassGAN does not require a-priori knowledge of password structures.

In 2018, Xia et al. suggested a deep learning model which combines PCFG with the LSTM neural network [29]. This method, called GENPass, was designed to overcome the limitation that neural networks cannot, in their raw form, be used for cross-site attacks. This was an important contribution, as leveraging passwords leaked from one website to guess the same users’ passwords on another site is a common attack. In 2019, Pal et al. developed a password manipulation tool called PASS2PATH [20]. Leveraging the knowledge that users alter and reuse their passwords, this model can transform a base user password into targeted password guesses.

Probably the most similar to ours is the work by Pasquini et al. [21, 22]. They introduced the idea of “password strong locality” to describe the grouping together of passwords that share fine-grained characteristics. This password locality can be leveraged to train their learning model to generate passwords similar to those seen, helping with password guessing. Our model differs from this previous work: earlier approaches used learning techniques to inform how to create effective words to guess against passwords. In our work, we assume lists of guesses already exist in the form of multiple wordlists; our learning technique informs which wordlist will be most effective for guessing the particular password set, and how to combine guesses from multiple wordlists in order to utilise them effectively.

Users’ passwords are inherently guessable as they often take predictable forms which are influenced by the users’ language and the website they were created for. In 2012, Malone and Maher investigated passwords created for different websites and found that nationality plays a role in users’ choice of passwords [13]. They also found that passwords often follow a theme related to the service they were chosen for. For example, a common LinkedIn password is “LinkedIn”. This result was further investigated by [27].

Researchers have also studied password sets in particular languages. Sishi et al. studied the strength and content of passwords derived from seven South African languages [24]. Li et al. completed a large scale study of Chinese web passwords [12]. Weir et al. used Finnish passwords as a basis for studying the use of PCFG in the creation of password guesses and mangling rules [28]. Dell et al. [3] included both Italian and Finnish password sets and guessed them using English, Italian and Finnish wordlists. They draw conclusions on the general strength of the passwords and the diminishing-returns nature of guessing. In this work, we have chosen password sets from three nationalities: Irish, English and German. Despite Irish and English users both being predominantly English speaking, we show that leveraging nationality to choose a relevant wordlist still improves guessing.

3 The Multi-armed Bandit Problem

The multi-armed bandit problem describes the trade-off a gambler faces when presented with a number of different gambling machines. Each machine provides a random reward from a probability distribution specific to that machine. The crucial problem the gambler faces is how much time to spend exploring different machines and how much time to spend exploiting the machine that seems to offer the best rewards. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.

In our scenario, we regard each wordlist as a bandit which will give a certain distribution of successes. We want to explore the returns from each wordlist and also exploit the most promising wordlist, in order to make effective guesses. With each guess we learn more about the distribution of the password set we are trying to guess. Leveraging this knowledge, we can guess using the wordlist that best matches the password set distribution, thus maximising rewards.

3.1 Password Guessing: Problem Set-up

Suppose we have n wordlists. Each wordlist \(i = 1, \ldots , n\) has a probability distribution \(p_i\), and \(\sigma _i(k)\) denotes the position of password k in wordlist i. So, the probability assigned to password k in wordlist i is \(p_{i,\sigma _i(k)}\).

Suppose we make m guesses where the words guessed are \(k_j\) for \(j = 1 \ldots m\). Each of these words is guessed against the N users in the password set and we find \(N_j\), the number of users’ passwords compromised with guess number j.
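The reward observation step can be sketched in a few lines of Python (a minimal illustration of the notation; the helper name `guess_rewards` and the list-based data layout are our own, not part of the paper):

```python
from collections import Counter

def guess_rewards(guesses, password_set):
    """For each guessed word k_j, return N_j: the number of users in the
    password set whose password matches that guess."""
    counts = Counter(password_set)
    return [counts[k] for k in guesses]

# toy password set of N = 5 users
users = ["123456", "password", "123456", "qwerty", "letmein"]
rewards = guess_rewards(["123456", "password", "iloveyou"], users)  # [2, 1, 0]
```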

To model the password set that we are trying to guess, we suppose it has been generated by choosing passwords from our n wordlists. Let \(q_i\) be the proportion of passwords from wordlist i that generated the password set. Our aim will be to estimate \(q_1, \dots , q_n\) noting that

$$\begin{aligned} \sum _i^n q_i = 1 \qquad \hbox {and} \qquad q_i \ge 0. \end{aligned}$$
(1)

This means that the \(q_i\) are coordinates of a point in a probability simplex. If the password set was really composed from the wordlists with proportions \(q_i\), the probability of seeing password k in the password set would be

$$\begin{aligned} Q_k := \sum _{i=1}^{n} q_i p_{i,\sigma _i(k)}. \end{aligned}$$
(2)

Given the \(N_j\), we will use \(Q_k\) to build a maximum likelihood estimator.
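As a concrete sketch of Eq. (2), we can represent each wordlist as a dictionary mapping passwords to their within-list probabilities (the function name and data layout are our own illustration; a password absent from a wordlist contributes probability zero):

```python
def mixture_probability(k, wordlists, q):
    """Q_k = sum_i q_i * p_{i, sigma_i(k)}: the probability of seeing
    password k if the password set were drawn from the wordlists with
    mixture weights q (Eq. 2)."""
    return sum(qi * wl.get(k, 0.0) for qi, wl in zip(q, wordlists))

wl1 = {"123456": 0.5, "password": 0.3, "qwerty": 0.2}
wl2 = {"123456": 0.2, "hallo123": 0.8}
Qk = mixture_probability("123456", [wl1, wl2], [0.6, 0.4])  # 0.6*0.5 + 0.4*0.2 = 0.38
```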

3.2 Maximum Likelihood Estimation

Given this problem set-up, we will construct a likelihood function which will describe the likelihood that a given set of parameters \(q_1, \dots , q_n\) describe the password set. In this section, we introduce the likelihood estimator, prove that a unique maximum value exists and demonstrate convergence to this maximum.

Likelihood Function. We construct the following likelihood for our model with m guesses:

$$\begin{aligned} \small \mathcal {L} =&\left( {\begin{array}{c}N\\ N_1 \, \cdots \, N_m \, (N - N_1 \cdots -N_m)\end{array}}\right) \, Q_{k_1}^{N_1} Q_{k_2}^{N_2} \cdots Q_{k_m}^{N_m} \nonumber \\&\qquad {\times } \,\left( 1- Q_{k_1} \cdots - Q_{k_m} \right) ^{N -N_1 \cdots - N_m}, \end{aligned}$$
(3)

where the first term is the multinomial coefficient counting the ways the observed successes could be distributed among the N users. Each middle term is the probability of seeing password \(k_j\) in the password set, \(Q_{k_j}\), raised to the power of the number of times it was actually seen, \(N_j\). The final term accounts for the remaining users’ passwords in the password set that have not yet been compromised by any of the m guesses.

Our goal is to maximise this likelihood function by choosing good estimates for \(q_1, \ldots q_n\) based on our observed rewards from each previous guess. Note, with each guess we learn more about \(q_i\) for all the wordlists. In fact, one of the features of this model compared to a traditional multi-armed bandit model is that when we make a guess we learn something about all the wordlists.

We can take the \(\log \) of the likelihood function to create a simplified expression. In addition, we can remove the multinomial which is simply a constant for any values of \(\vec {Q}\). This leaves us with:

$$\begin{aligned} \small \log \mathcal {L} =&\hbox { const} + N_1 \log {Q_{k_1}} {+}\,N_2 \log {Q_{k_2}} \cdots + N_m \log {Q_{k_m}}\nonumber \\&{+}\, (N -N_1- \cdots - N_m) \, \log { \left( 1- Q_{k_1}- \cdots - Q_{k_m} \right) }. \end{aligned}$$
(4)

We will show that the log-likelihood function, \(\log \mathcal {L}\), is concave. This means that the likelihood function has a unique maximum value [23], making it a good candidate for numerical optimisation.
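Dropping the constant multinomial term, Eq. (4) can be evaluated directly. The following sketch (our own helper, assuming every guessed word has \(Q_{k_j} > 0\) whenever \(N_j > 0\)) computes the log-likelihood for a candidate weight vector, with wordlists again as password-to-probability dictionaries:

```python
import math

def log_likelihood(q, guessed_words, rewards, N, wordlists):
    """log L up to an additive constant (Eq. 4).
    rewards[j] holds N_j, the successes of guess j against the N users."""
    Qs = [sum(qi * wl.get(k, 0.0) for qi, wl in zip(q, wordlists))
          for k in guessed_words]
    # sum of N_j * log Q_{k_j} over the guesses made so far
    ll = sum(Nj * math.log(Qk) for Nj, Qk in zip(rewards, Qs) if Nj > 0)
    # remaining, uncompromised users fall in the leftover probability mass
    ll += (N - sum(rewards)) * math.log(1.0 - sum(Qs))
    return ll

wl = {"a": 0.5, "b": 0.5}
value = log_likelihood([1.0], ["a"], [1], 2, [wl])  # log(0.5) + log(0.5)
```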

Theorem 1 (Concavity of log likelihood function)

The log likelihood function \(\log \mathcal {L}\) for \(\mathcal {L}\) defined in Eq. 3 is concave.

Proof. We are required to prove that \( \log \mathcal {L}(\alpha \vec {q} + (1-\alpha )\vec {r}) \ge \alpha \log \mathcal {L}(\vec {q})+(1-\alpha )\log \mathcal {L}(\vec {r})\). We begin by simplifying the notation used in the likelihood function by converting to vector notation.

Let \(g(\vec {Q}) = \displaystyle \sum _{j=1}^{m+1} N_j \log {Q_j}\) where \(N_{m+1} = N - \displaystyle \sum _{j=1}^{m} N_j\) and \(Q_{m+1} = 1 - \displaystyle \sum _{j=1}^m Q_{j}\).

As before (Eq. 3), \(Q_{j} = \displaystyle \sum _{i=1}^{n} q_i p_{i,\sigma _i(k_j)}\) and therefore \(Q_{m+1} \,{=}\, 1- \displaystyle \sum _{j=1}^m \displaystyle \sum _{i=1}^n q_i p_{i,\sigma _i(k_j)} {=}\, \displaystyle \sum _{i=1}^{n} q_i (1 - \displaystyle \sum _{j=1}^m p_{i,\sigma _i(k_j)}).\)

Observe, we can define a matrix P so \(\vec {Q} = P\vec {q}\) and \(\log \mathcal {L}(\vec {q}) = \displaystyle \sum _{j=1}^{m+1} N_j \log Q_j = g(\vec {Q}).\)

Therefore, we can now rewrite

$$\begin{aligned} \log \mathcal {L}(\alpha \vec {q} + (1-\alpha )\vec {r})&= g\left( P(\alpha \vec {q} + (1-\alpha )\vec {r})\right) =g\left( \alpha P \vec {q} + (1-\alpha )P\vec {r}\right) \\&=\sum _{j=1}^{m+1} N_j \log (\alpha (P\vec {q})_j + (1-\alpha )(P\vec {r})_j) \ge \displaystyle \sum _{j=1}^{m+1} N_j ( \alpha \log (P\vec {q})_j + (1-\alpha ) \log (P\vec {r})_j)\\&=\alpha \displaystyle \sum _{j=1}^{m+1} N_j \log (P\vec {q})_j + (1-\alpha ) \displaystyle \sum _{j=1}^{m+1} N_j \log (P\vec {r})_j= \alpha g(\vec {Q}) + (1-\alpha )g(\vec {R}) \\&= \alpha \log \mathcal {L}(\vec {q})+(1-\alpha )\log \mathcal {L}(\vec {r}). \end{aligned}$$

As the log-likelihood function is concave, there will be a unique maximum likelihood value. We will use gradient descent to find the \(q_i\) that maximises \(\mathcal {L}\) after m guesses subject to the constraints (1). We iteratively change the estimated \(q_i\) values, \(\hat{q}_i\).

Gradient Descent. As we apply iterations of gradient descent to estimate the parameters \(q_1, \dots , q_n\) which maximise the likelihood function, we must maintain the constraints of the system. In particular,

$$\begin{aligned} \sum _{i=1}^n q_i = 1 \qquad \hbox {and} \qquad q_i \ge 0. \end{aligned}$$

To meet these constraints, we project the gradient vector onto the probability simplex and then adjust our step size so that we stay within that space.

With each iteration of our gradient descent we move a step in the direction which maximises our likelihood function. The gradient is scaled by a factor \(\alpha \) (described in Sect. 3.3) to give a base step size. This is further scaled by an amount \(\beta \) to ensure the move from \(\vec {p}\) to \(\vec {p} + \alpha \beta \vec {g}\) satisfies \(\beta \le 1\), \(\beta ||\vec {g}|| \le 1\) and \(\vec {p} + \alpha \beta \vec {g}\) lies within the simplex.
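One way to realise such a constrained step is sketched below. This is our own reading of the description, not the paper's implementation: the gradient is projected onto the plane \(\sum_i q_i = 1\) (so its components sum to zero), \(\beta\) enforces the step-size bounds, and the step is shrunk further if any coordinate would go negative:

```python
import numpy as np

def projected_gradient_step(q, grad, alpha=0.1):
    """Take one ascent step while keeping q on the probability simplex."""
    g = grad - grad.mean()                     # project: components now sum to 0
    beta = min(1.0, 1.0 / max(np.linalg.norm(g), 1e-12))  # beta <= 1, beta*||g|| <= 1
    step = alpha * beta * g
    neg = step < 0
    if neg.any():
        # shrink the step so no coordinate of q crosses below zero
        step = step * min(1.0, float(np.min(q[neg] / -step[neg])))
    return q + step

q = np.full(4, 0.25)
q_new = projected_gradient_step(q, np.array([1.0, 0.0, 0.0, 0.0]))
```

Because the projected step sums to zero, `q_new` still sums to one, and the shrinking factor keeps every coordinate non-negative.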

Gradient Descent Validation. The goal of the gradient descent is to converge towards the maximum of the likelihood function and thus find the proportions \(q_i\) that provide the best explanation of the distribution of the password set seen after m guesses. For initial validation of the gradient descent performance we take four different leaked password datasets as wordlists.

The datasets we used are leaked passwords from users of hotmail.com, flirtlife.de, computerbits.ie and 000webhost.com. They contain 7,300, 98,912, 1,795 and 15,252,206 users’ passwords respectively. The datasets were compromised by various methods, so the lists may only contain a random, and possibly biased, sample of users [17]. As far as we can tell, only the 000webhost dataset imposed composition policies on users’ passwords [5].

We took a random sample of 1000 users’ passwords from the 98,912 users in the Flirtlife dataset. In [17], it was shown that when guessing a sample of leaked passwords from a website, the most effective guesses come from the passwords of other users of that same site. Therefore, if gradient descent is effective we expect it to show that the sample most closely compares to the Flirtlife wordlist.

In Fig. 1, we show the \(q\)-value estimates during the convergence of the gradient descent for the likelihood function seeded with just one guess. This single guess was the password 123456, because it is widely considered to be the most commonly used password. We began by setting the \(\hat{q}\)-values to \(1/n = 0.25\), and then used 100 steps of gradient descent with a scaling of \(\alpha = 0.1\) on the step size to estimate the proportions. We recorded the \(q_i\) values that gave the maximum likelihood value during the gradient descent. This way, we remember the best value if we overstep the maximum.

The password 123456 occurred in all four password sets, but using the distributions of those datasets the likelihood function was able to determine that the proportion in the sample best matched the proportion in Flirtlife. If we were guessing the full Flirtlife dataset with several guesses, rather than just a sample from it with one guess, then this proportion would be closer to 100%.

Fig. 1. Estimating the distribution of a password set using information from 1 guess and \(\alpha = 0.1\).

In the above example, all 1000 passwords came from a random sample of Flirtlife. We will investigate in later examples whether the maximum likelihood estimation can determine the breakdown of where passwords come from when composed of different wordlists.

3.3 Multi-armed Bandit Design and Validation

Let us now consider some choices faced when designing a password-guessing multi-armed bandit. We need to choose how to repeatedly apply our gradient-based maximum likelihood estimation, and how we will choose guesses. We are interested in which of these variations produces the best results.

Gradient Descent Initialization Variables. We expect the gradient descent to improve with each guess made since every guess provides it with more information. There are a number of ways of initialising the gradient descent after each guess provides new information. The following are three different methods for choosing the initialisation value:

  • Random. Randomly pick starting values for \(\hat{q}_i\), subject to Eq. (1),

  • Average. Choose the average starting value, i.e. assume the passwords are uniformly distributed between the n wordlists, so \( \hat{q}_i = 1/n\),

  • Best. Use our previous best estimate for the \(\hat{q}\)-values, based on the gradient descent results for the previous guess.
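The three initialisation options might be sketched as follows (the normalised-random draw is only one way to pick a valid random starting point: it satisfies the constraints of Eq. (1) without being uniform on the simplex; the helper name is our own):

```python
import random

def init_q(n, method="average", previous_best=None):
    """Starting q-hat values for gradient descent, per the three options."""
    if method == "random":
        w = [random.random() for _ in range(n)]
        s = sum(w)
        return [x / s for x in w]          # non-negative, sums to 1
    if method == "average":
        return [1.0 / n] * n               # uniform over the n wordlists
    if method == "best":
        return list(previous_best)         # carry over the last estimate
    raise ValueError(f"unknown method: {method}")

init_q(4)  # [0.25, 0.25, 0.25, 0.25]
```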

Gradient Descent Base Step Size. We looked at a number of techniques for choosing our base step size \(\alpha \) for gradient descent, including a constant value of \(\alpha \), an \(\alpha \) that resulted in a constant \(l_2\) step size and an adaptive method. In supplementary material for this paper [16], we demonstrate that a constant \(\alpha =0.1\) was a reasonable choice for our model.

Informing Our Guess Choices. Once we have generated our estimate of the \(\hat{q}\)-values, we want to use them to inform our next guess. We suggest three options for how to choose our next guess:

  • Random. Randomly choose a wordlist and guess the next most popular password in that wordlist.

  • Best wordlist. Guess the next most popular password from the wordlist with the highest corresponding \(\hat{q}\)-value.

  • Q-method. Use information from the \(\hat{q}\)-values combined with the frequencies of the passwords in the wordlists to inform our next guess.

These options have different advantages. In the first option, we randomly choose a wordlist to guess from, but we are still taking the most probable guess from the wordlist we choose. This option emphasises the continued exploration of all the wordlists. In the second option, we are choosing the wordlist we believe accounts for the largest proportion of the password set.

The last option bases password guess choices directly on Eq. (2). It uses our predicted \(\hat{q}\)-values to estimate the probability of seeing each word k. Suppose, for example, a word k has frequency \(f_1(k)\) in wordlist 1 and also occurs in wordlists 2 and 3 with frequencies \(f_2(k)\) and \(f_3(k)\) respectively. Using (2), where \(p_{i,\sigma _i(k)}= f_i(k)/{\textit{size of wordlist }} i \), we can compute the total probability of this word occurring in the password set. This method determines which word k has the highest probability of being in the password set and uses this word as our next guess.
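A minimal sketch of the Q-method selection, again with wordlists as password-to-probability dictionaries and a helper name of our own choosing:

```python
def q_method_guess(q_hat, wordlists, already_guessed):
    """Return the unguessed word k maximising Q_k = sum_i q_i * p_{i,sigma_i(k)}."""
    candidates = set().union(*wordlists) - already_guessed
    return max(candidates,
               key=lambda k: sum(qi * wl.get(k, 0.0)
                                 for qi, wl in zip(q_hat, wordlists)))

wl1 = {"123456": 0.6, "password": 0.4}
wl2 = {"password": 0.7, "qwerty": 0.3}
# "password" wins: Q = 0.5*0.4 + 0.5*0.7 = 0.55, beating "123456" at 0.30
first = q_method_guess([0.5, 0.5], [wl1, wl2], set())
second = q_method_guess([0.5, 0.5], [wl1, wl2], {"password"})
```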

We will now look at some examples of the performance of our multi-armed bandit model against simulated password sets. Guessing against simulated password sets allows us to identify whether the multi-armed bandit model is capable of identifying the characteristics of a password set (i.e. the \(q_i\)) with synthetic data. It also allows us to compare and contrast the effectiveness of the model variables.

In these simulations we guess one word at a time and then compute the estimated weight of each wordlist. We also report separately the number of users compromised after each guess is made. If the scheme is effective, it should be able to approximate the true distribution of the password set. We also expect to find that the “Q-method” of guessing is more effective than a random wordlist choice, and that it is at least as good as simply guessing from the estimated “best” wordlist.
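Putting the pieces together, the simulation loop has the following shape. This is a structural sketch only: `estimate_q` and `choose_guess` stand in for the gradient-descent MLE and the guess-choice methods, and the toy example below plugs in trivial versions of both.

```python
from collections import Counter

def run_bandit(password_set, wordlists, n_guesses, estimate_q, choose_guess):
    """Guess one word at a time, observe N_j, re-estimate the wordlist
    weights, and repeat."""
    counts = Counter(password_set)
    N = len(password_set)
    guessed, rewards = [], []
    q_hat = [1.0 / len(wordlists)] * len(wordlists)   # 'average' initialisation
    for _ in range(n_guesses):
        k = choose_guess(q_hat, wordlists, set(guessed))
        guessed.append(k)
        rewards.append(counts[k])                     # users compromised by guess k
        q_hat = estimate_q(q_hat, guessed, rewards, N, wordlists)
    return guessed, rewards, q_hat

# toy run: one wordlist, a no-op estimator, and greedy most-probable-first choice
wl = {"abc": 0.6, "xyz": 0.4}
greedy = lambda q, wls, seen: max(set(wl) - seen, key=lambda k: wl[k])
no_op = lambda q, *rest: q
guessed, rewards, _ = run_bandit(["abc", "abc", "xyz"], [wl], 2, no_op, greedy)
```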

In each of the plots below we have used the constant-\(\alpha\) method for computing the gradient step size, with the value set to 0.1. We also show the \(q\)-value estimates plot for the best combination of initialisation and guess choice methods. We discuss the alternative methods in supplementary material [16].

Validation Password Set 1: 60% Flirtlife, 30% Hotmail, 10% Computerbits. We begin by creating a password set made up of 10,000 users’ passwords; 60% were selected randomly from the flirtlife.de dataset, 30% from the hotmail.com dataset and 10% from the computerbits.ie dataset.

In Fig. 2(l), we plot the estimated q-values after the gradient descent was completed for each guess. For this graph, the gradient descent was initialised using average \(\hat{q}\)-values, \(\hat{q}_i = 1/3\), and the Q-method was used for guessing. The actual proportions are shown as solid horizontal lines. Even after a small number of guesses we have good predictions for how the password set is distributed between the three wordlists.

In Fig. 2(r), we show the number of users successfully compromised as the number of guesses increases. The successes are the average over fifty runs to reduce the variance in the random guessing method. Results are shown for each combination of initialisation and guessing method. As one might expect, picking guesses from a random wordlist resulted in the lowest success rates. Both the Q-method and guessing from the best wordlist resulted in successes close to the optimal line. After 100 guesses these methods had compromised 795 users, in comparison to the 870 users compromised by guessing the correct password in the correct order for every guess.

Validation Password Set 2: 60% 000webhost, 30% Hotmail, 10% Computerbits. In Fig. 3(l), we show the estimated q-values for a 10,000 user password set made from 000webhost, Hotmail and Computerbits with a 6:3:1 split. Again, we get good estimates for the q-values: 000webhost is accurately weighted as accounting for the largest proportion of the password set. However, in Fig. 3(r), when we guess solely from the best ranked wordlist (dashed lines) we get lower guessing returns than when we randomly choose a wordlist to guess from (dotted lines). While we believe this is a consequence of the 000webhost passwords, unlike the other password sets, being constrained by composition restrictions [17], it does highlight an important distinction between optimising our model for effective guessing and optimising it to best represent the characteristics within the password set.

Fig. 2. Validation password set 1. Left: q-value estimates (initialization: average \(\hat{q}\)-values, guessing: Q-method). Right: guessing returns for validation.

Fig. 3. Validation password set 2. Left: q-value estimates (initialization: average \(\hat{q}\)-values, guessing: Q-method). Right: guessing returns for validation.

It is worth noting that the Q-method of guessing would also be influenced by the high ranking of 000webhost passwords and their low guessing success. However, it still performs slightly better than the random method, and significantly better than guessing from the best wordlist (avg. results over 100 trials).

Validation Password Set 3: 60% Hotmail, 30% Flirtlife, 10% Computerbits. In Fig. 4(l), we show the estimated q-values for a 10,000 user password set made from Hotmail, Flirtlife and Computerbits with a 6:3:1 split. The approximation for the Computerbits wordlist falls slightly below the correct level and the Flirtlife estimate is slightly above. The approximation for the largest component, Hotmail, is accurate. The estimates have mostly converged by guess 10 and there is little divergence after that point.

Fig. 4. Validation password set 3. Left: q-value estimates (initialization: average \(\hat{q}\)-values, guessing: Q-method). Right: guessing returns for validation.

Figure 4(r) shows the guessing success rate for this password set. We see that the Q-method fares better than the random and best wordlist methods. In fact, it is close to the optimal guessing method. By the end of the guessing the Q-method has compromised an average of 1106 users. An optimal strategy at this point would have compromised 1223 users. The best wordlist method compromised an average of 980 users and random compromised an average of 962 over 5 runs. We see little difference in success rates for the different initialisation methods within each guess choice.

Validation Password Set 4: 55% Hotmail, 30% Flirtlife, 10% 000webhost, 5% Computerbits. The final password set we look at is composed of 10,000 users’ passwords from 4 different wordlists. In Fig. 5(l), we display the estimated q-values; the model gives an accurate approximation. Figure 5(r) shows the successes when guessing this password set. Again, we see that the Q-method is effective at guessing, this time performing significantly better than the other guessing methods, with successes close to the optimal. In particular, over the first 20 guesses the Q-method compromised 303 users in comparison to 317 compromised by optimal guessing.

Fig. 5. Validation password set 4. Left: q-value estimates (initialization: average \(\hat{q}\)-values, guessing: best wordlist). Right: guessing returns for validation.

Summary of Results for Validation Password Sets. The multi-armed bandit scheme is able to match characteristics in a password set to characteristics in the wordlists used for guessing. We have seen that for a variety of synthetic examples, guessing using the multi-armed bandit technique can be effective both for compromising users and estimating how the passwords have been chosen.

In all examples we saw that guessing using the Q-method is consistently effective in comparison to other wordlist selection methods. In general, we found that the initialization method had little bearing on the success results. This stems from our concave log-likelihood function, meaning that, for most set-ups, we converge to a single maximum when estimating the distributions.

These initial results demonstrate that the relationship between password choice and user cohorts is tangible and identifiable by automation. We are now motivated to investigate whether the multi-armed bandit can offer efficient guessing returns when guessing a real password set.

4 Password Guessing for Real Password Sets

Given a set of leaked passwords about which we have no a-priori knowledge, it is unlikely that the password set is exactly a weighted combination of datasets that we have already seen. However, we can still assess whether the multi-armed bandit can learn which wordlists to choose guesses from in order to guess efficiently. In this section, we investigate which of our methods for choosing guesses from the wordlists is most effective. In particular, we expect that the Q-method of choosing passwords between wordlists could offer a guessing improvement over both a random choice of wordlist and choosing from the predicted “best” wordlist.

Recall that the Q-method uses the weighting of wordlists and the proportion of each password in those wordlists to decide on the next guess.

For this investigation we used two leaked password sets: the 2009 rockyou.com password leak, which included 32 million plaintext user credentials, and the 2012 yahoo.com Yahoo Voices password leak, which included 453,492 plaintext users’ passwords. These are old password sets that have been well studied in the literature; they therefore offer effective comparison between guessing strategies and allow easy replication of our results. In this paper, we show the results for the Rockyou data, though similar results were seen for Yahoo.

4.1 Rockyou.com Password Set

In this section we describe the guessing of the Rockyou password set. Four wordlists were used: Computerbits, Hotmail, Flirtlife and 000webhost.

Figure 6(l) shows the estimated breakdown of Rockyou between the four wordlists. Hotmail is assigned the highest rating, with 000webhost, Flirtlife and Computerbits falling below it in that order. In terms of the breadth of the audience demographic of each wordlist, this assessment seems logical: the nationality-specific websites computerbits.ie and flirtlife.de fall lowest, and 000webhost.com, which enforces composition restrictions, fares slightly worse than hotmail.com.

Fig. 6. Rockyou. Left: q-value estimates (initialisation: average \(\hat{q}\)-values, guessing: Q-method). Right: guessing returns for rockyou.com.

Figure 6(r) shows the guessing successes for the Rockyou password set guessed using the four wordlists. There is little differentiation between the initialisation methods. However, the guess choice method significantly impacts the number of successes. The optimum number of successes for 100 guesses against the Rockyou password set is 1,483,668 users compromised (100% of optimum, 4.55% of total users). The Q-method compromised an average of 945,371 users (64% optimum, 2.9% total). Choosing from the estimated best wordlist compromised 846,772 users (57% optimum, 2.6% total) on average, and choosing a random wordlist resulted in the lowest number of average successes at 781,164 (53% optimum, 2.4% total). We can see the Q-method performs better than the next best method by compromising just under 100,000 more users.
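The “optimum” baseline used above has a simple definition: an oracle that already knows the target set guesses its most common passwords in descending frequency order, so the optimal n-guess return is the sum of the top-n frequencies. A minimal sketch:

```python
from collections import Counter

def optimal_successes(password_set, n_guesses):
    """Optimal n-guess return: an oracle guessing the n most common
    passwords in the target set compromises this many accounts."""
    counts = Counter(password_set)
    return sum(c for _, c in counts.most_common(n_guesses))
```

All percentages of optimum quoted in this section are measured against this oracle.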

Comparison to Single Wordlist Guessing. Here we investigate the value of using a multi-armed bandit over simply guessing using each wordlist individually. In Fig. 7, we compare using the Q-method (solid purple line) to guessing using each wordlist separately. The Q-method performs well, compromising 945,371 users (64% optimum, 2.9% total) in comparison to 804,731 (54% optimum, 2.5% total), 703,041 (47% optimum, 2.2% total), 603,783 (41% optimum, 1.9% total) and 64,024 (4.3% optimum, 0.2% total) from Flirtlife, Hotmail, Computerbits and 000webhost respectively.

Fig. 7. Single wordlist guessing returns for rockyou.com versus the multi-armed bandit.

Compare the ordering in Fig. 6(l) to that in Fig. 7. Notice that the weightings assigned to the wordlists do not necessarily correspond to better guessing results when the wordlists are used individually. This is because our multi-armed bandit was designed with the goal of matching characteristics, not optimising guessing returns. It rewards a wordlist when the password set does not contain a password and the wordlist also does not contain it. This indicates that the wordlist is a good match, not necessarily that it will be better at guessing. If optimising guessing returns is the goal, then we suggest adapting our model to weight successes more heavily than failures.
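The reward structure just described can be sketched as follows. This is an illustrative simplification, not the paper's exact reward values; the `success_weight` parameter is our hypothetical knob for shifting the bandit from characteristic-matching towards optimising guessing returns:

```python
def match_reward(in_password_set, in_wordlist, success_weight=1.0):
    """Reward a wordlist for *matching* the target set, not only for
    guessing successes: agreement on a password's presence or on its
    absence both earn reward. Raising success_weight favours wordlists
    that actually compromise accounts."""
    if in_password_set and in_wordlist:
        return success_weight   # both contain the password: a guessing success
    if not in_password_set and not in_wordlist:
        return 1.0              # both lack it: still evidence of a good match
    return 0.0                  # disagreement: no reward
```

With `success_weight = 1.0` the bandit treats matched absences and matched presences equally, which explains why a highly weighted wordlist need not be the best individual guesser.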

5 Identifying Demographics

Though real password data may not be simple combinations of previously seen wordlists, it is possible that the weights estimated by our scheme reveal information about the demographics of the users in the dataset. It is well known that users’ demographics, such as nationality and language, play an important role in their password choices [1, 6, 13]. Indeed, this is information that human password guessers might look for when determining their guessing strategies. Specifically, we will consider whether, given a password set formed by users predominantly from a single nationality, the multi-armed bandit can recognise which wordlist best matches this locality, and whether using passwords generated by other users from that same nationality improves guessing.

5.1 Matching Nationality Characteristics

We test the ability of our scheme to identify national characteristics using two password sets and two nation-specific wordlists. The password sets were chosen from Irish and German users. Irish users are mainly English-speaking, and both English and German are Indo-European languages using the Latin alphabet. Our challenge is to see whether we can identify the nationality of a password set by linking it to a nation-specific wordlist, in preference to international wordlists. Clearly, a simple method could tell the difference between, say, Chinese and English passwords. We are interested in the more challenging setting of distinguishing Irish users’ passwords from English users’ passwords when the spoken language is the same, or English from German passwords when both use the Latin alphabet. In this section, we will show that our learning methods are able to identify these subtle distinctions.

The two password sets we will try guessing are the computerbits.ie password set and the flirtlife.de password set. Computerbits.ie is made up of 1785 Irish users. Flirtlife.de is made up of 98,912 predominantly German and Turkish users. The two wordlists were drawn from the large set of 31 password leak datasets known as Collection #1 [10]. One of these password sets was selected, and from it we extracted all the passwords whose corresponding email address contained the country-code top-level domain “.ie” and, separately, “.de”. These formed our nationality-specific user wordlists from Ireland and Germany, with 90,583 and 6,541,691 users respectively.
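The extraction step can be sketched as below, assuming credential lines of the form `email:password` (a common layout in such dumps; real leak data needs far more cleaning than this, and the function name is ours):

```python
def extract_national_wordlist(credential_lines, cctld):
    """From 'email:password' lines, keep passwords whose email domain
    ends in the given country-code TLD, e.g. '.ie' or '.de'."""
    passwords = []
    for line in credential_lines:
        if ":" not in line:
            continue                         # skip malformed lines
        email, password = line.split(":", 1) # password may itself contain ':'
        domain = email.rsplit("@", 1)[-1].lower()
        if domain.endswith(cctld):
            passwords.append(password)
    return passwords
```

Ranking the extracted passwords by frequency then yields the nation-specific wordlist used in the experiments below.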

Irish Passwords. We are interested in whether the multi-armed bandit will match the distribution of the Irish password set computerbits.ie to the extrapolated Irish wordlist taken from the subset of Collection #1 (denoted “Irish users” from now on). To test this we ran the multi-armed bandit set-up as per the optimal parameters found in Sect. 3.3.

In Fig. 8(l), we included three wordlists, the hotmail.com leaked passwords, the flirtlife.de password set and the Irish users. Hotmail.com is an international website. However, it is suspected that the Hotmail users in the dataset we have were compromised by means of phishing scams aimed at the Latino community. Flirtlife is a dating site with predominantly German and Turkish users. Figure 8(l) plots the breakdown estimated by the multi-armed bandit. From the first guess it estimates that the passwords in the computerbits.ie set match closely to the passwords chosen by the Irish users. Notice that some weighting is assigned to the Hotmail wordlist but essentially none to the flirtlife.de password set.

Fig. 8. q-value estimates for the Irish password set from computerbits.ie. Left: estimated using three wordlists. Right: estimated using four wordlists.

In Fig. 8(r), we included four wordlists. The additional wordlist is the Rockyou.com password set leaked in 2009. It includes 32 million users’ passwords and had an international audience. The language used on Rockyou applications was English. Given that the common spoken language in Ireland is English and that the Rockyou password set is often used as an effective base to seed guessing, we expect the Rockyou users to be somewhat representative of the Irish Computerbits users.

Figure 8(r) shows the estimated breakdown for the computerbits.ie passwords. In the beginning, Rockyou is assigned a weighting nearly as high as the Irish users. However, the value of Rockyou declines as the number of guesses increases. In all combinations of initialisation and guess methods, the multi-armed bandit was able to identify that the Computerbits passwords most closely matched the Irish subset of users.

German Passwords. We now try to guess the flirtlife.de password set using the wordlist of German users. While flirtlife.de is a German dating site, its main users were both German and Turkish. Therefore, we expect a weaker link between the German wordlist and the flirtlife.de password set than there was between computerbits.ie and the Irish wordlist.

Figure 9 plots the flirtlife.de weightings for three wordlists: German users, Rockyou.com and hotmail.com. Up to 50 guesses, most weighting is assigned to the Rockyou wordlist. However, after 50 guesses, the German users’ passwords overtake Rockyou and remain slightly ahead up to at least guess 200.

The multi-armed bandit was still able to identify that the Flirtlife passwords best matched the German users wordlist. However, this only takes effect once the high-frequency passwords, roughly the first 50, have been guessed. Rockyou is a large password set and will generally give a good indication of passwords chosen by a general population [17]. Because Flirtlife is made up of users from two nationalities and languages, German and Turkish, it is possible that the value a solely German wordlist offers is not enough to counteract the general guessing strength of the Rockyou wordlist.

Fig. 9. q-value estimates for the German password set from flirtlife.de, estimated using three wordlists.

5.2 Password Nationality to Inform Guessing

We saw that the multi-armed bandit can link a password set to a wordlist based on characteristics within the passwords and reveal the nationality of the users. Does using passwords generated by other users of the same nationality improve guessing?

Irish Users. In Fig. 10(l), we guess the passwords in the Irish computerbits.ie password set. The black line shows the returns for an optimal first 100 guesses. We also guess using the order and passwords from the full Collection #1 password set from which the Irish users and German users were drawn; we label this full dataset “all users”. We made 100 guesses against the 1795 users in the Computerbits password set, choosing the top 100 most popular words in order from each wordlist. The wordlist composed only of Irish users performed better at guessing than the wordlist containing all users’ passwords. We also include the guessing success of our multi-armed bandit model; it performs as well as guessing using the Irish users wordlist.

Fig. 10. Successes when guessing with a full wordlist of passwords versus only those passwords belonging to users of the same nationality. Left: guessing the Irish password set computerbits.ie. Right: guessing the German password set flirtlife.de.

German Users. In Fig. 10(r), we guess the Flirtlife password set using two wordlists, as above. We can see that using just the German users’ passwords, ranked in order, is more effective than using all users’ passwords. The multi-armed bandit model performs better than simply using the distribution of all users’ passwords to rank and order guesses.

6 Discussion and Future Work

For synthetic password sets, our multi-armed bandit was able to identify the wordlist that best linked to the password set it was guessing. This identification was often achieved within the first 10 guesses. We also saw that the scheme could automatically identify the sort of demographic information, such as nationality, that a password cracker would use to identify suitable wordlists, suggesting that the multi-armed bandit has the potential to be as good as a human at wordlist selection.

We see at least three potential offensive use-cases for such a guessing model.

1. The first approach is the most direct and utilises the real-time convergence of the MAB. An online guesser, guessing a selection of users’ passwords from a website, will learn from each success and use it to inform the next guess made against all users.
2. An attacker could gather information by applying the multi-armed bandit to an offline leaked dataset of users from a given organisation. They could use the MAB to determine the optimum choice of wordlist and then carry out a tailored attack on other users from the same organisation. This has the potential to be effective, as passwords created by users of the same organisation can significantly improve guessing returns [17].
3. The same approach could be used in an online attack where a selection of users is used to learn characteristics, guessing until those accounts are locked. Once the MAB has highlighted the appropriate wordlist, it can be used against other users, avoiding triggering lockout on potentially more valuable users.
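The first, online use-case can be sketched as a simple bandit loop. Here we substitute a plain epsilon-greedy arm selection for our Q-method and use raw success counts as the reward, so this is a simplified stand-in rather than the scheme evaluated above; all names are illustrative:

```python
import random

def online_guessing(target_counts, wordlists, n_guesses, eps=0.1, seed=0):
    """Epsilon-greedy sketch of online guessing: each round, pick a
    wordlist, guess its most probable remaining password against every
    account, and update that arm's value from the observed successes.
    target_counts: dict password -> number of accounts using it."""
    rng = random.Random(seed)
    n_arms = len(wordlists)
    value, pulls = [0.0] * n_arms, [0] * n_arms
    guessed, compromised = set(), 0
    for _ in range(n_guesses):
        if rng.random() < eps:
            k = rng.randrange(n_arms)                      # explore
        else:
            k = max(range(n_arms), key=lambda i: value[i])  # exploit
        remaining = [(p, w) for w, p in wordlists[k].items() if w not in guessed]
        if not remaining:
            continue
        _, guess = max(remaining)            # most probable unguessed word
        guessed.add(guess)
        reward = target_counts.get(guess, 0) # accounts compromised this round
        compromised += reward
        pulls[k] += 1
        value[k] += (reward - value[k]) / pulls[k]  # incremental mean update
    return compromised
```

Each success immediately raises the chosen arm's value, so later guesses concentrate on the wordlist that has been paying off, which is the real-time convergence the first use-case relies on.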

This guessing model provides evidence for the importance of guiding users away from passwords which reflect their nationality or the website. It also demonstrates that passwords differ measurably depending on the site they were created for. This indicates that websites should consider blocklisting passwords in a way that is tailored to their particular subject matter and users. In particular, websites that have experienced previous password leaks could restrict future users from choosing passwords which occurred with high frequency in that leak.

We believe there is potential to further apply and expand this work. For example, the multi-armed bandit might be used to identify the password policies enforced by matching a password set to datasets created under different composition policies. The multi-armed bandit could also be extended so that it does not depend on the exact probabilities of the words in a wordlist, so that it would work with guesses seeded from other sources, say based on users’ personal information. Finally, it would be interesting to evaluate the system with a larger number of wordlists to assess how it scales.

7 Conclusion

Our multi-armed bandit model has proven effective in learning which wordlists best represent the composition of a set of passwords. It was therefore able to identify which wordlist would provide the most effective password guesses. It also learns features of a password set, offering insight into the composition rules enforced, the website a leak originated from, the nationality of the users, or other characteristic information. The scheme demonstrates that this information gathering can now be done in an automated way and no longer requires a “human-in-the-loop”.