1 Introduction

Learning complex models by combining several simpler models is at the heart of ensemble methods. The Mixture-of-Experts (MoE) paradigm provides an attractive and principled probabilistic framework for constructing such ensembles by combining simple “experts” via probabilistic mixture models (Yuksel et al. 2012; Masoudnia and Ebrahimpour 2014). Each expert in the MoE handles a small region of the data space, with the data-to-expert assignment controlled by a gating network. The gating network essentially learns a flat/hierarchical partitioning of the input space. The MoE paradigm has been successfully used to design nonlinear models by combining simpler linear expert models. Note however that the MoE setting does not restrict the experts to be linear models, e.g., each expert can be a Gaussian Process (Meeds and Osindero 2006).

As compared to the other dominant paradigm for nonlinear learning, i.e., kernel methods (Shawe-Taylor and Cristianini 2004), the MoE paradigm is often more appealing for several reasons: (1) MoE has an inherently probabilistic formulation which enables a probabilistic/fully Bayesian treatment; (2) MoE based methods are faster at training and test time, while also being efficient in terms of space, since they do not have to deal with kernel matrices; and (3) MoE based methods provide nice interpretability—for example, a nonlinear classifier learned by a MoE can be thought of as a combination of several linear classifiers. Finally, MoE has also been gaining considerable attention lately in the design of very large neural networks (Shazeer et al. 2017) due to MoE’s inherent property of conditional computation, i.e., only a part of the model is active for a given input.

Despite these appealing properties, MoE architectures are known to be considerably difficult to train as they usually lack closed-form parameter updates. Note that learning a MoE model requires learning the parameters of each expert in the mixture, as well as the parameters that define the gating network. A typical approach to learning MoE models is to formulate them as latent variable models and use the Expectation–Maximization (EM) algorithm (Jordan and Jacobs 1994). In the EM algorithm for MoE, the latent variables denoting the input-to-expert (soft) assignments are inferred in the E step, and the parameters defining the experts and the gating network are updated in the M step. However, except for some special cases, the M step updates for MoE usually do not have a closed form and require iterative double-loop procedures or inner-loop iteratively reweighted least squares solvers, which can exhibit slow convergence. The lack of closed-form parameter updates in MoE models is usually due to (1) the choice of gating network, which is typically modeled by a softmax (if we want a flat partitioning of the inputs) or by a hierarchy of logistic models (if we want a hierarchical partitioning, e.g., in hierarchical MoE models; Jordan and Jacobs 1994; Bishop and Svensén 2002); and (2) the choice of the expert models which, for classification problems, are usually chosen to be logistic regression models (again lacking closed-form parameter updates).

In this work, we present a probabilistic MoE framework which addresses the above issues in a principled manner. Our contributions can be summarized as follows:

  • We propose a rich suite of gating networks with appealing properties, for both flat and hierarchical MoE (Bishop and Svensén 2002). More importantly, the proposed inference algorithms lead to efficient, closed-form parameter updates using variable augmentation techniques. While there do exist some inference methods for flat MoEs that have closed-form updates (Yuksel et al. 2012), they generally come at the cost of learning a huge number of parameters. To the best of our knowledge, this is the first work that offers closed-form parameter updates for hierarchical MoE.

  • One of our gating networks (logistic stick-breaking process based gating) can also learn the number of experts using a logistic stick-breaking prior (Ren et al. 2011), while still having simple closed form updates for the parameters, thanks to the variable augmentation scheme we employ.

  • Our MoE framework leverages Bayesian linear support vector machines (SVMs) (Polson et al. 2011) as constituent experts. Bayesian SVMs as experts are appealing for two reasons: (1) the model naturally enjoys the large-margin properties of SVMs; and (2) variable augmentation techniques can also be exploited to derive closed-form updates for the Bayesian linear SVM parameters. Crucially, the MoE framework with Bayesian linear SVMs results in a nonlinear Bayesian SVM without employing the recently proposed kernelized extensions of the Bayesian SVM (Henao et al. 2014), which are difficult to do inference on and are slow at training and test time. Our probabilistic framework can also be a viable alternative to Gaussian Process (GP) based nonlinear classification, which usually lacks conjugacy and is hard to do inference on (Nickisch and Rasmussen 2008).

  • We conduct extensive experiments to show that the use of closed-form updates leads to much faster training times compared to existing MoE frameworks, whereas the use of Bayesian SVMs as experts leads to significant improvements in performance over competing methods for nonlinear large-margin models (Henao et al. 2014; Cotter et al. 2013; Wang and Zhu 2014; Zhu et al. 2011) as well as strong MoE baselines (Zhou 2016).

Our work lays out a novel and principled foundation for building MoE models, and it can also be integrated into other, more sophisticated models that require MoE as a building block, e.g., recent work on conditional computation in deep neural networks (Shazeer et al. 2017). Our inference algorithms, with clean, closed-form updates, are considerably simpler than existing inference methods for MoE.

2 Background

We first introduce some notation and provide a brief overview of the Mixture of Experts (Yuksel et al. 2012; Masoudnia and Ebrahimpour 2014) and Bayesian SVMs (Polson et al. 2011) which we use as experts in our proposed framework. We will focus on binary classification in which we assume we are given training data \(\mathcal {D} = \{\varvec{x}_i, y_i\}_{i=1}^N\) where \(\varvec{x}_i \in \mathbb {R}^d\) and \(y_i \in \{-1, +1\}\) and the goal is to learn a classifier that can predict the label \(y_* \in \{-1,+1\}\) for a new input \(\varvec{x}_*\). We will denote the feature matrix by \({\mathbf{X}}\in \mathbb {R}^{N \times d}\) and ground truth labels by \(\varvec{y} \in \{-1, +1\}^N\). Although we present our framework focusing on binary classification, it can be extended naturally to multi-class classification or regression problems.

2.1 Mixture of experts

A Mixture-of-Experts (MoE) model is defined by a set of K experts, each of which is a probabilistic supervised learning model \(\{p(y|\varvec{x},\varvec{w}_k)\}_{k=1} ^K\) parametrized by weights \({\mathbf{W}}= [\varvec{w}_1,\ldots ,\varvec{w}_K]\), with \(\varvec{w}_k \in \mathbb {R}^d\). In addition, the framework is defined by a gating network parametrized by \({\mathbf{V}}= [\varvec{v}_1,\ldots ,\varvec{v}_K]\), which defines the probabilities of assignment of an input \(\varvec{x}_n\) to each of the K experts. For example, the probability of input \(\varvec{x}_n\) being assigned to expert k is defined by a function \(\pi _k(\varvec{x}_n)\) where \(\pi _k(\cdot )\) is defined in terms of the parameters \({\mathbf{V}}\) of the gating network. Note that \(\sum _{k=1}^K \pi _k(\varvec{x}_n) = 1\). The MoE defines the conditional probability of \(y_n\) given \(\varvec{x}_n\) as

$$\begin{aligned} p(y_n|\varvec{x}_n,{\mathbf{W}},{\mathbf{V}}) = \sum _{k=1}^K \pi _k(\varvec{x}_n)\cdot p(y_n|\varvec{x}_n,\varvec{w}_k) \end{aligned}$$
(1)

Different versions of MoE primarily differ in the choice of the experts and the gating network (Yuksel et al. 2012; Masoudnia and Ebrahimpour 2014 provide excellent overviews). The gating networks can learn a flat or a hierarchical partitioning of the input space (the latter being the case in hierarchical mixtures of experts; Bishop and Svensén 2002). It is this plug-and-play nature that provides considerable flexibility in designing MoE models. However, learning the gating network parameters also poses difficulties in MoE models, where lack of conjugacy in the typically used softmax/logistic gating usually prevents one from obtaining closed-form updates for the gating network parameters (and one has to rely on expensive procedures such as inner-loop iteratively reweighted least squares algorithms). In this work, we propose a suite of gating networks, along with efficient inference algorithms based on variable augmentation schemes, which allow us to derive closed-form updates for the gating network parameters.
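
To make Eq. 1 concrete, the following minimal sketch (in Python/NumPy; the function and variable names are ours, not from any released implementation) computes the MoE predictive probability for a batch of inputs, assuming simple sigmoidal experts and gating probabilities supplied by some gating network:

import numpy as np

def moe_predict_proba(X, W, gate_probs):
    """Mixture-of-experts predictive probability (Eq. 1).

    X          : (N, d) inputs
    W          : (d, K) expert weight vectors, one column per expert
    gate_probs : (N, K) gating probabilities pi_k(x_n); each row sums to 1
    """
    # p(y=+1 | x_n, w_k) for every input/expert pair; here each expert is a
    # simple sigmoidal classifier, but any probabilistic expert can be plugged in
    expert_probs = 1.0 / (1.0 + np.exp(-X @ W))
    # sum_k pi_k(x_n) * p(y=+1 | x_n, w_k)
    return np.sum(gate_probs * expert_probs, axis=1)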

2.2 Bayesian support vector machines

We use a probabilistic formulation of support vector machines—the Bayesian SVM (Polson et al. 2011)—as experts in our proposed MoE framework. As we will show, this choice leads to a simple and seamless inference procedure for all the parameters of our MoE framework using latent variable augmentation strategies. At the same time, our framework also enjoys the large-margin properties of SVMs.

The Bayesian SVM is based on representing the hinge-loss objective of the non-probabilistic SVM as a scale mixture of Gaussians (Polson et al. 2011):

$$\begin{aligned} \begin{aligned} \exp (-2\max (0,1-y_i\varvec{w}^T\varvec{x_i})) = \int _{0}^{\infty }\frac{1}{\sqrt{2\pi \gamma _i}}\cdot \exp \left( -\frac{(1+\gamma _i-y_i\varvec{w} ^\top \varvec{x_i})^2}{2\gamma _i}\right) d\gamma _i\\ \end{aligned} \end{aligned}$$
(2)

Equation 2 enables writing the exponentiated negative hinge loss \(\max (0,1-y_i\varvec{w}^T\varvec{x_i})\) on each example \((\varvec{x}_i, y_i)\) as a Gaussian likelihood when conditioned on an auxiliary latent variable \(\gamma _i\) (whose reciprocal has an inverse-Gaussian conditional distribution). As shown in Polson et al. (2011), the likelihood, conditioned on \(\gamma _i\), is of the form

$$\begin{aligned} p(y_i, \gamma _i|\varvec{w},\varvec{x}_i) = \mathcal {N}(1-y_i\varvec{w}^\top \varvec{x}_i|-\gamma _i,\gamma _i) \end{aligned}$$
(3)

This is attractive since the Gaussian likelihood is conjugate with Gaussian priors or scale-mixture of Gaussian priors on the weight vector \(\varvec{w}\), which leads to simple, closed form posterior updates.
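
As a quick sanity check, the identity in Eq. 2 can be verified numerically. The sketch below (our own illustration, using SciPy's quad integrator) compares the two sides of the identity for an arbitrary margin value \(m = y_i\varvec{w}^\top \varvec{x}_i\):

import numpy as np
from scipy.integrate import quad

def check_hinge_mixture_identity(margin):
    """Compare the two sides of Eq. 2 for a given margin m = y_i w^T x_i."""
    lhs = np.exp(-2.0 * max(0.0, 1.0 - margin))
    integrand = lambda g: np.exp(-(1.0 + g - margin) ** 2 / (2.0 * g)) / np.sqrt(2.0 * np.pi * g)
    rhs, _ = quad(integrand, 0.0, np.inf)
    return lhs, rhs

print(check_hinge_mixture_identity(0.3))  # the two values agree up to quadrature error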

Note that although the Bayesian SVM formulation proposed in Polson et al. (2011) applies to linear SVMs, our MoE based formulation naturally allows it to be used for solving nonlinear classification problems without having to rely on kernel extensions, which are usually much more difficult to do inference on and are computationally very expensive (Henao et al. 2014) since they work with kernel matrices.

There have been other works that tackle this scalability problem via kernel approximations, such as Williams and Seeger (2001) and Rahimi and Recht (2008). While such approximations can, to some extent, alleviate the scalability issue, our MoE based framework offers several other benefits beyond scalability, such as better interpretability and improved inference algorithms. The proposed MoE framework can also be easily used with kernel-based experts (such as Gaussian Processes), and these kernel-approximation techniques can further help in scaling up the model.

3 A probabilistic framework for large-margin mixture of experts

Our Mixture-of-Experts (MoE) architecture leverages Bayesian SVM as its constituent experts. Assuming the number of experts to be K (which, as we will show in Sect. 4.2, can also be learned from data), our MoE formulation uses K Bayesian SVMs defined by the set of weight vectors \({\mathbf{W}}= [\varvec{w}_1,\ldots ,\varvec{w}_K]\) where \(\varvec{w}_k\) denotes the weight vector of the kth Bayesian SVM expert.

Each input \(\varvec{x}_i\) is associated with a latent variable \(\varvec{z}_i \in \{1,\ldots ,K\}\) indicating the expert it is assigned to; we use \(z_{ik}=1\) to denote \(\varvec{z}_i=k\), i.e., that input \(\varvec{x}_i\) has been assigned to expert k. In addition, \(\gamma _{ik}\) denotes the Bayesian SVM augmentation variable (Sect. 2.2) for example i under expert k. The expectations of the latent variables \(z_{ik}\) and \(\gamma _{ik}\), given the current estimates \(\hat{{\mathbf{W}}}\) and \(\hat{{\mathbf{V}}}\) of the MoE model parameters, can be computed in closed form

$$\begin{aligned} \eta _{ik}= & {} \mathbb {E}[z_{ik}] \propto p(y_i | \hat{\varvec{w}}_k, \varvec{x}_i,\gamma _{ik}) p(z_{ik} = 1 | \varvec{x}_i, \hat{{\mathbf{V}}})\nonumber \\ \tau _{ik}= & {} \mathbb {E}[\gamma _{ik}^{-1}] = |1 - y_i\hat{\varvec{w}}_k^\top \varvec{x}_i|^{-1} \end{aligned}$$
(4)

The form of \(p(z_{ik} = 1 | \varvec{x}_i, \hat{{\mathbf{V}}})\) depends on the form of the gating network, discussed in Sects. 4.1–4.4.

We will use an Expectation–Maximization (EM) (Dempster et al. 1977) algorithm to learn the latent variables and the parameters. The E step will compute the expectation of the latent variables as defined in Eq. 4 and the M step will compute the point estimate of the MoE model parameters \({\mathbf{W}}\) and \({\mathbf{V}}\).

Denoting by \(\mathbf{Z}= [\varvec{z}_1,\ldots ,\varvec{z}_N]\) the input-to-expert assignment latent variables and \(\varvec{\gamma } = \{\gamma _{ik}\}_{i=1,k=1}^{N,K}\), the complete data log-likelihood of the model, as required to derive the EM algorithm, is

$$\begin{aligned} \begin{gathered} \mathcal {L}(\varvec{y}, \mathbf{Z}, \varvec{\gamma } | {\mathbf{X}}, {\mathbf{W}}, {\mathbf{V}}) = \sum _{i=1}^N \sum _{k=1}^K z_{ik}(\log {p(y_i, \gamma _{ik}| \varvec{x}_i, \varvec{w}_k)} + \log {p(z_{ik} = 1 | \varvec{x}_i, {\mathbf{V}})}) \end{gathered} \end{aligned}$$
(5)

For brevity, we have ignored the prior terms on \({\mathbf{W}}\) and \({\mathbf{V}}\). We use zero-mean Gaussian priors on the Bayesian SVM weight vectors \({\mathbf{W}}= [\varvec{w}_1, \ldots ,\varvec{w}_K]\). The priors on the gating network parameters \({\mathbf{V}}\) depend on the choice of the gating network. There may also be additional variable augmentations depending on the gating network (which we discuss next), which require modifying the gating network probability in Eq. 5 accordingly.

The EM algorithm alternates between computing the expectations of the latent variables using Eq. 4 and using these expectations to maximize the complete data log-likelihood given by Eq. 5 to find point estimates of the model parameters \({\mathbf{W}}\) and \({\mathbf{V}}\). Assuming Gaussian priors \(\mathcal {N}(\mathbf {0},\lambda ^{-1}\mathbf{I})\) on the Bayesian SVM weight vectors \(\varvec{w}_k\), which act as regularization for our SVM classifiers, the Gaussian likelihood in Eq. 3 leads to the following maximum-a-posteriori (MAP) estimate

$$\begin{aligned} \begin{aligned} \hat{\varvec{w}}_k = \left( \lambda \mathbf{I}+ \sum _{i=1}^N\eta _{ik}\tau _{ik}\varvec{x}_i\varvec{x}_i^\top \right) ^{-1}\left( \sum _{i=1}^Ny_i(\tau _{ik} + 1)\eta _{ik}\varvec{x}_i\right) \end{aligned} \end{aligned}$$
(6)

As can be seen from Eq. 6, the updates for the expert models resemble those in Polson et al. (2011), except for the presence of \(\eta _{ik}\). The presence of \(\eta _{ik}\) reweighs the examples for each expert, thus ensuring that each expert models only a subregion of the input space. The estimates of \(\eta _{ik}\) are computed using the gating network in the E step of the EM algorithm. Thus, the region of the input space that an expert models depends on the structure of the gating network, that is, on how we define \(\pi _k(\varvec{x}_i) = p(z_{ik} = 1 | \varvec{x}_i, {\mathbf{V}})\). In the next section, we discuss several such gating networks. Note, however, that the MAP estimate for the experts is agnostic to the structure of the gating network and is always given by Eq. 6.
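
For concreteness, a minimal sketch of the resulting E and M steps for the experts (Eqs. 4 and 6) is given below. Here gate_probs stands for \(p(z_{ik}=1|\varvec{x}_i,\hat{{\mathbf{V}}})\) produced by whichever gating network is used, and all names are illustrative; for the expert likelihood in the E step we use the marginal hinge pseudo-likelihood of Eq. 2, one simple choice consistent with Eq. 4:

import numpy as np

def e_step_experts(X, y, W_hat, gate_probs, eps=1e-8):
    """E step (Eq. 4): responsibilities eta_ik and tau_ik = E[gamma_ik^{-1}]."""
    margins = y[:, None] * (X @ W_hat)                    # (N, K): y_i w_k^T x_i
    tau = 1.0 / (np.abs(1.0 - margins) + eps)
    # expert (pseudo-)likelihood: exp(-2 max(0, 1 - y_i w_k^T x_i)), cf. Eq. 2
    pseudo_lik = np.exp(-2.0 * np.maximum(0.0, 1.0 - margins))
    eta = gate_probs * pseudo_lik
    eta /= eta.sum(axis=1, keepdims=True) + eps           # normalize over experts
    return eta, tau

def m_step_expert(X, y, eta_k, tau_k, lam=1.0):
    """M step (Eq. 6): MAP estimate of one expert's weight vector w_k."""
    d = X.shape[1]
    weights = eta_k * tau_k                               # (N,)
    A = lam * np.eye(d) + (X * weights[:, None]).T @ X    # lambda*I + sum_i eta*tau*x x^T
    b = X.T @ (y * eta_k * (tau_k + 1.0))                 # sum_i y_i (tau+1) eta x_i
    return np.linalg.solve(A, b)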

4 Gating networks

As discussed above, the choice of the gating network can have a major impact on the final model that is learnt. In this work, we propose three separate choices for flat gating networks: (1) a generative gating network (Sect. 4.1); (2) a Pólya-gamma based softmax gating network (Sect. 4.2); and (3) a logistic stick-breaking process (LSBP) prior (Ren et al. 2011) based gating network (Sect. 4.3). We also discuss variable augmentation to learn hierarchical Mixtures-of-Experts (Sect. 4.4). Further, we show that all our gating networks admit closed-form parameter updates, which is particularly appealing from the point of view of computational efficiency.

4.1 Generative gating

The Generative Gating (GG) network, as the name suggests, models the input-to-expert assignment generatively, in the manner of a Gaussian mixture model over the inputs, as shown below:

$$\begin{aligned} \begin{aligned} \pi _k(\varvec{x}_i) = p(z_{ik}=1 | \varvec{x}_i,\alpha _k, \varvec{\mu }_k, \varvec{\varSigma }_k) \propto \alpha _k \mathcal {N}(\varvec{x}_i | \varvec{\mu }_k, \varvec{\varSigma }_k) \end{aligned} \end{aligned}$$
(7)

Here \(\alpha _k\) represents the prior probability of expert k, and \(\varvec{\mu }_k, \varvec{\varSigma }_k\) denote the parameters of the Gaussian likelihood of \(\varvec{x}_i\) being assigned to expert k. Thus, the parameters of the gating network are given by \({\mathbf{V}}=\{\alpha _k, \varvec{\mu }_k, {\varvec{\varSigma }}_k\}_{k=1}^K\). This gating network seamlessly embeds within the EM algorithm, and the gating network parameters have closed-form updates (similar to generative mixture models for data). The Gaussian distributions here can be replaced by other distributions, depending on the type of the inputs.

Using the input-to-expert assignment probabilities \(\pi _k(\varvec{x}_i)\) defined above, we can compute the expectation \(\eta _{ik}\) in (4). The estimates of the parameters in \({\mathbf{V}}\) can be computed by maximizing the expected complete data log-likelihood, under the additional constraint that \(\sum _ {k=1}^K\alpha _k = 1\) (which yields \(\alpha _k \propto \sum _{i=1}^N \eta _{ik}\)). The updates for the remaining gating network parameters are as follows:

$$\begin{aligned} {\varvec{\mu }}_k= & {} \frac{\sum _{i=1}^N \eta _{ik}\varvec{x}_i }{ \sum _{i=1}^N \eta _{ik}} \end{aligned}$$
(8)
$$\begin{aligned} {\varvec{\varSigma }}_k= & {} \frac{\sum _{i=1}^N \eta _{ik}(\varvec{x}_i - {\varvec{\mu }}_k)(\varvec{x}_i - {\varvec{\mu }}_k)^T}{ \sum _{i=1}^N \eta _{ik}} \end{aligned}$$
(9)

From a practical perspective, we prefer to approximate \({\varvec{\varSigma }}_k\) by a diagonal covariance matrix rather than the full covariance matrix (as implied by the expression above), which is computationally more efficient (the number of parameters to be estimated is linear in the input dimensionality, as opposed to quadratic for a full covariance matrix). At the same time, we found the empirical performance of the diagonal covariance approximation to be comparable to that of a full covariance matrix. It is pertinent to mention that generative gating, the simplest of our gating networks, and similar gating networks have been used in other works as well (Meeds and Osindero 2006).
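
A sketch of the generative-gating computations (Eqs. 7–9), using the diagonal-covariance approximation described above, is given below (illustrative names, not a released implementation); the mixing proportions \(\alpha _k\) are updated as the normalized responsibility mass per expert, as noted earlier:

import numpy as np

def gg_gate_probs(X, alphas, means, variances, eps=1e-8):
    """Eq. 7 with diagonal covariances: pi_k(x_i) propto alpha_k N(x_i | mu_k, diag(var_k))."""
    diff = X[:, None, :] - means[None, :, :]                      # (N, K, d)
    log_lik = -0.5 * (np.log(2.0 * np.pi * variances[None, :, :]).sum(axis=2)
                      + (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_post = np.log(alphas + eps)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)               # numerical stability
    probs = np.exp(log_post)
    return probs / probs.sum(axis=1, keepdims=True)

def gg_m_step(X, eta, eps=1e-8):
    """Eqs. 8-9 (diagonal version) plus the mixing-proportion update."""
    Nk = eta.sum(axis=0)                                          # effective counts per expert
    alphas = Nk / Nk.sum()
    means = (eta.T @ X) / (Nk[:, None] + eps)
    sq_dev = (X[:, None, :] - means[None, :, :]) ** 2             # (N, K, d)
    variances = (eta[:, :, None] * sq_dev).sum(axis=0) / (Nk[:, None] + eps) + eps
    return alphas, means, variances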

4.2 Pólya-gamma based softmax gating

Softmax gating is the most commonly used gating network for MoEs, dating back to the first work on MoEs (Jacobs et al. 1991) and also appearing in recent works such as Shazeer et al. (2017). However, these prior usages of softmax gating did not have closed-form parameter updates and therefore resorted to iterative optimization. This makes our proposed Pólya-gamma based softmax gating construction an appealing alternative.

We overcome the non-conjugacy of softmax gating through a latent variable augmentation scheme based on Pólya-gamma augmentation (SG-PG) (Polson et al. 2013) which leverages the following identity:

$$\begin{aligned} \frac{(e^\psi )^a}{(1 + e^\psi )^b} = 2^{-b}e^{(a - \frac{b}{2})\psi }\int _0^{\infty }e^{-\beta \psi ^2/2}p(\beta )d\beta \end{aligned}$$
(10)

where \(p(\beta )\) represents the density of \(\beta \sim PG(b, 0)\), with PG denoting the Pólya-gamma distribution (Polson et al. 2013). This result underlies the data augmentation described in Scott and Sun (2013) for developing Gibbs sampling or EM based routines for logistic regression, which can be extended to softmax functions. For that, we introduce latent variables \(\varvec{\beta } = \{\beta _{ik}\}_{i=1, k=1}^{N, K}\). In the softmax gating network, we are interested in learning \({\mathbf{V}}= [\varvec{v}_1, \ldots \varvec{v}_K]\). Note that this is equivalent to learning a multinomial classification model with data \(\{\varvec{x}_i,\varvec{z}_i\}_{i=1}^N\), albeit with the “labels” \(\varvec{z}_i\) being latent variables.

The traditionally defined softmax probability of the assignment of an input \(\varvec{x}_i\) to an expert k, without the latent variable augmentation, can be expressed as:

$$\begin{aligned} p(z_{ik} = 1 | {\mathbf{V}}, \varvec{x}_i) = (1+\exp (-\psi _{ik}))^{-1} \end{aligned}$$
(11)

where \(\psi _{ik} = \varvec{x}_i^T\varvec{v}_k - \log {}\sum _{l=1, l\ne k}^K\exp (\varvec{x}_i^T\varvec{v}_l)\). This expresses the softmax probability as a Bernoulli likelihood, which can be decomposed by setting \(a=b=1\) in the PG identity. Using the augmentation results from Scott and Sun (2013), we get \(\log p(z_{ik} = 1 | \varvec{x}_i, \beta _{ik}, {\mathbf{V}}) \propto \psi _{ik}/2 - \beta _{ik}\psi _{ik}^2/2\). This expression is substituted in the expression of the complete data log-likelihood in (5). The EM algorithm also requires the expectations of the PG latent variables, which luckily are available in closed form

$$\begin{aligned} \chi _{ik} = \mathbb {E}[\beta _{ik}] = 0.5\hat{\psi }_{ik}^{-1}\tanh (0.5\hat{\psi }_{ik}) \end{aligned}$$
(12)

as derived in Polson et al. (2013). However, despite the augmentation, the parameters of \({\mathbf{V}}\) remain coupled in the complete data log-likelihood due to the coupling in \(\psi _{ik}\). This is resolved by fixing the values of \(\varvec{v}_l, l\ne k\), to their most recent estimates while estimating \(\varvec{v}_k\), which is equivalent to an Expectation Conditional Maximization (ECM) routine. Assuming a Gaussian prior on \(\varvec{v}_k\), i.e., \(\varvec{v}_k \sim \mathcal {N}(0, \rho ^{-1}\mathbf{I})\), the resulting parameter updates for \(\varvec{v}_k\) are given by:

$$\begin{aligned} \hat{\varvec{v}}_k= & {} ({\mathbf{X}}^\top \hat{{\varvec{\varOmega }}}_k{\mathbf{X}}+ \rho \mathbf{I})^{-1}{\mathbf{X}}^\top (\hat{\kappa }_{1k}, \ldots , \hat{\kappa }_{Nk})^\top \end{aligned}$$
(13)
$$\begin{aligned} \hat{\kappa }_{ik}= & {} \eta _{ik}(0.5 + \chi _{ik}\log \sum _{l=1, l\ne k}^K\exp (\varvec{x}_i^\top \hat{\varvec{v}}_l))\nonumber \\ \hat{{\varvec{\varOmega }}}_k= & {} \text {diag} (\chi _{1k}\eta _{1k}, \ldots , \chi _{Nk}\eta _{Nk}) \end{aligned}$$
(14)
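
The resulting ECM sweep over the gating parameters (Eqs. 12–14) can be sketched as follows (illustrative names; each \(\varvec{v}_k\) is updated with the other \(\varvec{v}_l\) held at their latest values):

import numpy as np

def sg_pg_m_step(X, V, eta, rho=1.0, eps=1e-8):
    """One ECM sweep over v_1, ..., v_K using Eqs. 12-14."""
    N, d = X.shape
    K = V.shape[1]
    scores = X @ V                                               # (N, K): x_i^T v_l
    for k in range(K):
        # psi_ik = x_i^T v_k - log sum_{l != k} exp(x_i^T v_l)
        lse = np.log(np.exp(np.delete(scores, k, axis=1)).sum(axis=1) + eps)
        psi = scores[:, k] - lse
        # Eq. 12, taking the limit 1/4 as psi -> 0
        safe_psi = np.where(np.abs(psi) > eps, psi, 1.0)
        chi = np.where(np.abs(psi) > eps, 0.5 * np.tanh(0.5 * psi) / safe_psi, 0.25)
        kappa = eta[:, k] * (0.5 + chi * lse)                    # Eq. 14
        omega = chi * eta[:, k]                                  # diagonal of Omega_k
        A = X.T @ (omega[:, None] * X) + rho * np.eye(d)
        V[:, k] = np.linalg.solve(A, X.T @ kappa)                # Eq. 13
        scores[:, k] = X @ V[:, k]                               # refresh for subsequent k
    return V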

4.3 Logistic stick-breaking process gating

We now present a logistic stick-breaking gating network (Rigon and Durante 2017) based on the logistic stick-breaking prior (LSBP) (Ren et al. 2011). LSBP gating is a nonparametric gating network that partly relieves the practitioner of the need to tune the number of experts, as it “learns” this hyperparameter from the training data itself.

The LSBP construction specifies a large enough truncation level on the number of experts. The model prunes out unnecessary experts as warranted by the data, converging on the number of required experts. The gating probability for an input \(\varvec{x}_i\) is defined as:

$$\begin{aligned} \pi _k(\varvec{x}_i)= & {} {\left\{ \begin{array}{ll} \alpha _1(\varvec{x_i}) &{} \text {if}\quad k=1 \\ \alpha _k(\varvec{x_i}) \prod \nolimits _{l=1}^{k-1} (1-\alpha _l(\varvec{x}_i)) &{} \text {if}\quad 1< k < K \\ \prod \nolimits _{l=1}^{K-1} (1-\alpha _l(\varvec{x}_i)) &{} \text {if}\quad k=K \end{array}\right. } \end{aligned}$$
(15)
$$\begin{aligned} \alpha _k(\varvec{x}_i)= & {} \left( 1 + \exp \left( -\nu _k^T\varvec{x}_i\right) \right) ^{-1} \end{aligned}$$
(16)

This essentially means that a data point is assigned to the first expert with probability \(\alpha _1(\varvec{x_i})\), while for each subsequent gate the probability is the product of the probability that the point has not ‘yet’ been assigned to an earlier expert and the probability that it will be assigned to the given gate. It is easy to check that this is a proper distribution. From a computational perspective, we specify a fixed (but large) number of gates; each of the gates is assigned some (but not all) probability mass, and the remaining probability mass is assigned to the last gate/expert. This ensures that the probabilities still sum to unity, i.e., \(\sum _{k=1}^K\pi _k(\varvec{x}_i) = 1\), where K is the large number of experts/truncation level.
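
The stick-breaking computation of Eqs. 15–16 is straightforward to implement; the following is a small illustrative sketch (names are ours):

import numpy as np

def lsbp_gate_probs(X, nu):
    """Eqs. 15-16: stick-breaking gate probabilities with truncation level K.

    X  : (N, d) inputs
    nu : (d, K-1) logistic stick-breaking weights nu_1, ..., nu_{K-1}
    """
    alpha = 1.0 / (1.0 + np.exp(-X @ nu))            # (N, K-1): alpha_k(x_i)
    remaining = np.cumprod(1.0 - alpha, axis=1)      # prod_{l<=k} (1 - alpha_l(x_i))
    pi = np.empty((X.shape[0], nu.shape[1] + 1))
    pi[:, 0] = alpha[:, 0]
    pi[:, 1:-1] = alpha[:, 1:] * remaining[:, :-1]   # alpha_k * prod_{l<k} (1 - alpha_l)
    pi[:, -1] = remaining[:, -1]                     # leftover mass goes to the last expert
    return pi                                        # each row sums to 1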

Intuitively, the LSBP gating network prunes the provided set of experts because of its bias towards the initial experts. As an artifact of the stick-breaking process used in this construction, which is one possible representation of the Dirichlet process, experts are considered in increasing order of their index, and experts further down the queue are only considered if the previous experts were inadequate to explain the labels. This is in contrast to other gating functions, where all experts are treated equally and it therefore becomes likely that an increasing number of experts will overfit to the training data. We can effectively consider the LSBP gating as performing the role of a shrinkage prior. We demonstrate both the shrinkage by LSBP and the overfitting of the GG gating network with an increasing number of experts in our experiments.

The EM updates for LSBP gating can be derived, again, by plugging the gating probabilities into Eq. 5. As was the case for SG–PG gating, we need data augmentation for overcoming the non-conjugacy of the logistic terms. Hence, during the E step, the expected value of the auxiliary variable is as follows:

$$\begin{aligned} \omega _{ik} = \left( 2\varvec{x}_i^T\nu _k^{(t)}\right) ^{-1}\tanh \left( 0.5\varvec{x}_i^T\nu _k^{(t)}\right) \sum _{l=k}^K\pi _l(\varvec{x}_i) \end{aligned}$$
(17)

Then in the M-Step, using the auxiliary variable and gating probabilities from the previous step, the weights of the logistic are updated as follows:

$$\begin{aligned} \nu _k^{(t+1)}= & {} \left( {\mathbf{X}}^T\text {diag}(\omega _{1k}, \omega _{2k},\ldots , \omega _{nk}){\mathbf{X}}+ \varSigma _{\nu }^{-1}\right) ^{-1}\Big ({\mathbf{X}}^T(\kappa _{1k}, \kappa _{2k},\ldots ,\kappa _{nk})^T \nonumber \\&+ \varSigma _{\nu }^{-1}\mu _{\nu }\Big ) \end{aligned}$$
(18)
$$\begin{aligned} {\varvec{\varSigma }}_{\nu }^{-1}= & {} \lambda _{{\textit{invreg}}}\mathbf{I},\quad {\varvec{\mu }}_{\nu } \sim \mathcal {N}(0, \mathbf{I}) \end{aligned}$$
(19)
$$\begin{aligned} \kappa _{ik}= & {} \pi _{ik} - 0.5\sum _{l=k}^K\pi _{il} \end{aligned}$$
(20)

4.4 Hierarchical gating (hierarchical mixture of experts)

We now shift our focus to hierarchical MoEs (HME). First proposed in Jordan and Jacobs (1994), the HME assumes a hierarchical partitioning of the input space, providing an alternative to the partitioning provided by flat gating networks. Despite its success, an efficient algorithm with closed-form parameter updates has eluded hierarchical MoEs (to the best of our knowledge). Interestingly, we can again leverage latent-variable augmentations to construct HMEs that can be trained efficiently. The fundamental observation is that all internal (non-leaf) nodes in a hierarchical MoE are essentially performing classification and can thus be constructed using Bayesian SVMs. While we focus on the construction of HMEs using Bayesian SVMs, HMEs can similarly be constructed using logistic models together with Pólya-gamma augmentation.

We will restrict our discussion of hierarchical MoEs to complete binary trees (of arbitrary depth), as was done in Jordan and Jacobs (1994). In these models, the internal nodes play the role of gating and the leaf nodes play the role of experts. Since we are using only binary nodes, we construct our HME using (binary) Bayesian SVMs; multi-class Bayesian SVMs can be used in the case of non-binary trees.

Fig. 1 a A diagrammatic representation of the structure of a 3-level Hierarchical MoE. b Final decision boundaries learnt by a 7-level Hierarchical MoE on the banana dataset. c Partitions learnt by different nodes of a 3-level Hierarchical MoE (black: root partition; red and blue: children partitions) (Color figure online)

Since the notation for hierarchical MoEs can get cumbersome, we restrict our discussion to the hierarchical MoE shown in Fig. 1. Here, \(\varvec{W} = [\varvec{w}_1, \ldots \varvec{w}_7]\) denotes the weight vectors. Also, we assume the classification labels to be \(y \in \{0,1\}\). If an internal node predicts 0, the left child is chosen for the next prediction; otherwise the right child is chosen. Let \(z_{ij} = 1\) denote the event that node j was chosen in the prediction for \(\varvec{x} _i\). We can define the recursive probability rule (assuming integer division for indices, with % denoting the modulo operation)

$$\begin{aligned} p(z_{ik} = 1 | \varvec{x}_i, {\mathbf{W}}) \propto p\left( k\%2 | \varvec{x}_i, \varvec{w}_{\frac{k}{2}}\right) p(z_{i\frac{k}{2}} = 1 | \varvec{x}_i, {\mathbf{W}}) \end{aligned}$$
(21)

Note that \(p(y | \varvec{x}_i, \varvec{w}_k) \propto \exp (-2\max (0,1-(2y-1)\varvec{w}_k^T\varvec{x}_i))\) (the Bayesian SVM pseudo-likelihood, with \(y \in \{0,1\}\) mapped to \(\pm 1\)) and \(p(z_{i1} = 1 | \varvec{x}_i, {\mathbf{W}}) = 1\) (the root node is necessarily visited). The final prediction is a weighted combination of the predictions of the leaf nodes, with weights computed by normalizing \(p(z_{ij} = 1 | \varvec{x}_i, {\mathbf{W}})\) over the leaf nodes. When learning the model, we treat \(z_{ij}\) as a latent variable. For leaf nodes, we introduce \(\gamma _{ij}\) as the latent variable for the Bayesian SVM. For internal nodes, however, we introduce two latent variables for every example \(\varvec{x}_i\), namely \(\gamma _{ij}^0, \gamma _{ij}^1\): internal nodes take both labels (albeit with different probabilities), unlike the leaf nodes, which have a deterministic label from the training data. The likelihood function from Eq. 5 has to be changed to the following

$$\begin{aligned} \mathcal {L}(\varvec{y}, \mathbf{Z}, \varvec{\gamma } | {\mathbf{X}}, {\mathbf{W}}, {\mathbf{V}})= & {} \sum _{i=1}^N \sum _{k=4}^7 z_{ik}(\log {p(y_i, \gamma _{ik}| \varvec{x}_i, \varvec{w}_k)}\nonumber \\&+ \log {p(z_{ik} = 1, \varvec{\gamma } | \varvec{x}_i, {\mathbf{W}})}) \end{aligned}$$
(22)

Here, \(p(z_{ik}=1, \varvec{\gamma } | x_i, {\mathbf{W}}) \propto p(k\%2, \gamma _{i\frac{k}{2}}^{k\%2} | \varvec{w}_{k/2}, \varvec{x}_i) \times p(z_{i\frac{k}{2}} = 1, \varvec{\gamma } | \varvec{x}_i, {\mathbf{W}})\). Now, we need to compute the following expectations in the E step (besides \(\eta _{ik}\) which retains its definition)

$$\begin{aligned} \tau _{ik}^j = \mathbb {E}\left[ (\gamma _{ik}^j)^{-1}\right] = | 1-(2j-1)\varvec{x}_i^\top \varvec{w}_k|^{-1} \text { } (j=0,1, k \not \in \text {leaf}) \end{aligned}$$
(23)

\(\tau _{ik}\) for leaf nodes retains its original definition. Maximizing the expected complete data log-likelihood, we get the update \( \varvec{w}_k = ({\mathbf{X}}^\top {\mathbf{A}}_k{\mathbf{X}}+ \lambda \mathbf{I})^{-1}\left( \sum _{i=1}^Nb_{ik}\varvec{x}_i\right) \)

Here, \({\mathbf{A}}_k = \text {diag}(a_{1k}, \ldots a_{nk})\), \(a_{ik} = \eta _{ik}\tau _{ik}^0 + \eta _{ik}'\tau _{ik}^1\) and \(b_{ik} = \eta _{ik}'(1 + \tau _{ik}^1) - \eta _{ik}(1 + \tau _{ik}^0)\). For internal nodes, \(\eta _{ik}\) represents the posterior probability that \(\varvec{x}_i\) goes into the left subtree of node k, which equals the sum of the posterior probabilities of the experts in its left subtree. Similarly, \(\eta _{ik}'\) represents the posterior probability of the right subtree. As a sanity check, applying this update to the leaf nodes recovers the MAP estimate of the Bayesian SVM (Sect. 2.2).
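
For the 3-level tree of Fig. 1, the recursion in Eq. 21 can be unrolled as in the sketch below (nodes numbered 1–7, with node 1 the root and nodes 4–7 the leaf experts; for brevity the internal decisions are scored with a logistic sigmoid here, whereas the construction in this section uses Bayesian SVM pseudo-likelihoods at the internal nodes):

import numpy as np

def hme_leaf_gate_probs(X, W):
    """Eq. 21 for a 3-level complete binary tree (nodes 1..7, leaves 4..7).

    X : (N, d) inputs
    W : (d, 7) weight vectors w_1..w_7, stored in columns 0..6
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    p_node = {1: np.ones(X.shape[0])}                 # p(z_{i1}=1 | x_i, W) = 1 at the root
    for j in (1, 2, 3):                               # internal nodes
        p_right = sigmoid(X @ W[:, j - 1])            # probability of label 1 at node j
        p_node[2 * j] = p_node[j] * (1.0 - p_right)   # label 0 -> left child 2j
        p_node[2 * j + 1] = p_node[j] * p_right       # label 1 -> right child 2j+1
    leaves = np.column_stack([p_node[k] for k in (4, 5, 6, 7)])
    return leaves / leaves.sum(axis=1, keepdims=True) # normalized leaf weights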

4.5 Discussion on gating networks

With the constructions discussed in the previous sections, we now have a plug-and-play setup where we can switch gating networks independently of the experts. While it is difficult to say a priori which network will perform better, some properties of these networks merit discussion. GG is a fairly robust construction that has in part contributed to the success of MoEs (Yuksel et al. 2012; Meeds and Osindero 2006). The problem of having \(\mathcal {O}(KD^2)\) parameters can, to an extent, be ameliorated by using diagonal covariances, as suggested earlier. However, our other gating network architectures (softmax, LSBP, HME) require far fewer parameters to be estimated, without further approximations.

The real benefit of the augmentation schemes, however, can be seen in the SG–PG and HME constructions. Softmax gating networks previously relied on an IRLS optimization within each EM iteration, which is extremely expensive, while hierarchical models never had an algorithm with closed-form updates and likewise relied on such iterative schemes for optimization. Both these constructions can be efficiently learned using the variable augmentation schemes we have discussed.

We have also discussed LSBP gating, which provides a nonparametric construction to automatically prune experts, reducing the number of hyperparameters associated with the model and thereby relieving the practitioner of the computational burden of tuning the number of experts. This is a significant (and desirable) modeling departure from current machine learning models/algorithms, which seem to be predicated upon an ever-increasing number of hyperparameters. However, there is a tradeoff here. GG has only K augmentations per data point, SG–PG and LSBP have 2K, while HME has \(3K-2\). The more augmentations there are, the more susceptible the model becomes to poor initialization. As expectation–maximization only guarantees convergence to a local maximum (Balakrishnan et al. 2017), poor initialization becomes an issue as the number of augmentations increases. We can partly resolve this by re-running models with different random initializations; a better analysis of the initialization of augmented models will be a fruitful future endeavor.

5 Related work

Mixture of Experts (MoE) based models have a rich history in machine learning, starting with the early works of Jacobs et al. (1991) and Jordan and Jacobs (1994); an excellent survey can be found in Yuksel et al. (2012). Advances in MoE based models have been in several directions, primarily focusing on better gating networks, e.g., generative gating (Xu et al. 1995; Meeds and Osindero 2006), or on using more powerful/flexible experts, such as Gaussian Processes (Yuan and Neubauer 2009). However, these MoE formulations are difficult to do inference on and, moreover, the experts are limited to probabilistic models. The MoE classification setting is even more challenging because not only is the gating network difficult to learn, but learning the experts is also difficult due to the non-conjugacy of the underlying model. The hierarchical MoE (HMoE) model offers better interpretability than a flat MoE; however, inference in HMoE is even more challenging, which has precluded its adoption by practitioners. Our proposed large-margin MoE framework, with diverse choices for the gating network and efficient closed-form inference in each case, is an attempt to address these issues.

From the perspective of MoE’s ability to learn nonlinear models, our Bayesian SVM based MoE framework is similar in spirit to the work on kernel-based Bayesian SVMs proposed recently (Henao et al. 2014). However, their model does not admit as simple an inference procedure as ours and is also much more computationally expensive. In our experiments, we compare with this model, and our results show that our model achieves much better classification accuracies despite not using kernels. Therefore, our Bayesian SVM based MoE framework can also be an attractive method for Bayesian nonlinear large-margin classification without having to use kernels.

6 Experiments

We present experimental results for our proposed MoE framework on several benchmark datasets and compare with various state-of-the-art baselines, both MoE based as well as Bayesian models designed for nonlinear classification. We retain our abbreviations for the four variants from the previous sections.

Fig. 2 A visualization of the decision boundaries learnt by our Gaussian generative gating model as the number of experts is increased from 1 to 50

First, we show a visualization of the decision boundaries learned by our MoE model (with the generative gating network) on the banana dataset as the number of experts is varied (Fig. 2). It is clear that a larger number of experts is able to generate increasingly nonlinear decision boundaries.

6.1 Experimental settings

Our MoE setup has two main hyper-parameters: the number of experts (for flat gating networks) or the number of levels (for the hierarchical gating network), and a regularization parameter for the weight vectors of the experts and the gating networks. Throughout this work, we use zero-mean Gaussian priors on the weight vectors, which amounts to L2 regularization over the weight vectors of both the gating networks and the experts.

The weight matrices for our models are initialized using random samples from \(\mathcal {N}(0, I)\). We experimented with 2 to 20 experts for the flat models, while we varied the number of levels between 3 and 6 (equivalent to 4–32 experts) for the hierarchical models. The regularization parameter was tuned between 0.01 and 100. Finally, we set a cap of 25 iterations for generative gating (GG) and 100 for the other networks (as GG tends to converge much earlier than the others), with a convergence threshold of 0.01 on the expected complete log-likelihood. To tune these hyper-parameters, we perform a simple grid search over pre-specified values from the respective ranges given above.
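
The hyperparameter search described above amounts to a simple nested loop over the grid; a sketch of the procedure (train_moe and error_rate are placeholders for whichever gating/expert combination is being tuned) is:

import numpy as np
from itertools import product
from sklearn.model_selection import KFold

def grid_search_moe(X, y, train_moe, error_rate,
                    experts_grid=(2, 5, 10, 20),
                    reg_grid=(0.01, 0.1, 1.0, 10.0, 100.0),
                    n_folds=5, seed=0):
    """Cross-validated grid search over (number of experts, regularization strength)."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    best_config, best_err = None, np.inf
    for K, lam in product(experts_grid, reg_grid):
        fold_errs = []
        for tr, va in kf.split(X):
            model = train_moe(X[tr], y[tr], n_experts=K, reg=lam)   # placeholder trainer
            fold_errs.append(error_rate(model, X[va], y[va]))       # placeholder metric
        if np.mean(fold_errs) < best_err:
            best_config, best_err = (K, lam), np.mean(fold_errs)
    return best_config, best_err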

Table 1 Description of all the datasets used in the experimentation section, where a hyphen under \(N_{{\textit{test}}}\) indicates that the dataset did not come with a pre-defined train/test split

To ascertain the best combination of hyper-parameters, we perform cross-validation over the training set, similar to the procedure in Henao et al. (2014). For the datasets in Table 3 we perform fivefold validation, while for the datasets in Table 2 we perform tenfold validation, along the lines of Henao et al. (2014) and Zhou (2016) respectively. For the latter, we reuse the first ten pre-defined splits available online (see footnote 1). The dataset statistics are summarized in Table 1.

6.2 Baselines

We compare our model against various MoE and non-linear SVM baselines. We briefly describe these baselines below:

  • Sum-Stacked-Softplus (SS-\(\tau \)) classifier and Stacked-Softplus classifier (Stack-\(\tau \)) (Zhou 2016): These are the present state-of-the-art MoE classification models that use a family of softplus functions, convolving countably infinite stacked gamma distributions. Here SS-\(\tau \) consists of hierarchies (akin to our Hierarchical MoE), while Stack-\(\tau \) is flat.

  • Adaptive Multi-Hyper-plane Machines (AMM) (Wang et al. 2011): AMMs provide a framework to adapt the number of hyper-planes used for classification. It prunes the weights of the learnt model to improve computational efficiency at test time for SVMs.

  • Optimally Sparse SVM (OSSVM) (Cotter et al. 2013): OSSVM provides an algorithm to sparsify the set of support vectors learnt from the data when doing non-linear classification with kernel SVMs. It provides theoretical guarantees on the number of support vectors learnt by the algorithm.

  • Bayesian Non-Linear SVM (BSVM) (Henao et al. 2014): A Bayesian formulation of the SVM hinge loss which uses a Gaussian process classifier, optimized using Expectation Conditional Maximization (ECM).

  • GP Classifier (GPC): Standard off-the-shelf GP based classifier which uses a Gaussian kernel for non-linear classification.

  • Relevance Vector Machines (RVM) (Tipping 2001): Probabilistic formulation of SVMs using generalized linear models for classification.

  • RBF-Kernel SVM (RBF-SVM) and Logistic Regression (LR): Standard baselines for comparison.

Table 2 Classification error rates (Part 1) (results in bold indicate the lowest mean error for the dataset)
Table 3 Classification error rates (Part 2) (results in bold indicate the lowest mean error for the dataset)

The results, summarized in Tables 2 and 3, show that most of our models outperform the baselines on all datasets except image and IJCNN (where GG and LSBP come close). Particularly notable is the performance of our models on the breast cancer and Sonar datasets, where all our models outperform the baselines by a clear margin.

We have also benchmarked against the Small Variance Dirichlet Process Mixture SVM (\(M^{2}DPM\)) proposed by Wang and Zhu (2014), infinite-SVM (Zhu et al. 2011) and dpMNL (Shahbaba and Neal 2009). The comparisons with these models can be found in the “Appendix”.

Fig. 3 Confusion matrices for our models on the image dataset (left to right: GG, LSBP, PG, HME)

In Fig. 3, we provide confusion matrices as an analysis of the errors made by the different gating networks.

6.3 Automatic tuning of experts using LSBP

Fig. 4 Left: test accuracy versus the number of experts for the GG and LSBP models. Right: number of points assigned to each expert on the banana dataset with 10 experts

LSBP gating successfully prunes out unnecessary experts without overfitting, while GG tends to overfit as K increases. This is shown in the left plot of Fig. 4, which compares GG and LSBP and indicates the robustness of LSBP to an increasing number of experts. For the plot on the right, we compared the distribution of training data assignments to the experts when a large number of experts is specified (\(K=16\)). Here, the LSBP gating model assigns the data points to only the first three experts, whereas GG gating distributes the data points among all the experts. This clearly shows the LSBP model’s ability to perform shrinkage and provide comparable performance with a smaller number of experts.

6.4 Timing experiments

6.4.1 Hierarchical MoE

We first compare our HME (VA-HME) with a traditional HME (BFGS-HME) formulation of the same structure, which approximately follows Jordan and Jacobs (1994). For the traditional HME, we have a softmax classifier at both internal and leaf nodes; within the EM iterations, the weights of the nodes are optimized using L-BFGS. We carry out a fixed number of EM iterations (100). The experiment is repeated 20 times for every tree level, varied from 2 to 7. We report both real (wall-clock) time and CPU time. Note that both implementations are in Python, using NumPy and SciPy; in particular, no effort was made to parallelize the code (although built-in library functions may parallelize internally, which explains the observed difference in real time). From the data, our HME model is nearly 5 times faster in real time and 2.5 times faster in computation time (Fig. 5).

Fig. 5 Timing comparison of our proposed inference algorithm for HME and the traditional EM-BFGS based algorithm for HME

6.4.2 Flat MoE

To emphasize the role of closed-form (CF) updates, we perform a runtime comparison of different MoE models. We compare with GG + Logistic Regression (LR) experts, where the LR experts are optimized using L-BFGS, and with Softmax Gating (SG) + Bayesian SVM experts, where SG is optimized using L-BFGS. We also compare with SG+LR, which performs gradient descent (using Adam) directly on the incomplete likelihood.

Table 4 Running-time (in seconds) for different models as a function of number of experts on image dataset [Format: User Time (System Compute Time)]

Note that our models with closed-form updates, which are implemented in Python + NumPy, are at a significant disadvantage relative to the other implementations, which have highly optimized C-based backends (PyTorch for Adam, SciPy for L-BFGS). Yet, from Table 4, we can see that (1) our BSVM based MoE models are much faster than traditional LR based MoE, which relies on iterative updates (no closed-form updates), and (2) L-BFGS scales rather poorly.

Table 5 A comparison of training times of generative gating with several baseline models and varying data dimensionality

Next, we evaluate the training time of generative gating as the data dimensionality increases (Table 5). This shows how our models scale with dimensionality when compared with other baselines, including the standard kernel SVM. Our proposed models still present themselves as a computationally efficient alternative.

Fig. 6 Training accuracy as a function of time (left) and as a function of iterations (right) with 128 experts for all models

Finally, we compare the number of iterations required by the different iterative algorithms to converge. As can be seen from Fig. 6 (right), our proposed models and the competing models require a comparable number of iterations to reach their maximum accuracies (except for the Adam-optimized SG+LR MoE model, which needs a large number of iterations).

We also visualize the training accuracies as a function of compute time in Fig. 6 (left). This figure again illustrates that both of our models achieve higher accuracies in less time, which further strengthens the argument for closed-form updates.

6.5 On the role of Bayesian SVMs

Table 6 Error rate comparison to show the significance of Bayesian SVM as experts (CF: Closed Form Updates) on Set I

We now highlight the role of Bayesian SVMs as experts. In particular, we compare against Logistic Regression experts (GG+LR) instead of Bayesian SVM experts (GG+BSVM), and against a non-closed-form, L-BFGS-optimized version of Softmax Gating that still uses Bayesian SVM experts (SG+BSVM). The results are summarized in Table 6. Bayesian SVMs consistently improve accuracy compared to Logistic Regression experts. SG+BSVM also gives performance similar to GG+BSVM, which indicates that the Bayesian SVM experts are responsible for the performance boost observed in our earlier experiments.

7 Conclusion

We have presented a novel mixture-of-experts framework which uses large-margin classifiers as experts via the Bayesian formulation of the SVM, and which supports a variety of gating networks, each admitting a simple inference procedure with closed-form updates for the model parameters. One of the gating networks (LSBP) also allows learning the number of experts from data, while the hierarchical MoE offers more interpretability than MoE with gating networks that rely on a flat partitioning. Our framework, with each of the gating networks, also enjoys conjugacy in all of its components (both the Bayesian SVM experts and the gating network), which makes inference easy. In addition to providing a rich and flexible framework for learning MoE based models, as our experiments demonstrate, our framework is also competitive with Bayesian nonlinear classification models that use kernels (Henao et al. 2014).