Continuous Semi-Supervised Nonnegative Matrix Factorization
Next Article in Journal
Editorial: Surveys in Algorithm Analysis and Complexity Theory (Special Issue)
Previous Article in Journal
Temporal Multimodal Data-Processing Algorithms Based on Algebraic System of Aggregates
Previous Article in Special Issue
A Comparison of Different Topic Modeling Methods through a Real Case Study of Italian Customer Care
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Continuous Semi-Supervised Nonnegative Matrix Factorization

1
School of Mathematical and Statistical Sciences, The University of Texas Rio Grande Valley, Edinburg, TX 78539, USA
2
Department of Mathematics, University of California Los Angeles, Los Angeles, CA 90095, USA
*
Author to whom correspondence should be addressed.
Algorithms 2023, 16(4), 187; https://doi.org/10.3390/a16040187
Submission received: 15 January 2023 / Revised: 21 February 2023 / Accepted: 28 February 2023 / Published: 30 March 2023
(This article belongs to the Special Issue Algorithms for Non-negative Matrix Factorisation)

Abstract

:
Nonnegative matrix factorization can be used to automatically detect topics within a corpus in an unsupervised fashion. The technique amounts to an approximation of a nonnegative matrix as the product of two nonnegative matrices of lower rank. In certain applications it is desirable to extract topics and use them to predict quantitative outcomes. In this paper, we show Nonnegative Matrix Factorization can be combined with regression on a continuous response variable by minimizing a penalty function that adds a weighted regression error to a matrix factorization error. We show theoretically that as the weighting increases, the regression error in training decreases weakly. We test our method on synthetic data and real data coming from Rate My Professors reviews to predict an instructor’s rating from the text in their reviews. In practice, when used as a dimensionality reduction method (when the number of topics chosen in the model is fewer than the true number of topics), the method performs better than doing regression after topics are identified—both during training and testing—and it retrains interpretability.

1. Introduction

Nonnegative matrix factorization (NMF) is a highly versatile data science technique with far-reaching applications. It can identify thematic elements, i.e., groups of words that appear frequently together in a corpus, which together convey a common message [1]. More generally, it can be used to decompose an image into identifiable patterns [2] and as a general-purpose dimensionality reduction or preprocessing method before applying other machine learning methods, as has been done in studying various diseases [3,4]. Similar to singular value decomposition (SVD) [5], NMF provides a low rank factorization. In NMF, a nonnegative matrix X R 0 n × m representing a corpus (or other nonnegative dataset) is factored into a low rank approximation X W H where the inner dimension, r, between W and H is such that r m and r n ; however, unlike SVD, there is an additional constraint that both W and H are nonnegative, i.e., W R 0 n × r and H R 0 r × m . This nonnegativity enforces that the data in X are represented by a nonnegative combination of the dictionary atoms in the factorization, which lends itself to human interpretability. For example, in the foundational work [2], Lee and Seung show that NMF, when applied to facial images, decomposes the images into recognizable parts such as noses, eyes, and mouths.
When applied to a document-term matrix [6] X, where row i of X represents document i and column j represents the frequencies of word j across the documents, the classical NMF method amounts to
( W , H ) = arg min W R 0 n × r , H R 0 r × m | | X W H | | F 2
where, for A R n × m , | | A | | F denotes the Frobenius norm of A with | | A | | F 2 = tr ( A T A ) = i = 1 n j = 1 m | A i j | 2 . Other variations on the penalty function exist, including the Kullback–Leibler divergence [7]. After computing W and H, we interpret row j of H as the jth topic, its components being the weight of each word in that topic, and row i of W as the topic-encoding of document i, i.e.,
X i , : j = 1 r W i , j H j , : .
Throughout this manuscript we make use of “colon notation” where “:“ means the full range of indices for a row/column, “a:b” indicates a consecutive range of indices from a to b, etc.
The rest of our paper is organized as follows: the general context to our work and our main contributions are stated in Section 2; in Section 3, we provide the formulation of our method, its algorithmic implementation, its theoretical properties, and their proofs; in Section 4, we provide a proof of concept through synthetic data; in Section 5, we test our method on a real dataset from Rate My Professors; and finally we conclude our work in Section 6.

2. Relation to Current Work and Contributions

When designing algorithms to handle multiple objectives simultaneously, a weighted penalty function that combines the multiple objectives is often used [8]. For example, with LASSO regression [9], one seeks a linear model that also does model selection by zeroing out parameters that are less significant. This can be done by minimizing a least squares error with a weighted penalty for the 1 -norm of the parameters. Prior authors have combined NMF with a linear regression procedure to maximize the predictive power of a classifier [10,11,12]. This is accomplished through a penalty function that combines NMF with another objective function—a (semi) supervised approach. Semi-supervised NMF can also be applied to guide NMF to identify topics with desired keywords [13].
In this paper, we combine NMF with a linear regression model to predict the value of a continuous response variable. We consider synthetic data along with a real dataset that pairs written commentary with a real-valued observation. In particular, we use reviews from the Rate My Professors website [14,15] that include all student comments for a professor along with the mean rating in [1, 5]. Due to the averaging, the rating is effectively a continuous variable.

3. Model

We provide the framework for our proposed continuous semi-supervised nonnegative matrix factorization method (CSSNMF).

3.1. Formulation

We consider having a document-term matrix [6] X R 0 n × m for n documents with their associated word frequencies in the m columns—a “bag of words” where each document is represented only by the frequencies of its words. Each document has a corresponding value in R so that we can associate with X the vector Y R n × 1 . Put another way: each document is represented as a row of X, call it x R 0 1 × m , which stores the frequencies of each of the m words within the corpus; then, to each such x there is an observation y R (and over n documents, this generates Y R n × 1 ). We choose r N and λ 0 as hyper-parameters where r denotes the number of topics and λ is weight put on the regression error.
In the real data that we look at, each row of X will represent the reviews written for one university instructor, with frequencies of the words in the columns. For each instructor in the dataset, the mean value of their respective student ratings will be a single component of Y. From a predictive standpoint, we would like to predict the mean rating of an instructor based only on the words in their review, i.e., take a vector x R 0 1 × m of the word frequencies and make a prediction y ^ of their mean rating (the hat indicates a prediction). We want the prediction y ^ to be as close to the true mean rating as possible. The topic modelling aspect of this is that instead of using the full x vector of dimension m, we approximate x as a nonnegative linear combination of r topic vectors (interpretable vectors of word frequencies). We effectively compress x to a vector w R 0 1 × r of dimension r , and we model the rating as a linear combination of the components of w.
Given W R 0 n × r , H R 0 r × m , and θ R ( r + 1 ) × 1 , we define a penalty function that combines topic modelling with a linear regression based on the topic representations. The intuition with the weighting is that as the weight λ increases, topic modelling is still done, but more and more emphasis is put on producing an accurate regression on Y. We define
F ( λ ) ( W , H , θ ; X , Y ) = N ( W , H ; X ) + λ R ( W , θ ; Y ) , where
N ( W , H ; X ) : = | | X W H | | F 2
R ( W , θ ; Y ) : = | | W ˜ θ Y | | 2
and where W ¯ R n × ( r + 1 ) is given by
W ¯ : = 1 W 1 , 1 W 1 , m 1 W 2 , 1 W 2 , m 1 W n , 1 W n , m .
The matrix W ¯ with its column of 1s allows for an intercept: given a topic representation w R 1 × r , we predict a value y ^ = θ 1 + θ 2 w 1 + + θ r + 1 w r .
When  λ > 0 , we seek
( W ( λ ) , H ( λ ) , θ ( λ ) ) = arg min W , H , θ F ( λ ) ( W , H , θ ; X , Y ) .
When  λ = 0 , we define
( W ( 0 ) , H ( 0 ) ) = arg min W , H N ( W , H ; X )
θ ( 0 ) = arg min θ R ( W ( 0 ) , θ ; Y ) .
We also impose a normalization constraint, that
i , j = 1 m H i j = 1
so that the topics have unit length in 1 .
Remark 1
(Sum of Topic Representations). If X W H is normalized so its rows sum to 1 then it is also the case that i , j = 1 r W i j 1 by noting that X i j = k = 1 r W i k H k j and summing over j.
When λ = 0 , θ has no effect upon F ( λ ) and we first perform regular NMF over W and H and, as a final step, we choose θ to minimize the regression error. In other words, if λ = 0 , we do NMF first and then find the best θ given the already determined weights for each document. It seems intuitive, however, that the regression could be improved if θ and W both were being influenced by the regression to Y, which is what our method aims to do when λ > 0 . From a practical perspective, if λ , then the regression error becomes dominant and we may expect the topics as found in H to be less meaningful. In Section 3.2, we state some theoretical properties of our method as it is being trained.
Once H ( λ ) and θ ( λ ) are known, we can make predictions for the response variable corresponding to a document. This amounts to finding the best nonnegative topic encoding w R 0 1 × r for the document and using that encoding in the linear model—see Section 3.3.
Remark 2
(Uniqueness). Using our established notation, we remark that if X * = W H and Y * = θ 1 + W θ 2 : ( r + 1 ) , then X * = W ˜ H ˜ and Y * = θ 1 + W ˜ θ ˜ , where W ˜ = W S , H ˜ = S 1 H , and θ ˜ = S 1 θ 2 : ( r + 1 ) for any invertible S R r × r with S W and S 1 H both having all entries nonnegative. Thus, uniqueness of an optima, if it exists, can only be unique up to matrix multiplications.

3.2. Theoretical Results

We present two important behaviors of CSSNMF with regard to increasing λ and its effect upon predicting the response variable.
Proposition 1
(Regression Error with Nonzero λ ). For λ 0 , let W ( λ ) , H ( λ ) , θ ( λ ) be a unique (as per Remark 2) global minimum to Equations (6)–(9). Then R ( W ( λ ) , θ ( λ ) ) R ( W ( 0 ) , θ ( 0 ) .
Theorem 1
(Weakly Decreasing Regression Error). Let 0 λ 1 < λ 2 be given where W ( λ i ) , H ( λ i ) , θ ( λ i ) are the unique (as per Remark 2) global minimizers of Equations (6)–(9) for i = 1 , 2 . Then R ( W ( λ 2 ) , θ ( λ 2 ) ; Y ) R ( W ( λ 1 ) , θ ( λ 1 ) ; Y ) .
Remark 3.
Proposition 1 and Theorem 1 are based on obtaining a global minimum. In practice, we may only find a local minimum.
Proposition 1 and Theorem 1 are statements pertaining to training the model. Assuming we have the optimal solutions, Proposition 1 tells us that the regression error for λ > 0 is no worse than the regression error with λ = 0 and could, in fact, be better. Thus, the intuition that selecting topics while paying attention to the regression error is practical. Then Theorem 1 says that the regression error is weakly monotonically decreasing as λ increases. In practical application, we find the error strictly monotonically decreases.
Before proceeding to algorithmic procedures, we prove Proposition 1 and Theorem 1.
Proof of Proposition 1.
If λ > 0 then
F ( λ ) ( W ( λ ) , H ( λ ) , θ ( λ ) ; X , Y ) F ( λ ) ( W ( 0 ) , H ( 0 ) , θ ( 0 ) ; X , Y ) N ( W ( λ ) , H ( λ ) ; X ) + λ R ( W ( λ ) , θ ( λ ) ; Y ) N ( W ( 0 ) , H ( 0 ) ; X ) + λ R ( W ( 0 ) , θ ( 0 ) ; Y ) λ ( R ( W ( λ ) , θ ( λ ) ; Y ) R ( W ( 0 ) , θ ( 0 ) ; Y ) ) N ( W ( 0 ) , H ( 0 ) ; X ) N ( W ( λ ) , H ( λ ) ; X ) 0 .
The first inequality comes from how ( W ( λ ) , H ( λ ) , θ ( θ ) ) are defined by Equation (6). The final inequality comes from how ( W ( 0 ) , H ( 0 ) ) are defined as minimizers in Equation  (7).
Since we first assumed λ > 0 , we obtain R ( W ( λ ) , θ ( λ ) ; Y ) R ( W ( 0 ) , θ ( 0 ) ; Y ) . Finally, if λ = 0 , then there is equality with R ( W ( λ ) , θ ( λ ) ; Y ) = R ( W ( 0 ) , θ ( 0 ) ; Y ) .    □
Proof of Theorem 1.
Note that if λ 1 = 0 , then Theorem 1 already applies, so we assume 0 < λ 1 < λ 2 . We have that
F ( λ 1 ) ( W ( λ 1 ) , H ( λ 1 ) , θ ( λ 1 ) ; X , Y ) F ( λ 1 ) ( W ( λ 2 ) , H ( λ 2 ) , θ ( λ 2 ) ; X , Y ) λ 1 R ( W ( λ 1 ) , θ ( λ 1 ) ; Y ) R ( W ( λ 2 ) , θ ( λ 2 ) ; Y ) N ( W ( λ 2 ) , H ( λ 2 ) ; X ) N ( W ( λ 1 ) , H ( λ 1 ) ; X ) .
We also have
λ 2 ( R ( W ( λ 2 ) , θ ( λ 2 ) ; Y ) R ( W ( λ 1 ) , θ ( λ 1 ) ; Y ) ) N ( W ( λ 1 ) , H ( λ 1 ) ; X ) N ( W ( λ 2 ) , H ( λ 2 ) ; X ) .
Adding Equations (10) and (11) together,
( λ 1 λ 2 ) R ( W ( λ 1 ) , θ ( λ 1 ) ; Y ) + ( λ 2 λ 1 ) R ( W ( λ 2 ) , θ ( λ 2 ) ; Y ) 0
which, upon dividing by λ 2 λ 1 > 0 , directly gives
R ( W ( λ 2 ) , θ ( λ 2 ) ; Y ) R ( W ( λ 1 ) , θ ( λ 1 ) ; Y ) .
   □

3.3. Algorithm

Our minimization approach is iterative and based on the alternating nonnegative least squares [16] approach. Due to the coupling of NMF and regression errors, other approaches such as multiplicative or additive updates [17] are less natural. Each iteration consists of: (1) holding H and θ fixed while optimizing each row of W separately (nonnegative least squares); (2) holding W and θ fixed while optimizing each column of H separately (nonnegative least squares); and finally (3) holding W and H fixed while optimizing over θ . The error is nondecreasing between iterations and from one optimization to the next. We now derive and justify this approach (Algorithm 1) in increasing complexity of cases.  
Algorithm 1: Overall CSSNMF algorithm.
Input   :A matrix X R 0 n × m ,
              a vector Y R n × 1 ,
              a positive integer r N ,
              a scalar λ 0 ,
              a relative error tolerance τ > 0 , and
              a maximum number of iterations m a x I t e r .
Output:Minimizers of Equations (6)–(9): nonnegative matrix W R 0 n × r ,
              nonnegative matrix H R 0 r × m , and
              vector θ R ( r + 1 ) × 1 .
1 r e l E r r = , e r r =
2Elementwise, W U n i f ( [ 0 , | | X | | ) ) , H U n i f ( [ 0 , | | X | | ) ) ,
  θ U n i f ( [ 0 , | | X | | ) ) .
3 i t e r = 0
4while  r e l E r r > τ  and  i t e r < m a x I t e r  do
5 W n e w W as per Algorithm 2
6 H n e w H as per Algorithm 3
7 θ n e w θ as per Algorithm 4
8Normalize W, H, and θ as per Algorithm 5
9 e r r T e m p = F ( λ ) ( W , H , θ ; X , Y )
10if  e r r <  then
11  r e l E r r | e r r e r r T e m p | / e r r
12end if
13 e r r e r r T e m p
14   i t e r i t e r + 1
15end while (
16return  W , H , θ
Algorithm 2: Updating W.
Algorithms 16 00187 i001
Algorithm 3: Updating H.
Algorithms 16 00187 i002
Algorithm 4: Updating θ .
Input:A vector Y R n × 1 , and
 a matrix W R 0 n × r
Output:A new value for θ .
1 e = ( 1 , 1 , …. , 1 ) T R n × 1
2 W ¯ = e | W
3return  W ¯ + Y
Algorithm 5: Normalization process.
Input:A matrix W R 0 n × r ,
 a matrix H R 0 r × m , and
 a vector θ R ( r + 1 ) × 1 .
Output:New values for W, H, and θ .
1 S R 0 r × 1 a vector of row sums of H.
2 S diag ( S )
3 W W S .
4 H S 1 H .
5 θ 2 : ( r + 1 ) S 1 θ 2 : ( r + 1 ) .
6return W, H, and θ .
  • W and H fixed. 
If W and H are given and only θ can vary, then Equation (2) is minimized when | | W ¯ θ Y | | 2 is minimized. If W ¯ has full rank or is overdetermined, this happens when the error, W ¯ θ Y , is orthogonal to the column span of W ¯ or that
θ = ( W ¯ T W ¯ ) 1 W ¯ T Y .
When W ¯ is underdetermined or has full rank, we require W ¯ θ = Y with | | θ | | minimized (for uniqueness) whereby
θ = W ¯ T ( W ¯ W ¯ T ) 1 Y .
See Algorithm 4. Thus, when W ¯ does not have full rank, the solution is θ = W ¯ + Y where W ¯ + is the pseudoinverse of W ¯ . See [18] for a discussion of pseudoinverses. In applications, we only expect to see overdetermined systems because the number of topics r should be less than the number of documents n .  
  • W and θ fixed.  
If W and θ are given and only H can change, then minimizing Equation (2) requires minimizing N ( W , H ; X ) . We can expand this error term out in the columns of H:
N ( W , H ; X ) = | | X W H | | F 2 = j = 1 m | | ( X W H ) : , j | | 2 = j = 1 m | | X : , j W H : , j | | 2 .
Because columnwise the terms of the sum are independent, we can minimize each column H : , j of H separately to minimize the sum, i.e.,
H : , j = arg min h R 0 r × 1 | | X : , j W h | | 2 , j = 1 , 2 , , m ,
as given in Algorithm 3. 
  • H and θ fixed.  
When H and θ are fixed, then Equation (2) can be written out as
F ( λ ) = | | X W H | | F 2 + λ | | W ¯ θ Y | | 2 = i = 1 n | | ( X W H ) i , : | | 2 + λ i = 1 n ( W ¯ θ Y ) i 2 = i = 1 n | | X i , : W i , : H | | 2 + i = 1 n λ ( θ 1 e + W i , : θ ¯ Y ) i 2
where e = ( 1 , 1 , , 1 ) T R n × 1 and θ ¯ = ( θ 2 , , θ r + 1 ) T R r × 1 . Defining matrices
X ¯ = X | λ ( θ 1 e Y )
H ¯ = H | λ θ ¯
we can rewrite Equation (15) as
F ( λ ) = i = 1 n | | X ¯ i , : W i , : H ¯ | | 2 ,
which can be minimized through
W i , : = arg min w R 0 1 × r | | X ¯ i , : w H ¯ | | 2 , i = 1 , 2 , , n .
This is precisely Algorithm 2.
For our optimizations and linear algebra, we used Numpy [19] and SciPy [20]. The nonnegative least squares routine employs an active set method to solve the least squares minimization problem with inequality constraints [21]. The active set method amounts to gradually building up the set of active constraints (those for which their not being enforced in the unconstrained problem would result in constraint violation or equality, i.e., a regression variable with a nonnegativity constraint being less than or equal to 0) and then optimizing over the passive set (all variables not in the active set) once identified with equality constraints on the active set [22]. This method can also be parallelized [23].
In addition to the steps outlined within these algorithms, we employed two additional modifications: (1) we defined ϵ = 10 10 , and any entries in H less than ϵ were replaced by ϵ (otherwise on some occasions, the W update step would fail); and (2) the minimizations at times yielded worse objective errors than already obtained and, when this happened, we did not update to the worse value.
As noted with other NMF routines, we might not reach a global minimizer [24]. In practice, the minimization should be run repeatedly with different random initializations to find a more ideal local minimum.
From an application standpoint, we wish to run the model on documents it has not been trained on. Algorithm 6 stipulates how a prediction takes place. We first find the best nonnegative decomposition of the document, a vector in R 1 × m , into the topic basis, projecting to r dimensions. With the representation in topic-coordinates, we then use the linear model.   
Algorithm 6: Prediction process.
Input:A matrix H R 0 r × m ,
 a vector θ R ( r + 1 ) × 1 ,
 and a vector x R 1 × m .
Output:Model prediction for response variable, y ^ .
1Compute w = arg min w R 0 1 × r | | w H x | | 2 .
2Compute y ^ = θ 1 + w θ 2 : ( r + 1 ) .
3return  y ^
An implementation of our algorithm can be found on our BitBucket repository, accessed on 21 February 2023.

4. Synthetic Datasets

In our synthetic data, we generate a matrix X that has nonnegative factors W and H, but we add noise. We also generate a response vector Y given as the matrix–vector product W ¯ θ with noise. We investigate four items: (1) that the method does in fact work to decrease the objective function; (2) whether the regression errors decrease with increasing λ ; (3) the effects of overfitting; and (4) the model robustness to noise.

4.1. Generating Synthetic Data

Our synthetic data generation can be summarized as follows:
  • We fix values of n = 100 , m = 40 , M = 20 , and r = 4 .
  • We then define η x = η y = 4 .
  • We pick X R n × r such that each entry is U n i f ( [ 0 , M ) ) . We likewise choose H R r × m .
  • We set X = W H .
  • We pick θ R ( r + 1 ) × 1 such that each element is U n i f ( [ M / 2 , M / 2 ) ) .
  • We set Y = W ¯ θ .
  • We perturb X with noise D X and Y with noise D Y .
  • Any negative X-entries are set to 0.
We consider two different forms for D X and D Y :
  • Being elementwise N ( 0 , η x 2 ) and N ( 0 , η y 2 ) or
  • Being elementwise U n i f ( [ 0 , η x ) ) and U n i f ( [ 0 , η y ) ) .
Note that in the synthetic data, the true number of topics is r = 4 . In testing our synthetic data, we run Algorithm 1 where τ = 10 4 and m a x I t e r = 100 . We use 70 % of the data for training and 30 % for testing.

4.2. Investigation

We confirm that the error in the objective function F ( λ ) decreases with each iteration of Algorithm 1 in Figure 1—done with Gaussian noise.
With the regression error being the mean squared prediction error, from Figure 2, we see the regression error in the training does tend to decrease with λ . (There are a few small exceptions, which we believe stem from randomizations leading to an assortment of different local optima.) The overall scale of the testing errors gets smaller as r goes from 1 to 4 and then stays steady or even gets slightly worse as r increases from 4. Indeed, r = 4 is the “correct” synthetic value. Given the noise as either Gaussian or uniform, the variances of N ( 0 , η y 2 ) , η y 2 , and U n i f ( [ 0 , η y ) ) , η y 2 / 12 , serve as loose estimates for the best possible testing loss (the loss could very well be higher as noise is added to the matrix X as well). When the training errors are smaller than this estimate, it suggests overfitting.
To study robustness with noise, we allow the level Gaussian noise to vary in the problem and evaluate the regression error on testing data. To ensure each noise level starts with the same ground truth, we start with unperturbed X and Y (as per Section 4.1 with η x = η y = 0 ) and compute matrices and vectors with the same size as X and Y with elementwise N ( 0 , 1 ) entries. Then, for each level of noise under investigation, we scale these unit normal distributions by a noise level η { 0 , 4 , 8 , 12 , 16 , 20 } and examine the testing error as the λ varies with r = 3 (rank below true rank), r = 4 (correct rank), and r = 5 (number of chosen topics above correct rank). Figure 3 illustrates the results. The noise is handled well with r = 3 in that a higher λ does tend to improve the testing error; however, at higher noise levels η , it seems suitable minima are harder to find when λ is large. With r = 4 and r = 5 , the testing does not benefit with increasing λ at any noise level.
Taken together, we anticipate that CSSNMF will perform well provided the number of topics chosen does not exceed the true number of topics in the dataset (difficult to assess). We expect that the optimal predictions on unseen data should occur at a λ large enough that the testing errors have decreased and plateaued. In Figure 2, we see that for large λ , when overfitting is an issue, the testing performance is seldom better than where λ = 0 (classical NMF and then regression) and, in fact, is often much worse.

5. Rate My Professors Dataset

Here we study our method on real data [15] coming from Rate My Professors where for each instructor in the dataset, all corresponding student comments are combined to generate a written narrative and we have the instructor’s rating (mean of all responses).

5.1. Pre-Processing

The corpus was first processed via term frequency–inverse document frequency (TF-IDF) [25] with the TfidfVectorizer class in Python’s scikit learn package [26]. We used arguments min_df=0.01, max_df=0.15, stop_words=’english’, norm=‘l1’, lowercase=True. We found the ratings were not balanced: there were 57 on the interval [1, 2), 235 on the interval [ 2 , 3 ) , 494 on the interval [3, 4), and 629 on the interval [4, 5]. To balance the dataset, we extracted only a random subset of 57 reviews in each interval (all ratings on [ 1 , 2 ) were used). Overall, we obtained a corpus matrix X that was 228 × 1635 . The open right-end of the intervals ensures data are not duplicated.

5.2. Choice of Topic Number and Regression Weight

We did not know the true number of topics in the dataset and chose topics of r = 1 , 3 , 5 , 7 , 9 , and 11, with λ { 0 } { 10 2 i / 3 | i [ 12 , 0 ] Z } . We present the results for 11 topics that gave the best results. See Figure 4. We note that, for large enough λ , the testing error outperforms the testing error for λ = 0 . The optimal point was at λ = 10 2 / 3 0.215 .
We comment that it is generally difficult to know precisely where the testing error will be minimized, only that, based on observations of the synthetic data, the testing error is often better than the λ = 0 case after the training error has dropped. We speculate that the level of noise in this dataset results in the testing errors not dropping below 0.75 .

5.3. Prediction

We examine the rating prediction by plotting histograms of predicted ratings where the true ratings were in [ 1 , 2 ] , [ 2 , 3 ] , [ 3 , 4 ] , and [ 4 , 5 ] —the closed intervals are used here. Figure 5 depicts these histograms along with the mean predicted rating and true rating. The predictions are often within range, and the mean predicted values are very close to the true means over each interval. We can also see the general predictive strength in the scatterplot of actual vs. predicted ratings in Figure 6.
These results suggest the model is able to identify topics and associated θ -weights so as to generate predictions that are consistent with true ratings. For example, in the case where ratings are in [ 1 , 2 ] , we see the peak of the predictions is around 2, not exceeding 4, with some predictions as low as 2 ; then, in the case of ratings in [ 4 , 5 ] , the model peaks around 3.5 and makes some predictions above 7. There is a clear capacity for the topics to shift the predictions.

5.4. Topics Identified

It is important that the method not only have predictive power, but also produce interpretable topics. We now look at the 11 topics with their associated θ -weights. We find
θ = ( 2.39909812 , 2.82948873 , 2.21028471 , 1.83876976 , 4.77504984 ,     3.86467795 , 3.46353642 , 0.03914383 , 3.26619842 , 5.51595505 ,     4.15317532 , 3.90733652 ) T .
Note that θ 1 2.4 suggests that for a set of reviews with no topics, the average rating would be around 2.4 —this suggests it is the presence of positive/negative topics that raise/lower the rating.
In Figure 7 and Figure 8, we plot the words in the topics associated with positive and negative ratings. The topics are interpretable. For positive topics, we find Topic 10 (suggestive of extra credit), Topic 6 (suggestive of being inspiring), Topic 8 (suggestive of being approachable/brilliant), and Topic 11 (suggestive of being nice/enjoyable class) and words such as “recommend” in a couple of them. A few words seem out of place, such as “hate” in Topic 11, but that can be explained by some positive reviews having phrases such as “i hated chemistry in high school and after taking her class i don t [sic] hate chem as much.” Among the negative topics, we see Topic 4 (being horrible) and Topic 5 (being unfair).
As a whole, the topics are consistent with intuitive notions of what would be associated with higher or lower ratings. It is also interesting to look at the θ -topic weights quantitatively. For example, Topic 2 (mentioning rants and sarcasm) and Topic 4 (suggestive of being harder and failing students) contribute negatively to the score, but being a harder teacher seems to contribute more negatively to the rating than ranting. To elaborate more: because the row sums of the corpus matrix X and those of the topic matrix H sum to 1, we have that the sum of the topic weights for each document are approximately 1 as per Remark 1. All topics exist on the “same scale” within a document (roughly in [ 0 , 1 ] ) and must roughly sum to 1; therefore, the sizes of the weights in θ for each topic can be ordered by their positive/negative contributions.
The fact that instructor difficulty has a negative effect on rating and the easiness has a positive effect has been found in another study [27] that looked at Rate My Professors data on instructor easiness and overall quality ratings. It is also interesting that many of the “dimensions” detected through our study, such as being approachable/nice (niceness), being hard/easy (difficulty), and being inspiring, were dimensions naturally identified by other scholars [28] who analyzed Rate My Professors data by hand through reading comments and classifying key phrases within the comments. In this latter study, however, each dimension could be positive or negative, depending on how it was used.

6. Conclusions and Future Work

We have developed CSSNMF as a means to combine NMF with regression on a continuous response variable. We accomplished this by minimizing an objective function that combines an NMF error with a weighted regression error. We have shown that the regression error in training is weakly decreasing with the regression error weight and that, in practical applications, the error strictly decreases. The topics identified can outperform the quantitative accuracy of topics formed through NMF alone while retaining a high degree of interpretability (as found with real data). The method is robust to noise and tends to perform best on new data when there is a dimensionality reduction, i.e., when the number of topics fit for is fewer than the true number of topics.
We expect that our methods can be applied to very large datasets as our algorithmic steps involve hierarchical alternating nonnegative least squares [16], which could even be handled in parallel [29] (with or without parallel nonnegative least squares), and a least squares problem (Equation (12)) with a small r × r matrix to invert.
Although our analysis focused on the case of linear regression, incorporating nonlinearities would be of interest. We also noted the challenge in choosing the appropriate λ given only training data. A more theoretical understanding of when testing errors drop substantially could be explored, but this may be dataset-specific.

Author Contributions

Conceptualization, M.R.L. and D.N.; methodology, M.R.L.; software, M.R.L.; validation, M.R.L.; formal analysis, M.R.L.; investigation, X.D., F.L. and A.S.; data curation, M.R.L. and D.N.; writing—original draft preparation, M.R.L.; writing—review and editing, X.D., F.L., A.S. and D.N.; visualization, M.R.L.; supervision, M.R.L.; project administration, M.R.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used have been cited and are publicly available. Our code is available at https://bitbucket.org/3k1m/cssnmf/src/master/, accessed on 21 February 2023.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Y.; Zhang, H.; Liu, R.; Ye, Z.; Lin, J. Experimental explorations on short text topic mining between LDA and NMF based Schemes. Knowl.-Based Syst. 2019, 163, 1–13. [Google Scholar] [CrossRef]
  2. Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
  3. Lao, H.; Zhang, X. Regression and Classification of Alzheimer’s Disease Diagnosis Using NMF-TDNet Features From 3D Brain MR Image. IEEE J. Biomed. Health Inform. 2021, 26, 1103–1115. [Google Scholar] [CrossRef] [PubMed]
  4. Lai, Y.; Hayashida, M.; Akutsu, T. Survival analysis by penalized regression and matrix factorization. Sci. World J. 2013, 2013. [Google Scholar] [CrossRef] [Green Version]
  5. Stewart, G.W. On the early history of the singular value decomposition. SIAM Rev. 1993, 35, 551–566. [Google Scholar] [CrossRef] [Green Version]
  6. Shahnaz, F.; Berry, M.W.; Pauca, V.P.; Plemmons, R.J. Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 2006, 42, 373–386. [Google Scholar] [CrossRef]
  7. Joyce, J.M. Kullback-leibler divergence. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 720–722. [Google Scholar]
  8. Marler, R.T.; Arora, J.S. The weighted sum method for multi-objective optimization: New insights. Struct. Multidiscip. Optim. 2010, 41, 853–862. [Google Scholar] [CrossRef]
  9. Freijeiro-González, L.; Febrero-Bande, M.; González-Manteiga, W. A critical review of LASSO and its derivatives for variable selection under dependence among covariates. Int. Stat. Rev. 2022, 90, 118–145. [Google Scholar] [CrossRef]
  10. Austin, W.; Anderson, D.; Ghosh, J. Fully supervised non-negative matrix factorization for feature extraction. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 5772–5775. [Google Scholar]
  11. Zhu, W.; Yan, Y. Joint linear regression and nonnegative matrix factorization based on self-organized graph for image clustering and classification. IEEE Access 2018, 6, 38820–38834. [Google Scholar] [CrossRef]
  12. Haddock, J.; Kassab, L.; Li, S.; Kryshchenko, A.; Grotheer, R.; Sizikova, E.; Wang, C.; Merkh, T.; Madushani, R.; Ahn, M.; et al. Semi-supervised Nonnegative Matrix Factorization for Document Classification. In Proceedings of the 2021 55th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–3 November 2021; pp. 1355–1360. [Google Scholar]
  13. Li, P.; Tseng, C.; Zheng, Y.; Chew, J.A.; Huang, L.; Jarman, B.; Needell, D. Guided Semi-Supervised Non-negative Matrix Factorization on Legal Documents. Algorithms 2022, 15, 136. [Google Scholar] [CrossRef]
  14. Rate My Professors. Available online: https://www.ratemyprofessors.com/ (accessed on 17 February 2023).
  15. He, J. Big Data Set from RateMyProfessor.com for Professors’ Teaching Evaluation. 2020. Available online: https://doi.org/10.17632/fvtfjyvw7d.2 (accessed on 21 February 2023).
  16. Kim, H.; Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 2008, 30, 713–730. [Google Scholar] [CrossRef] [Green Version]
  17. Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13. [Google Scholar]
  18. Klein, C.A.; Huang, C.H. Review of pseudoinverse control for use with kinematically redundant manipulators. IEEE Trans. Syst. Man Cybern. 1983, 2, 245–250. [Google Scholar] [CrossRef]
  19. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  20. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [Google Scholar] [CrossRef] [Green Version]
  21. scipy.optimize.nnls. Available online: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.nnls.html (accessed on 17 February 2023).
  22. Bro, R.; De Jong, S. A fast non-negativity-constrained least squares algorithm. J. Chemom. J. Chemom. Soc. 1997, 11, 393–401. [Google Scholar] [CrossRef]
  23. Luo, Y.; Duraiswami, R. Efficient parallel nonnegative least squares on multicore architectures. SIAM J. Sci. Comput. 2011, 33, 2848–2863. [Google Scholar] [CrossRef]
  24. Berry, M.W.; Browne, M.; Langville, A.N.; Pauca, V.P.; Plemmons, R.J. Algorithms and applications for approximate nonnegative matrix factorization. Comput. Stat. Data Anal. 2007, 52, 155–173. [Google Scholar] [CrossRef] [Green Version]
  25. Joachims, T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Technical Report, Carnegie-Mellon Univ Pittsburgh pa Dept of Computer Science. 1996. Available online: https://apps.dtic.mil/sti/citations/ADA307731 (accessed on 21 February 2023).
  26. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  27. Bleske-Rechek, A.; Fritsch, A. Student Consensus on RateMyProfessors Com. Pract. Assess. Res. Eval. 2011, 16, 18. [Google Scholar]
  28. Hartman, K.B.; Hunt, J.B. What ratemyprofessors. com reveals about how and why students evaluate their professors: A glimpse into the student mind-set. Mark. Educ. Rev. 2013, 23, 151–162. [Google Scholar]
  29. Moon, G.E.; Ellis, J.A.; Sukumaran-Rajam, A.; Parthasarathy, S.; Sadayappan, P. ALO-NMF: Accelerated locality-optimized non-negative matrix factorization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 1758–1767. [Google Scholar]
Figure 1. Illustration of decreasing objective function at r = 3 topics for λ { 0 } { 10 i / 2 | i Z [ 2 , 2 ] }.
Figure 1. Illustration of decreasing objective function at r = 3 topics for λ { 0 } { 10 i / 2 | i Z [ 2 , 2 ] }.
Algorithms 16 00187 g001
Figure 2. Regression errors with varying regression weight λ with different numbers of topics r for Gaussian noise (ag) and uniform noise (hn). The λ values used are the set { 0 } { 10 i / 2 | i Z [ 8 , 8 ] } . For each λ and r, fifty trials were run and the regression errors corresponding to the best overall objective function F ( λ ) were recorded. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed horizontal line is the estimated minimal mean regression error. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Figure 2. Regression errors with varying regression weight λ with different numbers of topics r for Gaussian noise (ag) and uniform noise (hn). The λ values used are the set { 0 } { 10 i / 2 | i Z [ 8 , 8 ] } . For each λ and r, fifty trials were run and the regression errors corresponding to the best overall objective function F ( λ ) were recorded. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed horizontal line is the estimated minimal mean regression error. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Algorithms 16 00187 g002
Figure 3. Regression errors at various noise levels η with varying regression weight λ . The true number of topics is r = 4 . The top row, (ac), depicts the training errors, and the bottom row, (df), depicts the testing errors. The first column, (a,d), is for a low rank approximation r = 3 , the second column, (b,e), is for an approximation of correct rank r = 4 , and the third column, (c,f), is for when the number of topics used r = 5 is larger than the true number of topics. For each λ , η , and r, fifty trials were run, and the regression errors corresponding to the best overall objective function F ( λ ) were recorded. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Figure 3. Regression errors at various noise levels η with varying regression weight λ . The true number of topics is r = 4 . The top row, (ac), depicts the training errors, and the bottom row, (df), depicts the testing errors. The first column, (a,d), is for a low rank approximation r = 3 , the second column, (b,e), is for an approximation of correct rank r = 4 , and the third column, (c,f), is for when the number of topics used r = 5 is larger than the true number of topics. For each λ , η , and r, fifty trials were run, and the regression errors corresponding to the best overall objective function F ( λ ) were recorded. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Algorithms 16 00187 g003
Figure 4. Errors in training and validation on Rate My Professor dataset with r = 11 topics. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Figure 4. Errors in training and validation on Rate My Professor dataset with r = 11 topics. Points with λ > 0 for which the regression error exceeds 1.5 times the regression error at λ = 0 are not displayed. The dashed vertical line is the transition point between a linear and logarithmic x-scale.
Algorithms 16 00187 g004
Figure 5. Histograms of the predicted mean rating for various ranges of true ratings ([1, 2] in (a), [2, 3] in (b), [3, 4] in (c), and [4, 5] in (d)). The vertical dashed lines represent the mean values. The predicted and true means are as follows: 2.206 and 1.543 for ratings in [1, 2], 3.233 and 2.529 for ratings in [2, 3], 3.594 and 3.593 for ratings in [3, 4] (the lines are indistinguishable), and 4.576 and 4.494 for ratings in [4, 5].
Figure 5. Histograms of the predicted mean rating for various ranges of true ratings ([1, 2] in (a), [2, 3] in (b), [3, 4] in (c), and [4, 5] in (d)). The vertical dashed lines represent the mean values. The predicted and true means are as follows: 2.206 and 1.543 for ratings in [1, 2], 3.233 and 2.529 for ratings in [2, 3], 3.594 and 3.593 for ratings in [3, 4] (the lines are indistinguishable), and 4.576 and 4.494 for ratings in [4, 5].
Algorithms 16 00187 g005
Figure 6. Scatterplot of Rate My Professor ratings: actual mean rating vs. predicted mean rating.
Figure 6. Scatterplot of Rate My Professor ratings: actual mean rating vs. predicted mean rating.
Algorithms 16 00187 g006
Figure 7. Topics with positive θ -weights. The θ -weight is given as the topic weight. The strength of each word is given numerically beside each of the top 10 words.
Figure 7. Topics with positive θ -weights. The θ -weight is given as the topic weight. The strength of each word is given numerically beside each of the top 10 words.
Algorithms 16 00187 g007
Figure 8. Topics with negative θ weights. The θ weight is given as the topic weight. The strength of each word is given numerically beside each of the top 10 words.
Figure 8. Topics with negative θ weights. The θ weight is given as the topic weight. The strength of each word is given numerically beside each of the top 10 words.
Algorithms 16 00187 g008
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lindstrom, M.R.; Ding, X.; Liu, F.; Somayajula, A.; Needell, D. Continuous Semi-Supervised Nonnegative Matrix Factorization. Algorithms 2023, 16, 187. https://doi.org/10.3390/a16040187

AMA Style

Lindstrom MR, Ding X, Liu F, Somayajula A, Needell D. Continuous Semi-Supervised Nonnegative Matrix Factorization. Algorithms. 2023; 16(4):187. https://doi.org/10.3390/a16040187

Chicago/Turabian Style

Lindstrom, Michael R., Xiaofu Ding, Feng Liu, Anand Somayajula, and Deanna Needell. 2023. "Continuous Semi-Supervised Nonnegative Matrix Factorization" Algorithms 16, no. 4: 187. https://doi.org/10.3390/a16040187

APA Style

Lindstrom, M. R., Ding, X., Liu, F., Somayajula, A., & Needell, D. (2023). Continuous Semi-Supervised Nonnegative Matrix Factorization. Algorithms, 16(4), 187. https://doi.org/10.3390/a16040187

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop