1 Introduction

The aim of dimensionality reduction (DR) is to obtain lower-dimensional representations of high-dimensional input data while preserving, under a pre-established criterion, the structure of the data as well as possible. Reaching this aim can improve both the performance of a pattern recognition system and the intelligibility of the data representation [1]. Traditionally, DR methods are designed by following pre-established optimization criteria and design parameters, but they mostly lack properties such as interactivity and controllability, which are important characteristics in the field of Information Visualization (InfoVis) [2]. InfoVis provides interfaces and graphical ways of representing data that make the available information more usable and intelligible for the user. However, it turns out that DR outcomes can be enhanced by taking advantage of some properties of InfoVis methods [3, 4]. Following this premise, some approaches have been proposed [5, 6] that make use of interactivity through equalizer-bar-like interfaces or geometric interaction models. In general, such approaches implement interesting interactive models, but their final visualization lacks information about the structure of the data in the original input space, at least in an easy-to-understand and/or visual way.

In this work, we introduce a new visualization approach using an interactive mixture of data representations resulting from DR methods. After applying the DR methods to the input data, a set of lower-dimensional representation spaces is obtained, and the mixture is performed via a weighted sum. In order to give users a sense of the structure of the data, we implement a data-driven visualization in addition to the conventional scatter plot. Such a visualization captures the structure of the input data by using a similarity matrix (also known as an affinity matrix in graph theory), which holds the degree of similarity or affinity between every pair of data points. The visualization consists of plotting lines (edges) between data points exhibiting the highest similarity values. Additionally, to provide a greater sense of interactivity, the user can control the number of edges by varying a parameter, which works as a slider bar within an interface. By design, a Gaussian affinity is selected so that the structure of local neighboring points can be taken into account. The low-dimensional spaces are obtained by state-of-the-art methods, namely: Classical Multidimensional Scaling (CMDS) [2], Laplacian Eigenmaps (LE) [7], Locally Linear Embedding (LLE) [8], Stochastic Neighbor Embedding (SNE), and t-Student-distributed SNE (t-SNE) [1, 7]. To perform the mixture, the user can set the weighting factors by picking values from an equalizer-bar-like interface. To test our visualization approach, we use a 3D artificial spherical shell data set. The quality of the resulting representation spaces is quantified by a scaled version of the average agreement rate between K-ary neighborhoods [9]. The proposed mixture can reproduce every single dimensionality reduction approach, and it helps users find a suitable representation of the input data within a visual and user-friendly interface.

The remainder of the paper is organized as follows: In Sect. 2, data visualization via dimensionality reduction is outlined. Section 3 introduces the proposed interactive data visualization scheme. The experimental setup and results are presented in Sects. 4 and 5, respectively. Finally, Sect. 6 gathers some final remarks as conclusions and future work.

2 Data Visualization via Dimensionality Reduction

Perhaps one of the most intuitive ways of visualizing numerical data is through a 2- or 3-dimensional representation of the original data, which can be readily represented using a scatter plot. Correspondingly, DR aims at reaching a low-dimensional data representation upon which the classification task performance is improved in terms of accuracy and the intrinsic nature of the data is properly represented [10]. So, when performing a DR method, a more realistic and intelligible visualization for the user is expected [1]. More technically, the goal of dimensionality reduction is to embed a high-dimensional data matrix \({\varvec{Y}}=[{\varvec{y}}_{i}]_{1\le i \le N}\), such that \({\varvec{y}}_{i} \in \mathbb {R}^{D}\), into a low-dimensional, latent data matrix \({\varvec{X}}=[{\varvec{x}}_{i}]_{1\le i \le N}\), with \({\varvec{x}}_{i} \in \mathbb {R}^{d}\), where \(d<D\) [1, 11]. Figure 1 depicts an instance where a manifold, the so-called 3D spherical shell, is embedded into a 2D representation, which resembles an unfolded version of the original manifold.

Fig. 1. Effect of dimensionality reduction on an artificial (3-dimensional) spherical shell manifold. The resulting embedded (2-dimensional) data is an attempt to unfold the original data.

3 Interactive Data Visualization Scheme

The proposed visualization approach, here called DataVisSim, involves three main stages: mixture of DR outcomes, interaction, and visualization, as depicted in the block diagram of Fig. 2. One of the most important contributions of this work is that information on the structure of the high-dimensional input space is added to the final visual representation by using a pairwise-similarity-based scheme.

Fig. 2. Block diagram of the proposed interactive data visualization using dimensionality reduction and similarity-based representations (DataVisSim). Roughly speaking, it works as follows: first, a mixture of the resulting lower-dimensional representation spaces is performed by taking advantage of conventional implementations of traditional DR methods. The interaction is provided through an interface that enables the user to dynamically input the weighting factors for the aforementioned mixture. For visualization, a novel similarity-based approach is used.

3.1 Mixture

Let us suppose that the input matrix \({\varvec{Y}}\) is reduced by using M different DR methods, yielding a set of lower-dimensional representations: \(\{{\varvec{X}}^{(1)},\cdots ,{\varvec{X}}^{(M)}\}\). Herein, we propose to perform a weighted sum of the form:

$$\begin{aligned} \bar{{\varvec{X}}}= \sum _{m=1}^{M}\alpha _{m}{\varvec{X}}^{(m)}, \end{aligned}$$
(1)

where \(\{ \alpha _{1},\cdots ,\alpha _{M} \}\) are the weighting factors. To make the selection of weighting factors intuitive, we use probability values so that \(0 \le \alpha _{m} \le 1\) and \(\sum _{m=1}^{M}\alpha _{m}=1\). Therefore, all matrices \({\varvec{X}}^{(m)}\) should be normalized to lie within a hypersphere of unit radius.
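The weighted sum of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the function names `normalize_to_unit_ball` and `mix_embeddings` are hypothetical, and the normalization (centering each embedding and dividing by its largest point norm) is one assumed way of making every representation lie within a unit-radius hypersphere.

```python
import numpy as np


def normalize_to_unit_ball(X):
    """Center an embedding and rescale it so every point has norm <= 1.

    Assumption: this particular normalization is not prescribed by the text.
    """
    Xc = X - X.mean(axis=0)
    radius = np.linalg.norm(Xc, axis=1).max()
    return Xc / radius if radius > 0 else Xc


def mix_embeddings(embeddings, alphas):
    """Weighted sum of M embeddings (each N x d) with probability-like weights."""
    alphas = np.asarray(alphas, dtype=float)
    if len(embeddings) != len(alphas):
        raise ValueError("one weighting factor per embedding is required")
    if np.any(alphas < 0) or not np.isclose(alphas.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # Convex combination of the normalized embeddings, as in Eq. (1).
    return sum(a * normalize_to_unit_ball(X) for a, X in zip(alphas, embeddings))
```

Because the weights form a convex combination, the mixed points also stay within the unit hypersphere, and setting one weight to 1 recovers the corresponding single DR outcome.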

3.2 Interaction Model

For the sake of interactivity, the values of every \(\alpha _{m}\), required to calculate \(\bar{{\varvec{X}}}\) according to Eq. (1), are to be defined by the users through an equalizer bar available in the interface. Within a user-friendly and intuitive environment, weighting factors can be readily entered by just picking values from the bars. In order to provide quick views of the resulting representation space, as soon as one value is picked, the remaining ones are automatically completed following a uniform probability density function. The same is done in case more than one value is selected.
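One possible reading of this automatic completion is that the probability mass not yet assigned by the user is split equally among the untouched bars. The sketch below illustrates that reading only; the function name `complete_weights` and the dictionary-based input are hypothetical and are not taken from the interface itself.

```python
def complete_weights(selected, M):
    """Fill in the weights the user has not set.

    selected: dict mapping a method index (0..M-1) to the value picked by the user.
    Assumption: the remaining mass 1 - sum(selected) is shared uniformly
    among the unselected methods.
    """
    total = sum(selected.values())
    if any(v < 0 for v in selected.values()) or total > 1:
        raise ValueError("picked values must be non-negative and sum to at most 1")
    free = [m for m in range(M) if m not in selected]
    alphas = [0.0] * M
    for m, value in selected.items():
        alphas[m] = value
    if free:
        share = (1.0 - total) / len(free)
        for m in free:
            alphas[m] = share
    return alphas
```

For instance, with M = 5 methods and a single bar set to 0.6, each of the four remaining weights would receive 0.1 under this assumption.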

3.3 Similarity-Based Visualization

The most widely used method to visualize 2- or 3-dimensional data is the scatter plot. In this work, we introduce a similarity-based visualization approach with the aim of adding a visual hint about the structure of the high-dimensional input data matrix \({\varvec{Y}}\) to the scatter plot of its representation in a lower-dimensional space. To do so, we use a pairwise similarity matrix \({\varvec{S}} \in \mathbb {R}^{N\times N}\), such that \({\varvec{S}}=[{\varvec{s}}_{ij}]\). In terms of graph theory, the entry \({\varvec{s}}_{ij}\) defines the similarity or affinity between the i-th and j-th data points from \({\varvec{Y}}\). In this way, we can hold the structure of the original input space in a topological fashion, specifically in terms of pairwise relationships. For visualization purposes, such a similarity is used to graphically define the relationship between data points by plotting edges. In order to control the number of edges and make the visual representation appealing, only pairs satisfying \({\varvec{s}}_{ij}>{\varvec{s}}_{max}\) are connected, where \({\varvec{s}}_{max}\) is an admissible similarity value (acting as a threshold) that is also given by the users. In other words, our visualization approach consists of building a graph with constrained affinity values.
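The edge selection just described can be illustrated as follows. This is a minimal sketch that assumes a precomputed symmetric similarity matrix `S`; the function name `similarity_edges` is hypothetical, and only the upper triangle is scanned so that each undirected edge is reported once.

```python
import numpy as np


def similarity_edges(S, s_max):
    """Return index pairs (i, j), i < j, whose similarity exceeds s_max."""
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    # Boolean mask of the strict upper triangle: skips self-pairs and duplicates.
    upper = np.triu(np.ones((N, N), dtype=bool), k=1)
    rows, cols = np.nonzero(upper & (S > s_max))
    return list(zip(rows.tolist(), cols.tolist()))
```

Each returned pair can then be drawn as a line segment between the corresponding rows of the low-dimensional representation; raising `s_max` leaves fewer edges, which is the effect a slider would control.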

4 Experimental Setup

Database: In order to visually evaluate the performance of the DataVisSim approach, we use an artificial spherical shell (N = 1500 data points and D = 3), as depicted in Fig. 1.

Parameter Settings and Methods: In order to capture the local structure for visualization, i.e. data points being neighbors, we utilize the Gaussian similarity given by: \({\varvec{s}}_{ij}=\exp (-0.5||{\varvec{y}}_{i}-{\varvec{y}}_{j}||^{2}/\sigma ^{2})\). The parameter \(\sigma \) is a bandwidth value set as 0.1, being 10% of the hypersphere radius (applicable once matrices are normalized as discussed in Sect. 3.1). To perform the dimensionality reduction we consider \(M = 5\) DR methods, namely: CMDS, LE, LLE, SNE, and t-SNE. All of them are intended to obtain spaces of dimension \(d=2\).
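As an illustration, a Gaussian similarity matrix with bandwidth 0.1 could be computed as in the sketch below. It takes the kernel as the positive-valued Gaussian \(\exp (-0.5||{\varvec{y}}_{i}-{\varvec{y}}_{j}||^{2}/\sigma ^{2})\) and assumes the rows of `Y` have already been normalized; the function name `gaussian_similarity` is hypothetical.

```python
import numpy as np


def gaussian_similarity(Y, sigma=0.1):
    """Pairwise Gaussian similarity between the rows of Y (N x D)."""
    Y = np.asarray(Y, dtype=float)
    sq_norms = np.sum(Y ** 2, axis=1)
    # Squared Euclidean distances; clip tiny negatives caused by round-off.
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * Y @ Y.T, 0.0)
    return np.exp(-0.5 * d2 / sigma ** 2)
```

With such a small bandwidth the similarity decays quickly: two points at distance 0.1 have similarity exp(-0.5), roughly 0.61, while points farther apart than a few multiples of the bandwidth are close to zero, so only local neighbors end up connected.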

Performance Measure: To quantify the performance of the studied methods, the scaled version of the average agreement rate \(R_{NX}(K)\) introduced in [9] is used, which ranges within the interval [0, 1]. Since \(R_{NX}(K)\) is calculated at each neighborhood size K from 2 to \(N-1\), a numerical indicator of the overall performance can be obtained by calculating its area under the curve (AUC). The AUC assesses the dimension reduction quality at all scales, with the most appropriate weights.
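For reference, the sketch below computes a quality curve of this kind. It assumes the commonly used definitions, which are not spelled out in this text: \(Q_{NX}(K)\) is the average fraction of K nearest neighbors shared by the high- and low-dimensional spaces, \(R_{NX}(K)=((N-1)Q_{NX}(K)-K)/(N-1-K)\), and the AUC weights each K by 1/K. The function names are hypothetical, K runs from 1 to \(N-2\) in this sketch so that the denominator never vanishes, and the loop is written for clarity rather than speed.

```python
import numpy as np


def neighbor_order(Z):
    """For each row of Z, indices of all other points sorted by distance."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    np.fill_diagonal(d2, -1.0)           # force each point to sort first
    return np.argsort(d2, axis=1)[:, 1:]  # drop the point itself


def rnx_curve(Y, X):
    """R_NX(K) for K = 1..N-2, comparing neighborhoods in Y and X."""
    N = Y.shape[0]
    order_hd, order_ld = neighbor_order(Y), neighbor_order(X)
    ks = np.arange(1, N - 1)
    r = np.empty(len(ks))
    for idx, K in enumerate(ks):
        shared = sum(
            len(np.intersect1d(order_hd[i, :K], order_ld[i, :K]))
            for i in range(N)
        )
        q = shared / (K * N)                        # average agreement rate
        r[idx] = ((N - 1) * q - K) / (N - 1 - K)    # rescaled version
    return ks, r


def rnx_auc(ks, r):
    """Area under the R_NX curve with 1/K weights (logarithmic K axis)."""
    return float(np.sum(r / ks) / np.sum(1.0 / ks))
```

An embedding that preserves every neighborhood yields a curve equal to 1 at all K, and hence an AUC of 1, whereas lower values indicate that fewer neighbors are retained.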

Fig. 3. Effects of the considered DR methods on the 3D sphere. The results are the embedded data represented in a two-dimensional space.

5 Results and Discussion

Figure 3 shows the scatter plots of the resulting low-dimensional spaces obtained by the considered dimensionality reduction methods, as well as the performed mixture. Quality curves and the corresponding scatter plots of each mixture are shown in Figs. 4, 5, 6, 7 and 8. As seen, the \(R_{NX}(K)\) measure allows for assessing both the different mixtures and the DR methods independently. Since the area under its curve represents a quality measure of the low-dimensional space, it is in turn a visual and intuitive indicator that helps the user to find the best option, either a single DR method or a proper mixture.

Fig. 4. (a) Performance of mixture 1 and all considered DR methods. (b) Embedded data resulting from mixture 1.

Fig. 5. (a) Performance of mixture 2 and all considered DR methods. (b) Embedded data resulting from mixture 2.

Fig. 6. (a) Performance of mixture 3 and all considered DR methods. (b) Embedded data resulting from mixture 3.

Fig. 7. (a) Performance of mixture 4 and all considered DR methods. (b) Embedded data resulting from mixture 4.

Fig. 8. Performance of all selected mixtures.

Fig. 9. View of the DataVisSim interface implemented in Processing (https://sites.google.com/site/intelligentsystemsrg/home/gallery).

To test the DataVisSim approach, we implement an interface in Processing, a software environment that makes it easy to code visual arts and is therefore appealing for creating visual analytics interfaces. Figure 9 shows a view of the implemented interface. It is designed for ease of handling, so that (even non-expert) users may interact with DR methods and their feasible combinations in an intuitive manner using equalizer-like bars. This is possible because the resulting data representations are properly set according to human perception. The interface also incorporates a slider bar to dynamically draw the edges between nodes. This is useful for visual analysis, given that it allows relating the structure of the high-dimensional (original) data to the visualization of the low-dimensional representation space. Therefore, a powerful tool is provided for deciding on the most suitable representation of the original data, in other words, the most proper DR methods.

6 Conclusions and Future Work

This work presents a new interactive data visualization approach based on a mixture of the outcomes of dimensionality reduction (DR) methods. The core of this approach consists of plotting lines (edges) between data points exhibiting the highest similarity values, using a similarity matrix that measures the degree of similarity or affinity between every pair of data points and thus captures the structure of the input data. Such a visualization of the topology is represented by a data-driven graph in addition to the conventional scatter plot, which provides the user with a greater sense of interactivity when selecting and/or combining DR methods, while also providing information about the structure of the original data. Correspondingly, data points represent the nodes and an affinity matrix holds the pairwise edge weights. As future work, other dimensionality reduction methods are to be integrated into the data-driven graph, so that a good trade-off between preservation of data structure and intelligible data visualization can be reached. More mathematical properties will be explored to design data-driven schemes that best approximate the topology of the data.