Abstract
Uncertainty quantification has become increasingly critical for large language models (LLMs), particularly in high-risk applications requiring reliable outputs. However, traditional methods for uncertainty quantification, such as probabilistic models and ensemble techniques, face challenges when applied to the complex, high-dimensional nature of LLM-generated outputs. This study proposes a novel geometric approach to uncertainty quantification based on convex hull analysis. The proposed method leverages the spatial properties of response embeddings to measure the dispersion and variability of model outputs. Prompts are categorized into three types, i.e., ’easy’, ’moderate’, and ’confusing’, and multiple responses are generated using different LLMs at varying temperature settings. The responses are transformed into high-dimensional embeddings via a BERT model and subsequently projected into a two-dimensional space using Principal Component Analysis (PCA), Isomap, or Multidimensional Scaling (MDS). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is then used to cluster the embeddings and compute the convex hull for each selected cluster. The experimental results indicate that the uncertainty of LLM responses depends on the prompt complexity, the model, and the temperature setting.
1 Introduction
In recent years, with advanced computing technologies, the field of Natural Language Processing (NLP) has improved significantly, leading to the development of sophisticated large language models (LLMs), such as GPT-3.5/4, Gemini, LLama, and others [1, 2]. These LLMs have increasingly been used as powerful tools for NLP applications and have demonstrated strong performance in generating contextually relevant text, supporting a variety of applications such as chatbots, automated content generation, virtual agents, text categorization, language translation, and many more [3,4,5].
In the literature, the number of studies in NLP and LLMs has increased dramatically since the introduction of the Generative Pre-trained Transformer 3 (GPT-3) by OpenAI in 2020, which was the most sophisticated neural network at that time [6]. It could create enhanced textual, visual, and auditory content without human intervention [7]. However, there are serious concerns regarding the reliability and uncertainty of LLMs’ responses. The authors in ref. [8] highlight the importance of Uncertainty Quantification (UQ) methods in reducing uncertainties in optimization and decision-making processes using traditional methods, such as Bayesian approximation and ensemble learning. The authors in ref. [9] emphasize the role of uncertainty estimation, particularly epistemic uncertainty, for LLMs, and evaluate different techniques, i.e., Laplace Approximation, Dropout, and Epinets. They indicate that uncertainty plays a crucial role in developing LLM-based agents for decision making, and that uncertainty information can significantly contribute to the overall performance of LLMs. However, these traditional approaches have difficulties dealing with the complex and high-dimensional characteristics of texts generated by LLMs. Therefore, novel approaches are needed to effectively capture the uncertainty in LLMs’ responses and enhance their trustworthiness [10]. The authors in ref. [11] conduct an early exploratory study of uncertainty measurement with twelve uncertainty estimation methods to help characterize the prediction risks of LLMs, as well as to investigate the correlations between LLMs’ prediction uncertainty and their performance. Their results indicate that uncertainty estimation has high potential for identifying inaccurate or unreliable predictions generated by LLMs. To quantify the uncertainty of LLMs, the authors in ref. [12] propose two metrics, i.e., Verbalized Uncertainty and Probing Uncertainty. Verbalized uncertainty prompts the LLM to express its confidence in its explanations, whereas probing uncertainty employs sample and model perturbations to quantify the uncertainty. According to the results, probing uncertainty performs better than verbalized uncertainty, with lower uncertainty corresponding to explanations of higher faithfulness. The study [13] presents a framework, called Rank-Calibration, to assess uncertainty and confidence measures for LLMs. The framework also provides a comprehensive robustness analysis and interpretability for LLMs. Another study [14] presents a framework to improve the reliability of LLMs, called Uncertainty-Aware In-context Learning. It involves fine-tuning the LLM using a calibration dataset and evaluates the model’s knowledge by analyzing multiple responses to the same query to determine whether a correct answer is present.
To address this gap, in this paper, a convex hull-based geometric approach is proposed for UQ of LLMs. The main contributions of this study are to (1) introduce a geometric approach to UQ in LLMs’ responses, (2) implement a comprehensive system that processes a wide range of prompts and generates responses using different models and temperature settings, and (3) highlight the relationship between prompt complexity, model settings, and uncertainty.
2 Related work
In recent years, uncertainty quantification (UQ) has been one of the most challenging topics in the AI/ML context, particularly for critical applications using LLMs in healthcare, finance, and law [15,16,17,18,19]. Uncertainty, which refers to the confidence of the model in its predictions, can originate from different sources, such as the model architecture, model parameters, noise, and insufficient information in the dataset(s) [20,21,22,23], in addition to the nature of LLMs. Training data can also be an influential source of uncertainty due to the complexity and diversity of the selected dataset(s). Furthermore, uncertainty should be addressed individually for each domain based on its specific requirements and constraints. For example, data sensitivity, privacy, structure, and ethical considerations differ between healthcare and finance applications. Therefore, it is essential to adapt UQ methods to the decision-making processes of each critical application based on domain-specific factors to mitigate the uncertainty of LLMs. The study [8] provides a comprehensive overview of UQ methods. There are three widely used classes of UQ methods, i.e., confidence-based methods [24], ensemble methods, and Bayesian methods. Confidence-based methods evaluate the reliability of model outputs using entropy, probability, calibration, and ensemble approaches. Ensemble methods employ multiple models to estimate uncertainty [25], while Bayesian approximation methods rely on computational techniques [26], such as Monte Carlo sampling. The authors in ref. [27] propose a method called NIRVANA (uNcertaInty pRediction ValidAtor iN Ai) to investigate the correlation between uncertainty and the performance of a deep learning model. Experimental results show that uncertainty quantification is negatively correlated with the model’s prediction performance. Additionally, adjusting dropout ratios effectively reduces the uncertainty of correct predictions while increasing the uncertainty of incorrect ones. The study [28] provides a comprehensive survey of the mathematical and computational foundations of model order reduction (MOR) techniques for UQ problems with distributed uncertainties. These MOR techniques are applied to both forward and inverse UQ problems, resulting in significant computational reductions in the statistical evaluation of many-query scenarios, as well as in real-time applications for rapid Bayesian inversion. Another study [29] provides an overview of recent advances in the interdisciplinary field that combines data assimilation (DA), uncertainty quantification (UQ), and machine learning (ML) to address critical challenges in high-dimensional dynamical systems, such as system identification, reduced-order surrogate modeling, error covariance specification, and model error correction. It focuses on how existing limitations of DA and UQ can be mitigated by ML methods, and vice versa, and aims to assist ML scientists in improving model accuracy and interoperability, and DA/UQ experts in integrating ML techniques into DA and UQ processes.
This section provides a brief overview of the existing literature on UQ methods applied to NLP and LLMs, along with an analysis of responses generated for a selected confusing prompt.
In the literature, many studies focus on developing uncertainty quantification techniques in the context of NLP and LLMs. The study [30] proposes the use of dropout as a Bayesian approximation to estimate uncertainty in deep learning-based models. This method, known as Monte Carlo dropout, involves applying dropout during both training and inference to generate multiple stochastic forward passes and approximate the posterior distribution of the model’s predictions. Other research explores the use of pre-trained language models, such as BERT [31] and GPT-3 [32], for UQ. Furthermore, most UQ studies have limited capabilities and applications, e.g., they focus on short text generation. To bridge this gap, the authors in ref. [10] point out the limitations of existing methods for long text generation and propose a novel UQ approach, called Luq-Ensemble (an improved version of LUQ), which ensembles responses from multiple models and selects the response with the least uncertainty. Based on the results, the proposed approach outperforms traditional methods and improves the factual accuracy of the responses compared to the best individual LLM. With the growing number of applications using LLMs and rising reliability expectations, novel UQ methods and comprehensive evaluation benchmarks must be developed to understand the uncertainty of LLMs and to help researchers and engineers build more robust and reliable LLMs. Fortunately, several studies propose benchmarks to address this need. The authors in ref. [33] develop a benchmark for LLMs involving UQ with an uncertainty-aware evaluation metric, called UAcc, which accounts for both prediction accuracy and uncertainty. The developed benchmark consists of eight LLMs (LLM series) spanning five representative natural language processing tasks. The results indicate several findings related to LLM uncertainty, i.e., LLMs with higher accuracy may exhibit lower certainty, and instruction fine-tuning may increase the uncertainty of LLMs.
Despite these advancements, existing methods may not fully capture the underlying uncertainties of LLMs for the reasons mentioned earlier. Evaluating the performance of LLMs is essential for the development and deployment of robust models, which makes the need for novel UQ approaches even more critical. One possible direction is a geometric approach to UQ, particularly convex hull analysis, which has received limited attention in the literature. This study builds on the foundation of these existing methods and introduces a novel approach that leverages the geometric properties of response embeddings to measure uncertainty in LLM outputs.
Table 1 presents an analysis of the responses generated for a strategically chosen confusing prompt: “Is a broken clock right if it tells the right time twice a day?”. This prompt, carefully selected for its potential to produce a wide range of responses, is used to evaluate the variability and spread of responses at different temperature settings, i.e., 0.25, 0.5, 0.75, and 1.0, using the convex hull area as a metric.
The table includes three responses: Response 1, Response 2, and Response 3. Response 1 discusses the paradox of a rule that states all rules have exceptions, arguing that such a rule cannot be valid since it would mean that at least one rule does not have an exception. Response 2 expands on the logical inconsistency if the rule itself is an exception, suggesting that some rules must not have exceptions, which contradicts the original assertion. Response 3 is similar to Response 1, addressing the self-contradiction inherent in the statement.
In this case, the convex hull area is measured over the generated responses (Responses 1, 2, and 3) at four different temperature settings (0.25, 0.5, 0.75, and 1.0). These temperatures control the randomness in the response generation process. The convex hull area values are 1.3122 at 0.25, 5.0236 at 0.5, 6.5476 at 0.75, and 5.7562 at 1.0. These values indicate the spread and variability of the responses, which can be interpreted as follows:
- A larger convex hull area suggests greater variability, which, in the context of this study, can be indicative of increased uncertainty or diversity in the generated responses.
- Conversely, the convex hull area can serve as a threshold for limiting or bounding response variability. As seen in the table, the areas are close to each other for the higher temperature settings, i.e., 0.5, 0.75, and 1.0, ranging between 5.0236 and 6.5476.
Table 1 also provides a brief analysis of how the model’s responses change at different temperature settings for confusing prompts. Traditional methods can still evaluate the uncertainty of model outputs; however, they have difficulties when applied to LLMs, and the proposed novel approaches can complement these traditional techniques. To address this requirement, in this study, a novel geometric approach to UQ using convex hull analysis is proposed for quantifying the uncertainty of LLMs’ responses, along with several use cases.
3 System model
This section presents the proposed approach to calculate the uncertainty of the generated responses based on the geometric properties of convex hull areas for each prompt. The system processes a comprehensive set of categorized prompts, meticulously focusing on three distinct types: ’easy’, ’moderate’, and ’confusing’. The responses are transformed into high-dimensional embeddings via a BERT model, subsequently projected into a two-dimensional space using dimensionality reduction techniques such as PCA, Isomap, and MDS, and then clustered by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
The overview of the system is shown in Fig. 1. The figure illustrates the workflow of the model, from the input prompt to the calculation of the convex hull area, providing a measure of uncertainty. The process begins with a given prompt (e.g., “Explain the process of photosynthesis in detail.”) and a specified temperature setting, which are fed into an LLM. The model generates multiple responses to the prompt, each providing a potentially different elaboration on the topic. These responses are then encoded into high-dimensional embeddings using a BERT model. The embeddings, initially in a high-dimensional space, are projected onto a two-dimensional space using Principal Component Analysis (PCA) for easier visualization and clustering. The 2D embeddings are clustered using the DBSCAN algorithm, which identifies groups of similar responses. For each cluster, a convex hull is computed, representing the smallest convex boundary that encloses all points in the cluster. The area of each convex hull is calculated, with the total area serving as a measure of the uncertainty of the model’s responses to the given prompt. The example shown includes a plot of the 2D embeddings with the convex hull of the densest cluster highlighted, illustrating the spatial distribution and clustering of the responses.
Let \({\mathcal {P}} = \{p_1, p_2, \ldots , p_n\}\) denote the set of prompts, which are rigorously categorized into three types: ’easy’, ’moderate’, and ’confusing’. For each prompt \(p \in {\mathcal {P}}\), a diverse set of responses \({\mathcal {R}}(p) = \{r_1, r_2, \ldots , r_m\}\) is generated using a suite of different language models \({\mathcal {M}} = \{m_1, m_2, \ldots , m_k\}\) at varying temperature settings \({\mathcal {T}} = \{t_1, t_2, \ldots , t_l\}\).
The embeddings for each response \(r \in {\mathcal {R}}(p)\) are computed utilizing a pre-trained BERT model, represented mathematically as:

$$ {\textbf{E}}(r) = \text{BERT}(r), $$

where \({\textbf{E}}(r) \in {\mathbb {R}}^d\) denotes the embedding vector in a \(d\)-dimensional space, encapsulating the semantic content of the response in a high-dimensional feature space.
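As an illustration, the embedding step could be implemented as in the following sketch. It assumes the Hugging Face transformers library, the bert-base-uncased checkpoint, and [CLS]-token pooling; the paper itself only specifies that a BERT model produces the \(d\)-dimensional embeddings (with \(d = 768\)).

```python
# Minimal sketch of the embedding step, assuming bert-base-uncased and
# [CLS]-token pooling (both assumptions, not stated in the paper).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def embed_response(response: str) -> torch.Tensor:
    """Map a response string to its 768-dimensional embedding E(r)."""
    inputs = tokenizer(response, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Use the [CLS] token of the last hidden layer as the sentence embedding.
    return outputs.last_hidden_state[:, 0, :].squeeze(0)
```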
Given the set of embeddings \({\textbf{E}}({\mathcal {R}}(p)) = \{{\textbf{E}}(r_1), {\textbf{E}}(r_2), \ldots , {\textbf{E}}(r_m)\}\), one of the selected dimensionality reduction techniques (RD), i.e., PCA, Isomap, or MDS, is applied to reduce the dimensionality of the embedding vectors to 2, enabling effective visualization and clustering:

$$ {\textbf{E}}_\text{RD}(r) = \text{RD}\big({\textbf{E}}(r)\big), $$

where \({\textbf{E}}_\text{RD}(r) \in {\mathbb {R}}^2\). The transformation is achieved by projecting the original high-dimensional embeddings onto a two-dimensional subspace that maximizes the variance, thereby preserving the most significant features of the data.
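A minimal sketch of this projection step is shown below, using scikit-learn implementations of PCA, Isomap, and MDS; the default hyperparameters are illustrative assumptions rather than the settings used in the study.

```python
# Sketch of the 2-D projection step with scikit-learn (illustrative defaults).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, MDS

def reduce_to_2d(embeddings: np.ndarray, method: str = "pca") -> np.ndarray:
    """Project an (m, d) embedding matrix onto a 2-D subspace."""
    if method == "pca":
        reducer = PCA(n_components=2)
    elif method == "isomap":
        reducer = Isomap(n_components=2)
    elif method == "mds":
        reducer = MDS(n_components=2)
    else:
        raise ValueError(f"Unknown reduction method: {method}")
    return reducer.fit_transform(embeddings)
```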
Subsequently, the DBSCAN algorithm is utilized to cluster the reduced embeddings and identify distinct groupings within the response space:

$$ {\textbf{L}} = \text{DBSCAN}\big({\textbf{E}}_\text{RD}({\mathcal {R}}(p)); \epsilon, \text{min\_samples}\big), $$

where \({\textbf{L}}\) represents the set of cluster labels assigned to the embedding points, with \(\epsilon\) controlling the maximum distance between two points for them to be considered neighbors and \(\text{min\_samples}\) specifying the minimum number of points required to form a cluster.
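The clustering step could be realized with scikit-learn's DBSCAN, as in the following sketch; the eps and min_samples values are placeholders, since the paper does not report the exact parameters.

```python
# Sketch of the clustering step; eps and min_samples are illustrative values.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_embeddings(points_2d: np.ndarray,
                       eps: float = 0.5,
                       min_samples: int = 3) -> np.ndarray:
    """Assign a cluster label to each 2-D point; noise points get label -1."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points_2d)
```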
For each identified cluster \(c \in {\textbf{L}}\), excluding noise points (i.e., \(c = -1\)), we compute the convex hull and its corresponding area, which encapsulates the geometric boundary of the cluster:

$$ A_c = \text{Area}\Big(\text{ConvexHull}\big(\{{\textbf{E}}_\text{RD}(r) \mid {\textbf{L}}(r) = c\}\big)\Big). $$

The convex hull is the smallest convex polygon that encloses all the points in the cluster, and its area is a measure of the spatial extent of the cluster.
The total convex hull area for a given prompt \(p\) at temperature \(t\) is then defined as the sum of the areas of all clusters:

$$ A_{\text{total}}(p, t) = \sum_{c \in {\textbf{L}},\ c \ne -1} A_c. $$

This metric provides an aggregate measure of the uncertainty and dispersion of the model’s responses to the prompt, with larger areas indicating higher uncertainty.
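A sketch of the per-cluster hull computation and the aggregate metric, using SciPy's ConvexHull, is given below; note that for two-dimensional inputs the enclosed area is exposed by SciPy as the volume attribute.

```python
# Sketch of the convex hull area computation with SciPy.
import numpy as np
from scipy.spatial import ConvexHull

def total_convex_hull_area(points_2d: np.ndarray, labels: np.ndarray) -> float:
    """Sum convex hull areas over all clusters, skipping noise points (label -1)."""
    total = 0.0
    for c in set(labels.tolist()) - {-1}:
        cluster_points = points_2d[labels == c]
        if len(cluster_points) >= 3:  # a 2-D hull needs at least 3 points
            # For 2-D points, ConvexHull.volume is the enclosed area
            # (ConvexHull.area would be the perimeter).
            total += ConvexHull(cluster_points).volume
    return total
```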
The algorithm for calculating the uncertainty using convex hull areas is outlined in Algorithm 1. This algorithm takes as input a prompt, a temperature setting, and a model, computes embeddings, reduces their dimensionality, performs clustering, and finally computes and returns the convex hull areas.
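Putting the pieces together, Algorithm 1 could be sketched as the following function, which reuses the helper functions from the earlier sketches; generate_responses is a hypothetical placeholder for querying the chosen LLM at the given temperature and is not part of the paper's published implementation.

```python
# Compact sketch of Algorithm 1, built from the helper functions above.
import numpy as np

def uncertainty_for_prompt(prompt: str, temperature: float, model: str,
                           n_responses: int = 30, method: str = "pca") -> float:
    """Compute the total convex hull area for one prompt/temperature/model triple."""
    # Hypothetical helper: query `model` for n_responses completions of `prompt`
    # at the given temperature (placeholder for the actual LLM API call).
    responses = generate_responses(model, prompt, temperature, n_responses)
    embeddings = np.stack([embed_response(r).numpy() for r in responses])
    points_2d = reduce_to_2d(embeddings, method=method)
    labels = cluster_embeddings(points_2d)
    return total_convex_hull_area(points_2d, labels)
```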
In addition, there are concerns regarding the computational cost of calculating convex hulls. This cost depends significantly on both the number of input points and the dimensionality of the space. Fortunately, with advanced computing technologies such as GPUs and improved algorithms, the computational performance of convex hull calculations has been enhanced. The authors in ref. [34] propose a hybrid algorithm that addresses this issue by using a GPU-based filter to reduce the number of points and then utilizing a CPU to compute the final convex hull from the remaining points. The results indicate that the proposed hybrid algorithm significantly improves performance, i.e., a 10–27 times speedup for static point sets and a 22–46 times speedup for deforming point sets. In this study, the convex hull area is calculated in Python on two-dimensional points obtained by projecting the 768-dimensional BERT embeddings (30 responses per prompt), which are clustered using the DBSCAN algorithm. Therefore, the computational cost associated with calculating convex hulls is not a concern in this study and does not depend on the complexity or length of the given prompt.
4 Experimental results and discussion
This section provides experimental results obtained by the convex hull-based UQ method using different dimensionality reduction techniques, i.e., PCA (Principal Component Analysis), MDS (Multidimensional Scaling), and Isomap for the selected LLMs, and a discussion on these results.
4.1 Experimental results
The experiments were conducted using three types of prompts: ’easy’, ’moderate’, and ’confusing’, with responses generated by three different models (Gemini-Pro, GPT-3.5-turbo, and GPT-4o) at various temperature settings (0.25, 0.5, 0.75, and 1.0). The results are presented in Fig. 2a–c, which show the uncertainty values (convex hull areas) plotted against the temperature settings for each model and prompt type across different dimensionality reduction techniques. Tables 2, 3, and 4 provide the statistical details, including the mean and standard deviation (std), for each model and prompt type at different temperature settings.
Fig. 2a–c shows convex hull areas and trends for each dimensionality reduction technique, i.e., PCA, Isomap, and MDS, based on the temperature setting, model and prompt type. In this figure, the x-axis represents the temperature setting, while the y-axis shows the Convex Hull Area. Figure 2a visualizes the trend of convex hull areas, i.e., uncertainty, for the PCA technique. As anticipated, the convex hull area values tend to increase with higher temperatures across all models and prompt types. This trend suggests that as the temperature increases, the LLMs’ responses become more diverse or uncertain, resulting in larger convex hull areas. Among the three models, Gemini-pro with the confusing prompt type exhibits the highest convex hull area (i.e., uncertainty) at higher temperatures, with a value of 4.0 at a temperature setting of 1.0. This is followed by GPT-3.5-turbo with the confusing prompt type, which has a convex hull area of 2.4 at the same temperature setting. All other case studies (LLMs with different prompt types) exhibit lower uncertainty, with convex hull areas of 1.0 or less. This indicates that GPT-4o can maintain low uncertainty even at higher temperature settings.
Fig. 2b and c present the results for the Isomap and MDS dimensionality reduction techniques, respectively. These figures show trends similar to those observed in the PCA results. Gemini-pro and GPT-3.5-turbo with the confusing prompt type exhibit a higher convex hull area (i.e., uncertainty) at higher temperatures than the other cases, i.e., convex hull areas of 30 and 18 for Isomap, and 24 and 12 for MDS, respectively. Other cases exhibit lower uncertainty, with convex hull areas of 5.0 or less. Once again, GPT-4o maintained low uncertainty even at higher temperature settings.
Tables 2, 3, and 4 provide comprehensive results for understanding how temperature settings and prompt types influence the variability and consistency of each model’s responses, by examining the mean and standard deviation (std) of the convex hull areas.
Table 2 presents the mean and standard deviation of the convex hull areas for different prompt types across various temperature settings for three distinct models, namely Gemini-Pro, GPT-3.5-turbo, and GPT-4o, using the PCA dimensionality reduction technique. According to the results for Gemini-Pro, the mean convex hull area for easy prompts increases from 0.392 at a temperature setting of 0.25 to 2.548 at a temperature setting of 1.0, and the standard deviation increases from 0.095 at 0.25 to 0.378 at 1.0. Similar trends are observed for moderate and confusing prompts, indicating how response variability changes with temperature settings for this model. For moderate prompts, the mean convex hull area increases from 0.308 at a temperature setting of 0.25 to 2.482 at a temperature setting of 1.0, with the corresponding standard deviation increasing from 0.230 to 1.050. For confusing prompts, the mean convex hull area increases from 0.417 at a temperature setting of 0.25 to 7.798 at a temperature setting of 1.0, with a corresponding increase in standard deviation. GPT-3.5-turbo shows a higher mean convex hull area for confusing prompts than for easy and moderate prompts, particularly noticeable at higher temperatures. For example, at a temperature of 1.0, the mean convex hull area for confusing prompts is 8.926 with a standard deviation of 3.809, indicating significant response variability. GPT-4o offers better performance compared to the other models. For example, the mean convex hull area for easy prompts starts at 0.285 at a temperature setting of 0.25 and increases to 2.221 at a temperature setting of 1.0. As expected, the mean convex hull area is higher for moderate and confusing prompts. The standard deviation increases similarly, indicating more dispersed responses at higher temperatures; for instance, for confusing prompts it increases from 0.085 at a temperature setting of 0.25 to 1.307 at a temperature setting of 1.0.
Table 3 provides a comprehensive analysis of the mean and standard deviation (std) of convex hull areas for selected language models and different prompt types at various temperature settings, similar to the previous case, but using the Isomap dimensionality reduction technique. The statistical metrics in this table indicate the typical response variability for each model, prompt type, and temperature. For Gemini-Pro, the mean convex hull areas increase with higher temperatures across all prompt types, except for moderate prompts at a temperature setting of 0.5. For example, the mean value for confusing prompts increases from 7.914 at a temperature setting of 0.25 to 29.966 at a temperature setting of 1.0, and the std value increases from 12.499 at 0.25 to 24.085 at 1.0. The results indicate significant variability, particularly at higher temperatures, suggesting that the model’s responses become more diverse as temperature increases. Similarly, the GPT-3.5-turbo model also shows high mean convex hull areas for confusing prompts, especially at higher temperatures. At a temperature setting of 1.0, the mean convex hull area for confusing prompts reaches 18.703, with a standard deviation of 13.801, highlighting significant variability in the model’s responses. For GPT-4o, the mean convex hull areas show a reasonable increase with temperature for confusing prompts, where the mean value increases from 0.956 at a temperature setting of 0.25 to 3.728 at a temperature setting of 1.0, and the std value increases from 0.697 at 0.25 to 2.432 at 1.0. However, for easy and moderate prompts, the statistical results appear unusual. For example, it is expected that the mean values would increase at higher temperature settings, but the mean values decrease from 4.064 at a temperature setting of 0.25 to 3.181 at a temperature setting of 1.0.
Table 4 presents a comprehensive analysis of uncertainty for the responses generated by the selected LLMs and prompt types in terms of mean and standard deviation statistical metrics using the MDS dimensionality reduction technique. For the Gemini-Pro model, the convex hull area varies with the prompt type. Easy prompts result in a mean of 2.437 with a standard deviation of 0.569 at a temperature setting of 0.25, and a mean of 4.957 with a standard deviation of 1.259 at a temperature setting of 1.0, indicating high variability in clustering. The statistical values are higher for confusing prompts, with a mean reaching 24.050 and a standard deviation of 10.298 at a temperature setting of 1.0. As expected, easy prompts lead to smaller convex hull areas with more consistent cluster areas. The GPT-3.5-turbo model also demonstrates high statistical results (i.e., mean and std) for convex hull areas, particularly for confusing prompts, such as a mean of 12.169 with a standard deviation of 4.651 at a temperature setting of 1.0. For the GPT-4o model, the clustering results are relatively stable, with a mean of 1.378 for easy prompts and 2.292 for moderate prompts at a temperature setting of 1.0. The mean convex hull area for confusing prompts is 2.047 with a standard deviation of 1.569 at a temperature setting of 0.25, and increases to 3.744 with a standard deviation of 1.881 at a temperature setting of 1.0, indicating moderate variability in response clustering compared to the results from easy and moderate prompts.
The visualizations of the convex hull areas for the selected confusing prompt at a temperature setting of 1.0 are shown in Fig. 3a–c. Each subfigure illustrates the two-dimensional projection of the response embeddings after applying the PCA, Isomap, and MDS dimensionality reduction techniques, respectively. According to the figure, for the selected example case, the model responses were relatively homogeneous, converging into one main group with low variability. A single cluster with a smaller convex hull area suggests that the prompt was straightforward or unambiguous, leading the model to generate more consistent responses. Conversely, the presence of multiple clusters indicates significant variability in the model’s responses, suggesting that the prompt was either ambiguous or complex, allowing for multiple valid interpretations. A larger overall convex hull area reflects higher uncertainty in the model’s outputs, as the responses are spread across a wider semantic space.
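For reference, a Fig. 3-style plot could be produced with a sketch like the one below, which scatters the 2D embeddings and outlines the convex hull of the largest non-noise cluster; the styling choices are illustrative, not those used for the published figures.

```python
# Sketch of a Fig. 3-style visualization: 2-D embeddings with the convex hull
# of the largest non-noise cluster outlined.
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

def plot_embeddings_with_hull(points_2d: np.ndarray, labels: np.ndarray) -> None:
    plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap="viridis", s=25)
    clusters = [c for c in set(labels.tolist()) if c != -1]
    if clusters:
        # Outline the hull of the cluster with the most points.
        largest = max(clusters, key=lambda c: int(np.sum(labels == c)))
        pts = points_2d[labels == largest]
        if len(pts) >= 3:
            hull = ConvexHull(pts)
            for simplex in hull.simplices:
                plt.plot(pts[simplex, 0], pts[simplex, 1], "r-")
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.show()
```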
4.2 Discussion
This subsection investigates the results obtained through the PCA technique for each prompt type, i.e., ’Easy’, ’Moderate’, and ’Confusing’, in Fig. 4a–c, in terms of uncertainty value and temperature setting, to provide a detailed analysis.
4.2.1 Easy prompts
Figure 4a illustrates the relationship between the uncertainty values and the temperature settings for easy prompts. As observed, the uncertainty values tend to increase with higher temperatures for all models, which is an expected pattern for LLMs and their responses. This trend indicates that as the temperature increases, the responses become more diverse, leading to larger convex hull areas. Among the three models, Gemini-pro exhibits the highest uncertainty at higher temperatures (a mean of 2.548 at 1.0), followed by GPT-3.5-turbo (a mean of 2.269 at 1.0) and GPT-4o (a mean of 2.221 at 1.0). GPT-4o also shows the lowest uncertainty at lower temperature settings, i.e., 0.285 at 0.25, while still generating a broader range of responses under varying temperature settings, reflecting its ability to capture a wider spectrum of possible outputs.
4.2.2 Moderate prompts
The results for moderate prompts, shown in Fig. 4b, follow a pattern similar to those of easy prompts, with uncertainty values increasing as the temperature increases. However, the overall uncertainty values are higher for moderate prompts than for easy prompts, reflecting the increased complexity and variability of the LLMs’ responses. Gemini-Pro shows the highest uncertainty, particularly at high temperatures, i.e., a mean of 2.4819 at 1.0. GPT-3.5-turbo and GPT-4o also demonstrate increasing uncertainty with higher temperatures, with means of 2.2830 and 2.0614 at 1.0, respectively. As expected, all models tend toward low uncertainty values at low-temperature settings, i.e., means of 0.3084, 1.1996, and 0.3003 at 0.25 for Gemini-Pro, GPT-3.5-turbo, and GPT-4o, respectively. However, GPT-3.5-turbo shows a higher uncertainty value (a mean of 1.1996) at a temperature setting of 0.25 compared to the other LLMs.
4.2.3 Confusing prompts
Figure 4c presents the uncertainty values for confusing prompts. As expected, the uncertainty is significantly higher for confusing prompts than for easy and moderate prompts, highlighting the challenge of generating consistent responses to complex and ambiguous queries. The increase in uncertainty with temperature is expected for confusing prompts, and it is particularly pronounced for GPT-3.5-turbo and Gemini-pro at higher temperatures, i.e., means of 8.9260 and 7.7980 at 1.0, respectively. GPT-4o offers better performance, i.e., a mean of 3.1909 at 1.0, and shows a more gradual increase (from 0.2015 at a temperature setting of 0.5 to 3.1909 at 1.0) compared to the other two models. This indicates that GPT-3.5-turbo and Gemini-pro are more sensitive to high-temperature settings, generating a wider range of responses when faced with confusing prompts.
5 Observations
The results from these experiments demonstrate the value of the proposed convex hull-based approach for UQ of LLMs’ responses. The convex hull areas provide a robust measure of the diversity and variability of LLMs’ responses, with higher values indicating greater uncertainty. The analysis of the selected cases can be extended using the detailed results; however, the following observations can be highlighted, categorized by temperature setting, LLM comparison, prompt complexity, and dimensionality reduction technique:
1. Temperature Setting: All models show increasing uncertainty at higher temperature settings, i.e., above 0.75. This is a highly expected observation, and it highlights the importance of temperature settings in controlling the randomness and diversity of generated LLMs’ responses.
2. LLM Comparison: GPT-4o consistently offers better performance in terms of uncertainty compared to GPT-3.5-turbo and Gemini-pro. On the other hand, GPT-3.5-turbo shows worse performance, particularly at higher temperature settings, i.e., generating more diverse responses.
3. Prompt Complexity: The uncertainty values are higher for more complex prompts, i.e., confusing prompts, across all selected models, indicating the inherent difficulty in generating consistent responses to such queries.
4. Dimensionality Reduction Techniques: Dimensionality reduction techniques are used to project the original high-dimensional embeddings onto a two-dimensional subspace that maximizes variance. For the selected techniques, i.e., PCA, Isomap, and MDS, the convex hull area results demonstrate similar trends, particularly for confusing prompts at higher temperature settings. These methods have no direct effect on uncertainty; however, they are valuable tools for computing convex hull areas, helping to understand and visualize response variability.
These observations emphasize the importance of temperature settings and prompt complexity in the uncertainty of LLMs’ responses. The proposed convex hull-based geometric approach provides a novel and effective metric to capture the uncertainty of LLMs’ responses, as well as valuable insight for the development and evaluation of current and future LLMs.
6 Conclusion
This study proposes a novel geometric approach to quantifying uncertainty for LLMs. The proposed approach utilizes the spatial properties of convex hulls formed by embeddings of model responses to measure dispersion, particularly for the complex and high-dimensional text generated by LLMs. In this study, three types of prompts are used, i.e., ’easy’, ’moderate’, and ’confusing’, for three different LLMs, namely GPT-3.5-turbo, GPT-4o, and Gemini-pro, at various temperature settings, i.e., 0.25, 0.5, 0.75, and 1.0. The responses are transformed into high-dimensional embeddings using a BERT model and then projected into a 2D space using PCA, Isomap, and MDS. The DBSCAN algorithm is utilized to identify clusters, and the convex hull areas that surround these clusters are used as the metric of uncertainty. The proposed approach provides a clear and interpretable metric for developing more reliable LLMs. By understanding the factors that influence uncertainty, such as temperature settings and prompt complexity, researchers and engineers can develop and evaluate LLMs more effectively.
The experimental results indicated several key findings: (1) the uncertainty increased with higher temperature settings across all models, highlighting the role of temperature in controlling the diversity of generated responses, (2) GPT-4o consistently exhibited lower uncertainty values compared to GPT-3.5-turbo and Gemini-pro, indicating its ability to maintain consistent responses even at higher temperature settings, and (3) the uncertainty values were significantly higher for more complex prompts, particularly confusing prompts, across all models, reflecting the inherent challenge in generating consistent responses to such queries.
For future work, the proposed geometric approach will be extended to other types of embeddings and clustering algorithms for more robust and reliable LLMs in critical domains, such as healthcare, finance, and law. In addition, integrating this uncertainty metric with other evaluation criteria, such as accuracy and coherence, could provide a more comprehensive assessment of LLMs’ performance.
Data availability
Data cannot be shared openly but are available on request from the authors.
References
Bharathi Mohan G, Prasanna Kumar R, Vishal Krishh P, Keerthinathan A, Lavanya G, Meghana MKU, Sulthana S, Doss S. An analysis of large language models: their impact and potential applications. Knowl Inform Syst. 2024. https://doi.org/10.1007/s10115-024-02120-8.
Bsharat SM, Myrzakhan A, Shen Z. Principled instructions are all you need for questioning LLama-1/2, GPT-3.5/4. arXiv preprint arXiv:2312.16171. 2023.
Suta P, Lan X, Wu B, Mongkolnam P, Chan JH. An overview of machine learning in chatbots. Int J Mech Eng Robot Res. 2020;9(4):502–10.
Kandpal P, Jasnani K, Raut R, Bhorge S. Contextual chatbot for healthcare purposes (using deep learning). In: 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4). IEEE; 2020. pp. 625–34.
Javaid M, Haleem A, Singh RP. Chatgpt for healthcare services: an emerging stage for an innovative perspective. BenchCounc Trans Benchmark Stand Eval. 2023;3(1): 100105.
Aydın N, Erdem OA. A research on the new generation artificial intelligence technology generative pretraining transformer 3. In: 2022 3rd International Informatics and Software Engineering Conference (IISEC). IEEE; 2022. pp. 1–6.
Kuzlu M, Xiao Z, Sarp S, Catak FO, Gurler N, Guler O. The rise of generative artificial intelligence in healthcare. In: 2023 12th Mediterranean Conference on Embedded Computing (MECO). 2023. pp. 1–4.
Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X, Khosravi A, Acharya UR, et al. A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inform Fus. 2021;76:243–97.
Felicioni N, Maystre L, Ghiassian S, Ciosek K. On the importance of uncertainty in decision-making with large language models. arXiv preprint arXiv:2404.02649. 2024.
Zhang C, Liu F, Basaldella M, Collier N. Luq: long-text uncertainty quantification for llms. arXiv preprint arXiv:2403.20279. 2024.
Huang Y, Song J, Wang Z, Chen H, Ma L. Look before you leap: an exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236. 2023.
Tanneru SH, Agarwal C, Lakkaraju H. Quantifying uncertainty in natural language explanations of large language models. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2024. pp. 1072–80.
Huang X, Li S, Yu M, Sesia M, Hassani H, Lee I, Bastani O, Dobriban E. Uncertainty in language models: assessment through rank-calibration. arXiv preprint arXiv:2404.03163. 2024.
Yang Y, Li H, Wang Y, Wang Y. Improving the reliability of large language models by leveraging uncertainty-aware in-context learning. arXiv preprint arXiv:2310.04782. 2023.
Chen ZZ, Ma J, Zhang X, Hao N, Yan A, Nourbakhsh A, Yang X, McAuley J, Petzold L, Wang WY. A survey on large language models for critical societal domains: finance, healthcare, and law. arXiv preprint arXiv:2405.01769. 2024.
Ouyang S, Yun H, Zheng X. How ethical should AI be? how AI alignment shapes the risk preferences of llms. arXiv preprint arXiv:2406.01168. 2024.
Savage T, Wang J, Gallo R, Boukil A, Patel V, Ahmad Safavi-Naini SA, Soroush A, Chen JH. Large language model uncertainty measurement and calibration for medical diagnosis and treatment. medRxiv. 2024. pp. 2024–06.
Catak FO, Kuzlu M. Trustworthy ai: from theory to practice. 2024. https://digitalcommons.odu.edu/engtech_books/5. Accessed 23 Nov 2024.
Cheong I, Xia K, Feng K, Chen QZ, Zhang AX. (a) i am not a lawyer, but...: engaging legal experts towards responsible llm policies for legal advice. arXiv preprint arXiv:2402.01864. 2024.
Nemani V, Biggio L, Huan X, Hu Z, Fink O, Tran A, Wang Y, Zhang X, Hu C. Uncertainty quantification in machine learning for engineering design and health prognostics: a tutorial. Mech Syst Signal Process. 2023;205: 110796.
Stracuzzi DJ, Darling MC, Peterson MG, Chen MG. Quantifying uncertainty to improve decision making in machine learning. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States), Tech Rep. 2018.
Jalaian B, Lee M, Russell S. Uncertain context: uncertainty quantification in machine learning. AI Mag. 2019;40(4):40–9.
Pernot P. Calibration in machine learning uncertainty quantification: beyond consistency to target adaptivity. APL Mach Learn. 2023;1(4): 046121.
Hu M, Zhang Z, Zhao S, Huang M, Wu B. Uncertainty in natural language processing: sources, quantification, and applications. arXiv preprint arXiv:2306.04459. 2023.
Tavazza F, Choudhary K, DeCost B. Approaches for uncertainty quantification of ai-predicted material properties: a comparison. arXiv preprint arXiv:2310.13136. 2023.
Lan S, Li S, Shahbaba B. Scaling up Bayesian uncertainty quantification for inverse problems using deep neural networks. SIAM/ASA J Uncertain Quantif. 2022;10(4):1684–713.
Catak FO, Yue T, Ali S. Uncertainty-aware prediction validator in deep learning models for cyber-physical system data. ACM Trans Softw Eng Methodol. 2022;31(4):1–31.
Chen P, Schwab C. Model order reduction methods in computational uncertainty quantification. In: Handbook of uncertainty quantification. 2016. pp. 1–53.
Cheng S, Quilodrán-Casas C, Ouala S, Farchi A, Liu C, Tandeo P, Fablet R, Lucor D, Iooss B, Brajard J, Xiao D, Janjic T, Ding W, Guo Y, Carrassi A, Bocquet M, Arcucci R. Machine learning with data assimilation and uncertainty quantification for dynamical systems: a review. IEEE/CAA J Autom Sin. 2023;10(6):1361–87.
Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning. PMLR; 2016. pp. 1050–59.
Devlin J, Chang M-W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165. 2020.
Ye F, Ming Y, Pang J, Wang L, Wong DF, Emine Y, Shi S, Tu Z. Benchmarking llms via uncertainty quantification. arXiv preprint arXiv:2401.12794. 2024.
Tang M, Zhao J-Y, Tong R-F, Manocha D. GPU accelerated convex hull computation. Computers Gr. 2012;36(5):498–506.
Author information
Contributions
F.O.C and M.K. wrote the main manuscript text. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.