1 Introduction

In recent years, advances in computing technologies have driven significant progress in the field of Natural Language Processing (NLP), leading to the development of sophisticated large language models (LLMs), such as GPT-3.5/4, Gemini, LLaMA, and others [1, 2]. These LLMs have increasingly been used as powerful tools for NLP and have demonstrated strong performance in generating contextually relevant text, supporting a variety of applications, such as chatbots, automated content generation, virtual agents, text categorization, language translation, and many more [3,4,5].

In the literature, the number of studies on NLP and LLMs has increased dramatically since the introduction of the Generative Pre-trained Transformer 3 (GPT-3) by OpenAI in 2020, which was the most sophisticated neural network at that time [6]. It could create enhanced textual, visual, and auditory content without human intervention [7]. However, there are serious concerns regarding the reliability and uncertainty of LLMs’ responses. The authors in ref. [8] highlight the importance of Uncertainty Quantification (UQ) methods in reducing uncertainties in optimization and decision-making processes with traditional methods, such as Bayesian approximation and ensemble learning. The authors in ref. [9] emphasize the role of uncertainty estimation, particularly epistemic uncertainty, for LLMs, and evaluate several techniques, i.e., Laplace Approximation, Dropout, and Epinets. They indicate that uncertainty plays a crucial role in developing LLM-based agents for decision making, and that uncertainty information can significantly contribute to the overall performance of LLMs. However, these traditional approaches have difficulties dealing with the complex and high-dimensional characteristics of texts generated by LLMs. Therefore, novel approaches are needed to effectively capture the uncertainty in LLMs’ responses and enhance their trustworthiness [10]. The authors in ref. [11] conduct an early exploratory study of uncertainty measurement with twelve uncertainty estimation methods to help characterize the prediction risks of LLMs, and investigate the correlations between LLMs’ prediction uncertainty and their performance. Their results indicate a high potential for uncertainty estimation to identify inaccurate or unreliable predictions generated by LLMs. To quantify the uncertainty of LLMs, the authors of ref. [12] propose two metrics, Verbalized Uncertainty and Probing Uncertainty. Verbalized uncertainty prompts the LLM to express confidence in its explanations, whereas probing uncertainty applies sample and model perturbations to quantify the uncertainty. According to the results, probing uncertainty outperforms verbalized uncertainty, with lower uncertainty corresponding to explanations with higher faithfulness. The study [13] presents a framework, called Rank-Calibration, to assess uncertainty and confidence measures for LLMs; the framework also provides a comprehensive robustness analysis and interpretability for LLMs. Another study [14] presents a framework to improve the reliability of LLMs, called Uncertainty-Aware In-context Learning. It involves fine-tuning the LLM using a calibration dataset, and evaluates the model’s knowledge by analyzing multiple responses to the same query to determine whether a correct answer is present.

To address this gap, this paper proposes a convex hull-based geometric approach to UQ for LLMs. The main contributions of this study are (1) introducing a geometric approach to UQ of LLMs’ responses, (2) implementing a comprehensive system that processes a wide range of prompts and generates responses using different models and temperature settings, and (3) highlighting the relationship between prompt complexity, model settings, and uncertainty.

2 Related work

In recent years, uncertainty quantification (UQ) has been one of the most challenging topics in AI/ML, particularly for critical applications using LLMs in healthcare, finance, and law [15,16,17,18,19]. Uncertainty, which refers to the confidence of the model in its predictions, can originate from different sources, such as the model architecture, model parameters, noise, and insufficient information in the dataset(s) [20,21,22,23], in addition to the nature of LLMs. Training data can also be an influential source of uncertainty due to the complexity and diversity of the selected dataset(s). Furthermore, uncertainty should be addressed individually for each domain based on its specific requirements and constraints. For example, data sensitivity, privacy, structure, and ethical considerations differ between healthcare and finance applications. Therefore, it is essential to adapt UQ methods to the decision-making processes of each critical application based on domain-specific factors to mitigate the uncertainty of LLMs. The study [8] provides a comprehensive overview of UQ methods. There are three widely used families of UQ methods, i.e., confidence-based methods [24], ensemble methods, and Bayesian methods. Confidence-based methods evaluate the reliability of model outputs using entropy, probability, and calibration. Ensemble methods employ multiple models to calculate the uncertainty estimate [25], while Bayesian approximation methods use computational techniques [26], such as Monte Carlo sampling. The authors in ref. [27] propose a method called NIRVANA (uNcertaInty pRediction ValidAtor iN Ai) to investigate the correlation between uncertainty and the performance of a deep learning model. Experimental results show that uncertainty quantification is negatively correlated with the model’s prediction performance. Additionally, adjusting dropout ratios effectively reduces the uncertainty of correct predictions while increasing the uncertainty of incorrect ones. The study [28] provides a comprehensive survey of the mathematical and computational foundations of model order reduction (MOR) techniques for UQ problems with distributed uncertainties. These MOR techniques are applied to both forward and inverse UQ problems, resulting in significant computational reductions in the statistical evaluation of many-query scenarios, as well as in real-time applications for rapid Bayesian inversion. Another study [29] provides an overview of recent advances in the interdisciplinary field that combines data assimilation (DA), uncertainty quantification (UQ), and machine learning (ML) to address critical challenges in high-dimensional dynamical systems, such as system identification, reduced-order surrogate modeling, error covariance specification, and model error correction. It focuses on how existing limitations of DA and UQ can be mitigated by ML methods, and vice versa, and aims to assist ML scientists in improving model accuracy and interpretability, and DA/UQ experts in integrating ML techniques into DA and UQ processes.

This section provides a brief overview of the existing literature on UQ methods applied to NLP and LLMs, together with an analysis of responses generated for a selected confusing prompt.

In the literature, many studies focus on developing uncertainty quantification techniques in the context of NLP and LLMs. The study [30] proposes the use of dropout as a Bayesian approximation to estimate uncertainty in deep learning-based models. This method, known as Monte Carlo dropout, involves applying dropout during both training and inference to generate multiple stochastic forward passes and approximate the posterior distribution of the model’s predictions; a small sketch of this mechanism is shown below. Related research explores the use of pre-trained language models, such as BERT [31] and GPT-3 [32], for UQ. Furthermore, most UQ studies have limited capabilities and applications, e.g., short text generation. To bridge this gap, the authors in ref. [10] point out the limitations of existing methods for long text generation, and propose a novel UQ approach, called Luq-Ensemble (an improved version of LUQ), which ensembles responses from multiple models and selects the response with the least uncertainty. Based on the results, the proposed approach demonstrates better performance compared to traditional methods and improves the factual accuracy of the responses compared to the best individual LLM. Given the growing number of applications using LLMs and high reliability expectations, novel UQ methods and benchmarks with comprehensive evaluation strategies must be developed to understand the uncertainty of LLMs and to guide researchers and engineers towards more robust and reliable LLMs. Fortunately, several studies propose benchmarks to address this. The authors in ref. [33] develop a benchmark for LLMs involving UQ with an uncertainty-aware evaluation metric, called UAcc, that accounts for both prediction accuracy and uncertainty. The developed benchmark consists of eight LLMs (LLM series) spanning five representative natural language processing tasks. The results indicate several findings related to LLM uncertainty, e.g., higher accuracy may come with lower certainty, and instruction fine-tuning may increase the uncertainty of LLMs.
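For illustration, the following is a minimal Monte Carlo dropout sketch in PyTorch; the classifier architecture, shapes, and dropout rate are hypothetical and only serve to show the mechanics of stochastic forward passes described in ref. [30].

```python
import torch
import torch.nn as nn

# A toy classifier with a dropout layer; the architecture is illustrative only.
model = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # kept active at inference for MC dropout
    nn.Linear(256, 2),
)

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_passes: int = 30):
    """Run several stochastic forward passes with dropout enabled and
    return the mean prediction and its variance as an uncertainty proxy."""
    model.train()  # train mode keeps dropout stochastic during inference
    with torch.no_grad():
        preds = torch.stack(
            [model(x).softmax(dim=-1) for _ in range(n_passes)]
        )
    return preds.mean(dim=0), preds.var(dim=0)

x = torch.randn(1, 768)  # e.g., a sentence embedding
mean_pred, uncertainty = mc_dropout_predict(model, x)
```

A high variance across passes flags inputs on which the model is epistemically uncertain.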

Despite these advancements, the proposed methods may not fully capture the underlying uncertainties of LLMs for the reasons mentioned earlier. Evaluating the performance of LLMs is essential for the development and deployment of robust LLMs, which makes the need for novel UQ approaches even more critical. One possible direction is a geometric approach to UQ, particularly convex hull analysis, which has received limited attention in the literature. This study builds on the foundation of these existing methods and introduces a novel approach that leverages the geometric properties of response embeddings to measure uncertainty in LLM outputs.

Table 1 presents an analysis of the responses generated for a strategically chosen confusing prompt: “Is a broken clock right if it tells the right time twice a day?”. This prompt, selected for its potential to produce a wide range of responses, is used to evaluate the variability and spread of responses at different temperature settings, i.e., 0.25, 0.5, 0.75, and 1.0, using the convex hull area as a metric.

Table 1 Convex hull area metrics for responses to a confusing prompt

The table includes three responses: Response 1, Response 2, and Response 3. Response 1 discusses the paradox of a rule that states all rules have exceptions, arguing that such a rule cannot be valid since it would mean that at least one rule does not have an exception. Response 2 expands on the logical inconsistency if the rule itself is an exception, suggesting that some rules must not have exceptions, which contradicts the original assertion. Response 3 is similar to Response 1, addressing the self-contradiction inherent in the statement.

In this case, the convex hull area is measured over the three responses (Response 1, 2, and 3) at four different temperature settings (0.25, 0.5, 0.75, and 1.0). These temperatures control the randomness in the response generation process. The convex hull area values are 1.3122 at 0.25, 5.0236 at 0.5, 6.5476 at 0.75, and 5.7562 at 1.0. These values indicate the spread and variability of the responses, which can be interpreted as follows:

  • A larger convex hull area suggests greater variability, which, in the context of this study, can be indicative of increased uncertainty or diversity in the generated responses.

On the other hand, these values can serve as thresholds for bounding the convex hull area. As seen in the table, the areas are close to each other at the higher temperature settings, i.e., 0.5, 0.75, and 1.0, ranging between 5.0236 and 6.5476.

Table 1 also provides a brief view of how the model’s responses change at different temperature settings for confusing prompts. Traditional methods can still evaluate the uncertainty of model outputs; however, they have difficulties when applied to LLMs, and novel approaches can complement these traditional techniques. To address this requirement, this study proposes a novel geometric approach to UQ using convex hull analysis for the uncertainty of LLMs’ responses, along with several use cases.

3 System model

This section presents the proposed approach to calculating the uncertainty of the generated responses based on the geometric properties of convex hull areas for each prompt. The system processes a comprehensive set of categorized prompts of three distinct types: ’easy’, ’moderate’, and ’confusing’. The responses are transformed into high-dimensional embeddings via a BERT model, projected into a two-dimensional space using a dimensionality reduction technique such as PCA, Isomap, or MDS, and then clustered by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.

The overview of the system is shown in Fig. 1. The figure illustrates the workflow of the model, from the input prompt to the calculation of the convex hull area, which provides a measure of uncertainty. The process begins with a given prompt (e.g., “Explain the process of photosynthesis in detail.”) and a specified temperature setting, which are fed into an LLM. The model generates multiple responses to the prompt, each providing a potentially different elaboration on the topic. These responses are then encoded into high-dimensional embeddings using a BERT model. The embeddings, initially in a high-dimensional space, are projected onto a two-dimensional space using Principal Component Analysis (PCA) for easier visualization and clustering. The 2D embeddings are clustered using the DBSCAN algorithm, which identifies groups of similar responses. For each cluster, a convex hull is computed, representing the smallest convex boundary that encloses all points in the cluster. The area of each convex hull is calculated, with the total area serving as a measure of the uncertainty of the model’s responses to the given prompt. The example shown includes a plot of the 2D embeddings with the convex hull of the densest cluster highlighted, illustrating the spatial distribution and clustering of the responses.

Fig. 1
figure 1

The system overview for calculating uncertainty in LLM responses

Let \({\mathcal {P}} = \{p_1, p_2, \ldots , p_n\}\) denote the set of prompts, which are categorized into three types: ’easy’, ’moderate’, and ’confusing’. For each prompt \(p \in {\mathcal {P}}\), a diverse set of responses \({\mathcal {R}}(p) = \{r_1, r_2, \ldots , r_m\}\) is generated using a suite of different language models \({\mathcal {M}} = \{m_1, m_2, \ldots , m_k\}\) at varying temperature settings \({\mathcal {T}} = \{t_1, t_2, \ldots , t_l\}\).

The embeddings for each response \(r \in {\mathcal {R}}(p)\) are computed utilizing a pre-trained BERT model, represented mathematically as:

$${\mathbf{E}}(r) = {\text{BERT}}(r)$$

where \({\textbf{E}}(r) \in {\mathbb {R}}^d\) denotes the embedding vector in a \(d\)-dimensional space, encapsulating the semantic content of the response in a high-dimensional feature space.
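As a concrete illustration, this embedding step could be realized with the Hugging Face transformers library as sketched below; the choice of the bert-base-uncased checkpoint (\(d = 768\)) and of mean pooling over token states is an assumption, since the paper only states that a pre-trained BERT model is used.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of E(r) = BERT(r); checkpoint and pooling choice are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(response: str) -> torch.Tensor:
    """Map a response string to a d-dimensional embedding (d = 768)."""
    inputs = tokenizer(response, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pooled, shape (768,)
```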

Given the set of embeddings \({\textbf{E}}({\mathcal {R}}(p)) = \{{\textbf{E}}(r_1), {\textbf{E}}(r_2), \ldots , {\textbf{E}}(r_m)\}\), one of the selected dimensionality reduction techniques (RD), i.e., PCA, Isomap, or MDS, is applied to reduce the dimensionality of the embedding vectors to 2, enabling effective visualization and clustering:

$$\begin{aligned} {\textbf{E}}_\text {RD}({\mathcal {R}}(p)) = \text {RD}({\textbf{E}}({\mathcal {R}}(p)), 2) \end{aligned}$$

where \({\textbf{E}}_\text {RD}(r) \in {\mathbb {R}}^2\). The transformation projects the original high-dimensional embeddings onto a two-dimensional subspace: PCA maximizes the retained variance, while Isomap and MDS preserve geodesic and pairwise distances, respectively, thereby preserving the most significant structure of the data.
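A minimal sketch of this reduction step using scikit-learn is given below; the estimator hyperparameters are library defaults rather than values specified in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap

def reduce_to_2d(embeddings: np.ndarray, method: str = "PCA") -> np.ndarray:
    """Project an (m, 768) embedding matrix onto 2D with the chosen RD."""
    reducers = {
        "PCA": PCA(n_components=2),
        "Isomap": Isomap(n_components=2),
        "MDS": MDS(n_components=2),
    }
    return reducers[method].fit_transform(embeddings)
```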

Subsequently, the DBSCAN algorithm is utilized to cluster the reduced embeddings to identify distinct groupings within the response space:

$$\begin{aligned} {\textbf{L}} = \text {DBSCAN}({\textbf{E}}_\text {RD}({\mathcal {R}}(p)), \epsilon = 0.25 \times t \times 4.0, \text {min\_samples}=3) \end{aligned}$$

where \({\textbf{L}}\) represents the set of cluster labels assigned to each embedding point, with \(\epsilon\) controlling the maximum distance between points in the same cluster and \(\text {min\_samples}\) specifying the minimum number of points required to form a cluster.
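In code, the clustering step reduces to a single scikit-learn call; the sketch below simply applies the \(\epsilon\) rule from the equation above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster(points_2d: np.ndarray, temperature: float) -> np.ndarray:
    """Cluster the 2D embeddings; eps scales with the temperature t."""
    eps = 0.25 * temperature * 4.0  # the rule stated above
    return DBSCAN(eps=eps, min_samples=3).fit_predict(points_2d)  # -1 = noise
```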

For each identified cluster \(c \in {\textbf{L}}\), excluding noise points (i.e., \(c = -1\)), we compute the convex hull and its corresponding area, which encapsulates the geometric boundary of the cluster:

$$\begin{aligned} \text {ConvexHull}(c) = \text {ConvexHull}(\{{\textbf{E}}_\text {RD}(r) : r \in c\}), \quad \text {Area}(\text {ConvexHull}(c)) \end{aligned}$$

The convex hull is the smallest convex polygon that can enclose all the points in the cluster, and the area of the convex hull is a measure of the spatial extent of the cluster.

The total convex hull area for a given prompt \(p\) at temperature \(t\) is then defined as the summation of the areas of all clusters:

$$\begin{aligned} A(p, t) = \sum _{c \in {\textbf{L}}, c \ne -1} \text {Area}(\text {ConvexHull}(c)) \end{aligned}$$

This metric provides an aggregate measure of the uncertainty and dispersion of the model’s responses to the prompt, with larger areas indicating higher uncertainty.
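This aggregation can be sketched with SciPy as follows; note that for two-dimensional inputs, scipy.spatial.ConvexHull reports the enclosed area in its volume attribute, and degenerate clusters (fewer than three points, or collinear points) are skipped since they enclose no area.

```python
import numpy as np
from scipy.spatial import ConvexHull, QhullError

def total_hull_area(points_2d: np.ndarray, labels: np.ndarray) -> float:
    """Compute A(p, t): the summed hull areas of all non-noise clusters."""
    area = 0.0
    for c in set(labels):
        if c == -1:           # -1 marks DBSCAN noise points
            continue
        pts = points_2d[labels == c]
        if len(pts) < 3:      # a 2D hull needs at least three points
            continue
        try:
            # In 2D, ConvexHull.volume is the area (.area is the perimeter).
            area += ConvexHull(pts).volume
        except QhullError:    # degenerate (e.g., collinear) clusters
            continue
    return area
```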

The algorithm for calculating the uncertainty using convex hull areas is outlined in Algorithm 1. This algorithm takes as input a prompt, a temperature setting, and a model, computes embeddings, reduces their dimensionality, performs clustering, and finally computes and returns the convex hull areas.

Algorithm 1
figure a

Uncertainty Calculation Using Convex Hull Area
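Putting the pieces together, the following is a minimal sketch of Algorithm 1 that reuses the embed, reduce_to_2d, cluster, and total_hull_area helpers sketched above; generate_fn is a stand-in for a provider-specific LLM API call and is not part of the paper.

```python
import numpy as np

def uncertainty(prompt: str, temperature: float, generate_fn,
                n_responses: int = 30, method: str = "PCA") -> float:
    """Sketch of Algorithm 1: generate responses, embed, reduce,
    cluster, and return the total convex hull area A(p, t)."""
    responses = [generate_fn(prompt, temperature) for _ in range(n_responses)]
    embeddings = np.stack([embed(r).numpy() for r in responses])  # (30, 768)
    points_2d = reduce_to_2d(embeddings, method)
    labels = cluster(points_2d, temperature)
    return total_hull_area(points_2d, labels)
```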

In addition, there are still concerns regarding the computational costs of calculating convex hulls. These costs depend significantly on both the number of input points and the dimensionality of the space. Fortunately, with advanced computing technologies like GPUs and improved algorithms, the computational performance of convex hull calculations has been enhanced. The authors in ref. [34] propose a hybrid algorithm that addresses this issue by using a GPU-based filter to reduce the number of points and then utilizing a CPU to compute the final convex hull from the remaining points. The results indicate that the proposed hybrid algorithm significantly improves performance, i.e., a 10–27 times speedup for static point sets and a 22–46 times speedup for deforming point sets. In this study, the convex hull area is calculated in Python on two-dimensional points projected from fixed 768-dimensional BERT embeddings, with 30 responses for each prompt, clustered using the DBSCAN algorithm. Therefore, the computational cost associated with calculating convex hulls is not a concern in this study and does not depend on the complexity or length of the given prompt.

4 Experimental results and discussion

This section provides experimental results obtained by the convex hull-based UQ method using different dimensionality reduction techniques, i.e., PCA (Principal Component Analysis), MDS (Multidimensional Scaling), and Isomap for the selected LLMs, and a discussion on these results.

4.1 Experimental results

The experiments were conducted using three types of prompts: ’easy’, ’moderate’, and ’confusing’, with responses generated by three different models (Gemini-Pro, GPT-3.5-turbo, and GPT-4o) at various temperature settings (0.25, 0.5, 0.75, and 1.0). The results are presented in Fig. 2a–c, which show the uncertainty values (convex hull areas) plotted against the temperature settings for each model and prompt type across different dimensionality reduction techniques. Tables 2, 3, and 4 provide the statistical details, including the mean and standard deviation (std), for each model and prompt type at different temperature settings.

Fig. 2
figure 2

Convex hull areas and trends for different dimensionality reduction techniques a PCA, b Isomap, and c MDS

Figure 2a–c shows the convex hull areas and trends for each dimensionality reduction technique, i.e., PCA, Isomap, and MDS, based on the temperature setting, model, and prompt type. In this figure, the x-axis represents the temperature setting, while the y-axis shows the convex hull area. Figure 2a visualizes the trend of convex hull areas, i.e., uncertainty, for the PCA technique. As anticipated, the convex hull area values tend to increase with higher temperatures across all models and prompt types. This trend suggests that as the temperature increases, the LLMs’ responses become more diverse or uncertain, resulting in larger convex hull areas. Among the three models, Gemini-Pro with the confusing prompt type exhibits the highest convex hull area (i.e., uncertainty) at higher temperatures, with a value of 4.0 at a temperature setting of 1.0. This is followed by GPT-3.5-turbo with the confusing prompt type, which has a convex hull area of 2.4 at the same temperature setting. All other cases (LLMs with different prompt types) exhibit lower uncertainty, with convex hull areas of 1.0 or less. This indicates that GPT-4o can maintain low uncertainty even at higher temperature settings.

Figure 2b and c presents the results for the Isomap and MDS dimensionality reduction techniques, respectively. These figures show trends similar to those observed in the PCA results. Gemini-Pro and GPT-3.5-turbo with the confusing prompt type exhibit higher convex hull areas (i.e., uncertainty) at higher temperatures than the other cases, i.e., convex hull areas of 30 and 18 for Isomap, and 24 and 12 for MDS, respectively. The other cases exhibit lower uncertainty, with convex hull areas of 5.0 or less. Once again, GPT-4o maintained low uncertainty even at higher temperature settings.

Tables 2, 3, and 4 provide comprehensive results for understanding how temperature settings and prompt types influence the variability and consistency of each model’s responses, by examining the mean and standard deviation (std) of the convex hull areas.

Table 2 presents the mean and standard deviation of the convex hull areas for different prompt types across various temperature settings for three distinct models, namely Gemini-Pro, GPT-3.5-turbo, and GPT-4o, using the PCA dimensionality reduction technique. According to the results for Gemini-Pro, the mean convex hull area for easy prompts increases from 0.392 at a temperature setting of 0.25 to 2.548 at a temperature setting of 1.0, and the standard deviation increases from 0.095 to 0.378 over the same range. Similar trends are observed for moderate and confusing prompts, indicating how response variability changes with temperature settings for this model. For moderate prompts, the mean convex hull area increases from 0.308 at a temperature setting of 0.25 to 2.482 at a temperature setting of 1.0, with the corresponding standard deviation increasing from 0.230 to 1.050. For confusing prompts, the mean convex hull area increases from 0.417 at a temperature setting of 0.25 to 7.798 at a temperature setting of 1.0, with corresponding increases in the standard deviation. GPT-3.5-turbo shows a higher mean convex hull area for confusing prompts than for easy and moderate prompts, particularly noticeable at higher temperatures. For example, at a temperature of 1.0, the mean convex hull area for confusing prompts is 8.926 with a standard deviation of 3.809, indicating significant response variability. GPT-4o offers better performance compared to the other models. For example, the mean convex hull area for easy prompts starts at 0.285 at a temperature setting of 0.25 and increases to 2.221 at a temperature setting of 1.0. As expected, the mean convex hull area is higher for moderate and confusing prompts. The standard deviation increases similarly, indicating more dispersed responses at higher temperatures; for instance, from 0.085 at a temperature setting of 0.25 to 1.307 at a temperature setting of 1.0 for confusing prompts.

Table 2 Mean and standard deviation of convex hull areas for different prompt types and temperatures for PCA dimensionality reduction technique

Table 3 provides a comprehensive analysis of the mean and standard deviation (std) of convex hull areas for the selected language models and different prompt types at various temperature settings, similar to the previous case, but using the Isomap dimensionality reduction technique. The statistical metrics in this table indicate the typical response variability for each model, prompt type, and temperature. For Gemini-Pro, the mean convex hull areas increase with higher temperatures across all prompt types, except for moderate prompts at a temperature setting of 0.5. For example, the mean value for confusing prompts increases from 7.914 at a temperature setting of 0.25 to 29.966 at a temperature setting of 1.0, and the std value increases from 12.499 at 0.25 to 24.085 at 1.0. The results indicate significant variability, particularly at higher temperatures, suggesting that the model’s responses become more diverse as temperature increases. Similarly, the GPT-3.5-turbo model also shows high mean convex hull areas for confusing prompts, especially at higher temperatures. At a temperature setting of 1.0, the mean convex hull area for confusing prompts reaches 18.703, with a standard deviation of 13.801, highlighting significant variability in the model’s responses. For GPT-4o, the mean convex hull areas show a reasonable increase with temperature for confusing prompts, where the mean value increases from 0.956 at a temperature setting of 0.25 to 3.728 at a temperature setting of 1.0, and the std value increases from 0.697 at 0.25 to 2.432 at 1.0. However, for easy and moderate prompts, the statistical results appear counterintuitive. For example, the mean values would be expected to increase at higher temperature settings, but they decrease from 4.064 at a temperature setting of 0.25 to 3.181 at a temperature setting of 1.0.

Table 3 Mean and standard deviation of convex hull areas for different prompt types and temperatures for Isomap dimensionality reduction technique

Table 4 presents a comprehensive analysis of uncertainty for the responses generated by the selected LLMs and prompt types, in terms of the mean and standard deviation, using the MDS dimensionality reduction technique. For the Gemini-Pro model, the convex hull area varies with the prompt type. Easy prompts result in a mean of 2.437 with a standard deviation of 0.569 at a temperature setting of 0.25, and a mean of 4.957 with a standard deviation of 1.259 at a temperature setting of 1.0, indicating high variability in clustering. The statistical values are higher for confusing prompts, with the mean reaching 24.050 and the standard deviation 10.298 at a temperature setting of 1.0. As expected, easy prompts lead to smaller convex hull areas with more consistent cluster areas. The GPT-3.5-turbo model also shows high mean and standard deviation values for the convex hull areas, particularly for confusing prompts, e.g., a mean of 12.169 with a standard deviation of 4.651 at a temperature setting of 1.0. For the GPT-4o model, the clustering results are relatively stable, with a mean of 1.378 for easy prompts and 2.292 for moderate prompts at a temperature setting of 1.0. The mean convex hull area for confusing prompts is 2.047 with a standard deviation of 1.569 at a temperature setting of 0.25, and increases to 3.744 with a standard deviation of 1.881 at a temperature setting of 1.0, indicating moderate variability in response clustering compared to the results for easy and moderate prompts.

Table 4 Mean and standard deviation of convex hull areas for different prompt types and temperatures for MDS dimensionality reduction technique

The visualizations of the convex hull areas for the selected confusing prompt at a temperature setting of 1.0 are shown in Fig. 3a–c. Each subfigure illustrates the two-dimensional projection of the response embeddings after applying the PCA, Isomap, and MDS dimensionality reduction techniques, respectively. According to the figure, for the selected example case, the model responses were relatively homogeneous, converging into one main group with low variability. A single cluster with a smaller convex hull area suggests that the prompt was straightforward or unambiguous, leading the model to generate more consistent responses. Conversely, the presence of multiple clusters indicates significant variability in the model’s responses, suggesting that the prompt was either ambiguous or complex, allowing for multiple valid interpretations. A larger overall convex hull area reflects higher uncertainty in the model’s outputs, as the responses are spread across a wider semantic space.

Fig. 3
figure 3

Examples of the visualizations of the convex hull areas for a PCA, b Isomap, and c MDS dimensionality reduction techniques

Fig. 4
figure 4

The relationship between uncertainty and temperature settings based on convex hull analysis for a easy, b moderate, and c confusing prompts of GPT-3.5-turbo, GPT-4o, and Gemini-pro outputs

4.2 Discussion

This subsection provides a detailed analysis of the results obtained with the PCA technique for each prompt type, i.e., ’easy’, ’moderate’, and ’confusing’, shown in Fig. 4a–c, in terms of uncertainty values and temperature settings.

4.2.1 Easy prompts

Figure 4a illustrates the relationship between the uncertainty values and the temperature settings for easy prompts. As observed, the uncertainty values tend to increase with higher temperatures for all models, which is an expected pattern for LLMs and their responses. This trend indicates that as the temperature increases, the responses become more diverse, leading to larger convex hull areas. Among the three models, Gemini-pro exhibits the highest uncertainty at the highest temperature setting (a mean of 2.548 at 1.0), followed by GPT-3.5-turbo (a mean of 2.269 at 1.0) and GPT-4o (a mean of 2.221 at 1.0). GPT-4o also shows the lowest uncertainty at the lowest temperature setting, i.e., 0.285 at 0.25, indicating that it generates comparatively consistent responses under varying temperature settings.

4.2.2 Moderate prompts

The results for moderate prompts, shown in Fig. 4b, follow a pattern similar to those of easy prompts, with uncertainty values increasing as the temperature increases. However, the overall uncertainty values are higher for moderate prompts than for easy prompts, reflecting the increased complexity and variability of the LLMs’ responses. Gemini-Pro shows the highest uncertainty, particularly at high temperatures, i.e., a mean of 2.4819 at 1.0. GPT-3.5-turbo and GPT-4o also demonstrate increasing uncertainty with higher temperatures, reaching means of 2.2830 and 2.0614 at 1.0, respectively. As expected, all models tend toward low uncertainty values at low temperature settings, i.e., means of 0.3084, 1.1996, and 0.3003 at 0.25 for Gemini-Pro, GPT-3.5-turbo, and GPT-4o, respectively. However, GPT-3.5-turbo shows a higher uncertainty value (a mean of 1.1996) at a temperature setting of 0.25 compared to the other LLMs.

4.2.3 Confusing prompts

Figure 4c presents the uncertainty values for confusing prompts. As expected, the uncertainty is significantly higher for confusing prompts than for easy and moderate prompts, highlighting the challenge of generating consistent responses to complex and ambiguous queries. The increase in uncertainty with temperature is pronounced for confusing prompts, particularly for GPT-3.5-turbo and Gemini-pro at higher temperatures, i.e., means of 8.9260 and 7.7980 at 1.0, respectively. GPT-4o offers better performance, i.e., a mean of 3.1909 at 1.0, and tends to increase more gradually (between 0.2015 and 3.1909 at the 0.5 and 1.0 temperature settings, respectively) compared to the other two models. This indicates that GPT-3.5-turbo and Gemini-pro are more sensitive to high temperature settings, generating a wider range of responses when faced with confusing prompts.

5 Observations

The results from these experiments demonstrate the value of the proposed convex hull-based approach for UQ of LLMs’ responses. The convex hull areas provide a robust measure of the diversity and variability of LLMs’ responses, with higher values indicating greater uncertainty. The analysis of the selected cases can be extended using the detailed results. The main observations can be highlighted and categorized in terms of temperature setting, LLM comparison, prompt complexity, and dimensionality reduction:

  1. Temperature Setting: All models show increasing uncertainty at higher temperature settings, i.e., above 0.75. This is an expected observation, and it highlights the importance of temperature settings in controlling the randomness and diversity of generated responses.

  2. LLM Comparison: GPT-4o consistently offers better performance in terms of uncertainty compared to GPT-3.5-turbo and Gemini-pro. In contrast, GPT-3.5-turbo shows the weakest performance, particularly at higher temperature settings, generating more diverse responses.

  3. Prompt Complexity: The uncertainty values are higher for more complex prompts, i.e., confusing prompts, across all selected models, indicating the inherent difficulty in generating consistent responses to such queries.

  4. Dimensionality Reduction Techniques: Dimensionality reduction techniques project the original high-dimensional embeddings onto a two-dimensional subspace. For the selected techniques, i.e., PCA, Isomap, and MDS, the convex hull area results demonstrate similar trends, particularly for confusing prompts at higher temperature settings. These methods have no direct effect on uncertainty; however, they are valuable tools for computing convex hull areas, helping to understand and visualize response variability.

These observations emphasize the importance of temperature settings and prompt complexity in the uncertainty of LLMs’ responses. The proposed convex hull-based geometric approach provides a novel and effective metric to capture the uncertainty of LLMs’ responses, as well as valuable insight for the development and evaluation of current and future LLMs.

6 Conclusion

This study proposes a novel geometric approach to quantifying the uncertainty of LLMs. The proposed approach utilizes the spatial properties of convex hulls formed by embeddings of model responses to measure dispersion, particularly for the complex and high-dimensional text generated by LLMs. In this study, three types of prompts are used, i.e., ’easy’, ’moderate’, and ’confusing’, for three different LLMs, namely GPT-3.5-turbo, GPT-4o, and Gemini-pro, at various temperature settings, i.e., 0.25, 0.5, 0.75, and 1.0. The responses are transformed into high-dimensional embeddings using a BERT model and then projected into a 2D space using PCA, Isomap, and MDS. The DBSCAN algorithm is utilized to identify clusters, and the convex hull areas that surround these clusters are used as the metric of uncertainty. The proposed approach provides a clear and interpretable metric for developing more reliable LLMs. By understanding the factors that influence uncertainty, such as temperature settings and prompt complexity, researchers and engineers can develop and evaluate LLMs more effectively.

The experimental results indicated several key findings: (1) the uncertainty increased with higher temperature settings across all models, highlighting the role of temperature in controlling the diversity of generated responses, (2) GPT-4o consistently exhibited lower uncertainty values compared to GPT-3.5-turbo and Gemini-pro, indicating its ability to produce consistent responses even at higher temperature settings, and (3) the uncertainty values were significantly higher for more complex prompts, particularly confusing prompts, across all models, reflecting the inherent challenge in generating consistent responses to such queries.

For future work, the proposed geometric approach will be extended to other types of embeddings and clustering algorithms for more robust and reliable LLMs in critical domains, such as healthcare, finance, and law. In addition, integrating this uncertainty metric with other evaluation criteria, such as accuracy and coherence, could provide a more comprehensive assessment of LLMs’ performance.