Abstract
The dilemma that remained unsolved using Rao-Stirling diversity, namely how variety and balance can be combined into “dual concept diversity” (Stirling in SPRU electronic working paper series no. 28. http://www.sussex.ac.uk/Units/spru/publications/imprint/sewps/sewp28/sewp28.pdf, 1998, p. 48f.), can be clarified by using Nijssen et al.’s (Coenoses 13(1):33–38, 1998) argument that the Gini coefficient is a perfect indicator of balance. However, the Gini coefficient is not an indicator of variety; this latter term can be operationalized independently as relative variety. The three components of diversity (variety, balance, and disparity) can thus be clearly distinguished and independently operationalized as measures varying between zero and one. In the empirical case, the new diversity indicator spans a wider range of values and thus has more resolving power.
Introduction
Rao-Stirling diversity is increasingly used as a measure of interdisciplinarity in bibliometrics (e.g., Rafols and Meyer 2010; Leydesdorff et al. 2017; cf. Zhou et al. 2012). In a brief communication entitled “The Repeat Rate: From Hirschman to Stirling,” Ronald Rousseau argues that this index (Rao 1982) or its monotone transformations (Zhang et al. 2016) includes the three aspects of variety, balance, and disparity as distinguished, for example, by Stirling (2007) and Rafols and Meyer (2010). Rao-Stirling diversity, however, is defined in terms of two factors, as follows:
\(\Delta = \sum_{ij} (p_{i} p_{j})(d_{ij})\)   (1)
where \(d_{ij}\) is a disparity measure between two classes i and j, and \(p_{i}\) is the proportion of elements assigned to each class i.
I added the brackets in Eq. (1) to show that Rao-Stirling diversity is composed of two factors: the right-hand factor operationalizes disparity; the left-hand one is also known as the Hirschman–Herfindahl or Simpson index (footnote 1). It seems to me that two factors cannot cover three concepts unless one uses two words for the same operationalization. However, one can argue that the left-hand term of Eq. (1) measures both variety and balance.
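The two factors in Eq. (1) can be illustrated with a minimal sketch. It assumes that the sum runs over all ordered pairs (i, j) and that the disparity matrix has zeros on its main diagonal; the function names and the three-class example data are hypothetical and not taken from the study.

```python
def rao_stirling(counts, disparity):
    """Sum of (p_i * p_j) * d_ij over all ordered pairs (i, j)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    p = [c / total for c in counts]  # proportions per class
    n = len(p)
    return sum(p[i] * p[j] * disparity[i][j]
               for i in range(n) for j in range(n))


def simpson(counts):
    """Simpson (Hirschman-Herfindahl) index: sum of squared proportions."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Hypothetical portfolio with three classes and a hand-made disparity matrix.
counts = [2, 1, 1]
disparity = [[0.0, 0.5, 1.0],
             [0.5, 0.0, 0.5],
             [1.0, 0.5, 0.0]]
print(rao_stirling(counts, disparity))  # 0.4375
```

If every off-diagonal disparity is set to one, the sketch returns one minus the Simpson index, which shows that only the left-hand factor is then at work.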
Rousseau et al. (1999) already addressed the issue when they formulated as follows (at p. 213):
It is generally agreed that diversity combines two aspects: species richness and evenness. Disagreement arises at how these two aspects should be combined, and how to measure this combination, which is then called “diversity”.
How and why are these two aspects of diversity compared and integrated in the left-hand term of Eq. (1)? Following Junge (1994), Stirling (1998, at p. 48) suggests labeling this integration as “dual concept diversity” and notes that “to many authorities in ecology, dual concept diversity is synonymous with diversity itself.”
Using Fig. 1, Stirling (1998) shows the possible dilemma when combining the two “subordinate properties” into a single “dual concept” when he formulates as follows at p. 48:
Where variety is held to be the most important property, System C might reasonably be held to be most (dual concept) diverse. Where a greater priority is attached to the evenness in the balance between options, System A might be ranked highest. In addition, there are a multitude of possible intermediate possibilities, such as System B.
Stirling (1998) then discusses at length the possibility of using the Simpson index (Simpson 1949) or Shannon diversity (Shannon and Weaver 1949) for the measurement of “dual concept diversity” and concludes (on p. 57) that ‘there are good reasons to prefer the Shannon function as a robust general “non-parametric” measure of dual concept diversity’ (boldface and italics in the original). Nevertheless, the Simpson index is most frequently used in the literature for this purpose (Stirling 2007; footnote 2).
An alternative operationalization of diversity
In a study of the Lorenz curve as a graphical representation of “evenness” or “balance,” Nijssen et al. (1998) proved mathematically that both the Gini index and the coefficient of variation (that is, the standard deviation divided by the mean of the distribution or, in formula format, σ/μ) are perfect indicators of balance (Rousseau, personal communication, 16 March 2018). (The coefficient of variation is not bounded between zero and one.) Additionally, the Gini index is not a measure of variety (Rousseau 2018, p. 6).
Variety is the number of categories into which system elements are apportioned (Stirling 2007, p. 709), for example, the number of species (N) in an eco-system (MacArthur 1965). The problem with integrating this measure into an index of diversity is that N is not bounded between zero and one. I suggest solving this by using n/N, that is, the relative variety: n denotes the number of categories with values larger than zero, whereas N denotes the number of available categories. In the example elaborated below, among the 654 classes for patents in the so-called CPC classification, Amsterdam’s portfolio at the USPTO shows a value in 131 of them: the relative variety n/N is therefore 131/654 = 0.20.
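The relative variety n/N can be sketched in a few lines. The function name and the synthetic portfolio are hypothetical; only the counts 131 and 654 come from the text.

```python
def relative_variety(counts):
    """n / N: share of available categories with a value larger than zero."""
    n_available = len(counts)
    if n_available == 0:
        raise ValueError("at least one category is required")
    n_valued = sum(1 for c in counts if c > 0)
    return n_valued / n_available


# Synthetic stand-in for a portfolio with 131 valued classes out of 654.
portfolio = [1] * 131 + [0] * (654 - 131)
print(round(relative_variety(portfolio), 2))  # 0.2
```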
In the discussion about related and unrelated variety, Frenken et al. (2007) proposed Shannon entropy as a measure of “unrelated variety.” As a measure of “related variety” these authors use Theil’s (1972) decomposition algorithm for appreciating the grouping (cf. Leydesdorff 1991). However, this measure assumes the ex ante definition of relevant groups. The disparity matrix operates in terms of ecological distances and is not based on such a priori assumptions about structure (Izsák and Papp 1995). In other words, relatedness is already covered by the term \(d_{ij}\) in Eq. (1). Shannon entropy can be normalized relative to the maximum entropy and then varies between zero and one (or can be expressed as percentage entropy). If one wishes to appreciate not only the number of categories but also the values, Shannon entropy could be an alternative for measuring variety. Grouping is not advised, because the disparity measure already covers the ecological distances that can indicate relatedness.
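The normalization of Shannon entropy can be sketched as follows. The sketch assumes that the maximum entropy is the logarithm of the number of available categories (the text does not specify whether empty categories count) and uses base-2 logarithms; the base cancels out in the ratio. Function names are hypothetical.

```python
import math


def shannon_entropy(counts):
    """H = -sum(p_i * log2(p_i)) over categories with p_i > 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def normalized_entropy(counts):
    """H / H_max, with H_max = log2(number of available categories)."""
    n_available = len(counts)
    if n_available < 2:
        return 0.0  # H_max would be zero; no variety is possible
    return shannon_entropy(counts) / math.log2(n_available)


print(normalized_entropy([5, 5, 5, 5]))   # even distribution
print(normalized_entropy([10, 0, 0, 0]))  # fully concentrated distribution
```

An even distribution yields the maximum value of one, and a distribution concentrated in a single category yields zero.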
An empirical elaboration
If one wishes to consider the three aspects of diversity (variety, balance, and disparity) in a single measure equivalent to Rao-Stirling diversity, one can thus multiply the corresponding elements in the disparity matrix by the values of the Gini index and relative variety. All three factors are bounded between zero and one and are decomposable. (Note that the coefficient of variation is not bounded between zero and one.) One thus obtains the following diversity measure for each unit of analysis (e.g., city) c:
\(Div_{c} = \frac{n_{c}}{N} \cdot Gini_{c} \cdot \sum_{i \ne j} \frac{d_{ij}}{n_{c}(n_{c} - 1)}\)   (2)
The first term is the relative variety as defined above: the number of valued categories for this city (excluding zeros) divided by the total number of categories (in this case, 654; including zeros). The second term is the Gini coefficient of the vector of these \(n_{c}\) categories, and the third weights the disparity as a measure for each observation, permuting the cells i and j along the vector but excluding the main diagonal (footnote 3). The normalization in the third component is needed to ensure that the disparity values (e.g., the Euclidean distance or (1 − cosine)) function as weightings between zero and one. As in the case of Rao-Stirling diversity, the cosine values are taken from the symmetrical cosine matrix among the 654 column vectors of the asymmetrical matrix of 654 categories versus more than five million patents used by Leydesdorff et al. (2017) (footnote 4).
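The three terms described above can be combined in a short sketch. This is one possible reading of the description, not the routine distributed by the author: it assumes that the Gini coefficient is computed over the valued categories only, that disparity is (1 − cosine), and that the disparity sum over the off-diagonal cells of the valued categories is divided by the number of such cells. All names and the example data are hypothetical.

```python
def gini(values):
    """Rank-based Gini coefficient of a list of non-negative values."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    return sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1)) / (n * total)


def new_diversity(counts, similarity):
    """Relative variety * Gini * mean off-diagonal disparity (assumed reading)."""
    valued = [k for k, c in enumerate(counts) if c > 0]
    n_c = len(valued)
    if n_c < 2:
        return 0.0  # no off-diagonal cells to average over
    variety = n_c / len(counts)
    balance = gini([counts[k] for k in valued])
    disparity_sum = sum(1.0 - similarity[i][j]
                        for i in valued for j in valued if i != j)
    disparity = disparity_sum / (n_c * (n_c - 1))
    return variety * balance * disparity


# Hypothetical unit with four available categories, two of them valued.
counts = [3, 1, 0, 0]
similarity = [[1.0, 0.5, 0.2, 0.1],
              [0.5, 1.0, 0.3, 0.4],
              [0.2, 0.3, 1.0, 0.6],
              [0.1, 0.4, 0.6, 1.0]]
print(new_diversity(counts, similarity))  # 0.5 * 0.25 * 0.5 = 0.0625
```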
For the computation of the Gini coefficient, I follow Buchan’s (2002) simplification of the computation which the author formulated as follows:
The classical definition of G appears in the notation of the theory of relative mean difference:
\(G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \left| x_{i} - x_{j} \right|}{2 n^{2} \bar{x}}\)
where x is an observed value, n is the number of values observed and \(\bar{x}\) is the mean value.
If the x values are first placed in ascending order, such that each x has rank i, some of the comparisons above can be avoided and computation is quicker:
\(G = \frac{\sum_{i=1}^{n} (2i - n - 1)\, x_{i}}{n \sum_{i=1}^{n} x_{i}}\)
where x is an observed value, n is the number of values observed and i is the rank of values in ascending order.
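The two formulations can be compared in a small sketch. It assumes the standard relative-mean-difference form (all pairwise absolute differences divided by twice the squared number of values times the mean) and the rank-based shortcut over the sorted values; the function names are hypothetical.

```python
def gini_pairwise(values):
    """Classical form: sum of |x_i - x_j| over all pairs / (2 * n^2 * mean)."""
    n = len(values)
    mean = sum(values) / n
    diff = sum(abs(a - b) for a in values for b in values)
    return diff / (2 * n * n * mean)


def gini_ranked(values):
    """Rank-based form: sum((2i - n - 1) * x_i) / (n * sum(x)), x ascending."""
    xs = sorted(values)
    n = len(xs)
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * sum(xs))


values = [4, 1, 3, 2]
print(gini_pairwise(values), gini_ranked(values))  # 0.25 0.25
```

The pairwise form needs n × n comparisons, whereas the ranked form needs one sort and a single pass. A vector of equal values gives G = 0, and a vector with a single non-zero value gives (n − 1)/n.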
In the following example from Leydesdorff et al. (2017), disparity is measured as (1 − cosine) between each pair of distributions (Jaffe 1989). In that study, we compared 20 cities (four cities each in five countries) in terms of the Rao-Stirling diversity of their patent portfolios, operationalized as patents granted by the USPTO in 2016. The results are provided in Table 5 (at p. 1584) of that study and compared below in Table 1 with the values for the new indicator in the right-hand column.
Whereas the left-hand ranking is counter-intuitive in placing Rotterdam and Jerusalem above, for example, Shanghai and Beijing, these latter two cities are attributed the highest rankings using the new indicator. Furthermore, the Rao-Stirling diversity ranges from 0.50 (Wageningen) to 0.83 (Paris), whereas the new diversity index ranges from 0.03 (Marseille) to 0.74 (Shanghai). Figure 2 shows these ranges graphically. The new diversity measure has a stronger resolving power than Rao-Stirling diversity.
The cities under study were chosen so that one could expect differences among them; however, these were smaller than expected using Rao-Stirling diversity. For example, Boston and Rotterdam had the same value on this indicator. Using the new diversity measure, however, the diversity of the portfolio of Boston is more than three times higher than that of Rotterdam.
Table 2 provides the relevant correlations among these twenty cities: Spearman’s rank-order correlations are shown in the upper triangle and Pearson correlations in the lower triangle. As could be expected, Rao-Stirling diversity correlates with the Simpson index and Shannon diversity, but not with the Gini coefficient (footnote 5). The new diversity measure is not significantly correlated with Rao-Stirling diversity or the Simpson index, but it is, not surprisingly, correlated with the Gini coefficient and with variety; in addition to disparity, these two factors are constitutive of diversity in this approach.
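For readers who wish to reproduce this kind of comparison on their own data, the two correlation coefficients can be sketched in plain Python. The sketch assumes that Spearman’s coefficient is computed as the Pearson correlation of average ranks (so that ties share a rank); the function names and the example values are hypothetical and are not the values of Table 2.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation as Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical indicator values for four units of analysis.
indicator_a = [0.10, 0.40, 0.35, 0.80]
indicator_b = [0.02, 0.30, 0.20, 0.90]
print(pearson(indicator_a, indicator_b), spearman(indicator_a, indicator_b))
```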
Conclusions and discussion
The dilemma that remained unsolved using Rao-Stirling diversity, namely how variety and balance can be combined into “dual concept diversity” (Stirling 1998, p. 48f.), can be clarified using Nijssen et al.’s (1998) argument that the Gini coefficient is a perfect indicator of balance. Since the Gini coefficient is not an indicator of variety, this latter term can be operationalized as relative variety and thus be bounded between zero and one. The three components of diversity (variety, balance, and disparity) can thus be clearly distinguished and independently operationalized as measures varying between zero and one. In the empirical case, the new diversity indicator spans a wider range of values and thus has more resolving power. However, the new diversity indicator did not correlate with Rao-Stirling diversity.
I do not wish to argue for this diversity measure beyond its status as another indicator. In contrast to the confusion that has prevailed hitherto, however, the new indicator is based on the solution made possible by Nijssen et al.’s (1998) proof and Stirling’s (1998) analysis of the literature. The independent operationalization of the three aspects of diversity distinguished by Stirling (1998, 2007) provides a more reliable ground than “dual” or higher-order concepts. A routine is provided at http://www.leydesdorff.net/software/diverse for computing both Rao-Stirling diversity and this new indicator (see the Appendix).
The diversity issue is important for the measurement of interdisciplinarity and knowledge integration in science and technology studies. However, the further elaboration of this relevance requires yet another discussion (e.g., Wagner et al. 2011). In Leydesdorff et al. (2018), for example, we argued that a high diversity in citing patterns (measured as Rao-Stirling diversity) may indicate esoteric originality at the journal level and perhaps trans-disciplinarity more than knowledge integration. Uzzi et al. (2013), by contrast, considered atypical combinations in citing behavior at the paper level as an indication of novelty.
Notes
\(\sum_{ij} p_{i} p_{j} = 1\) when taken over all i and j. The Simpson index is equal to \(\sum_{i} p_{i}^{2}\), and the Gini-Simpson index to \(1 - \sum_{i} p_{i}^{2}\).
If one wished, one could replace the variety measure with the Shannon function.
A routine for the computation can be found at http://www.leydesdorff.net/software/diverse (see the Appendix).
As can be expected, the coefficient of variation correlated significantly with the Gini coefficient: both Spearman’s rank-order correlation and the Pearson correlation are .94 (p < .01; n = 20).
References
Buchan, I. (2002). Calculating the Gini coefficient of inequality. https://www.nibhi.org.uk/Training/Statistics/Gini%20coefficient.doc.
Frenken, K., Van Oort, F., & Verburg, T. (2007). Related variety, unrelated variety and regional economic growth. Regional Studies, 41(5), 685–697.
Hill, M. O. (1973). Diversity and evenness: A unifying notation and its consequences. Ecology, 54(2), 427–432.
Izsák, J., & Papp, L. (1995). Application of the quadratic entropy indices for diversity studies of drosophilid assemblages. Environmental and Ecological Statistics, 2(3), 213–224.
Jaffe, A. B. (1989). Characterizing the “technological position” of firms, with application to quantifying technological opportunity and research spillovers. Research Policy, 18(2), 87–97.
Junge, K. (1994). Diversity of ideas about diversity measurement. Scandinavian Journal of Psychology, 35(1), 16–26.
Leydesdorff, L. (1991). The static and dynamic analysis of network data using information theory. Social Networks, 13(4), 301–345.
Leydesdorff, L., Kogler, D. F., & Yan, B. (2017). Mapping patent classifications: portfolio and statistical analysis, and the comparison of strengths and weaknesses. Scientometrics, 112(3), 1573–1591.
Leydesdorff, L., Wagner, C. S., & Bornmann, L. (2018). Betweenness and diversity in journal citation networks as measures of interdisciplinarity–A tribute to Eugene Garfield. Scientometrics, 114(2), 567–592. https://doi.org/10.1007/s11192-017-2528-2.
MacArthur, R. H. (1965). Patterns of species diversity. Biological Reviews, 40(4), 510–533.
Nijssen, D., Rousseau, R., & Van Hecke, P. (1998). The Lorenz curve: A graphical representation of evenness. Coenoses, 13(1), 33–38.
Rafols, I., & Meyer, M. (2010). Diversity and network coherence as indicators of interdisciplinarity: Case studies in bionanoscience. Scientometrics, 82(2), 263–287.
Rao, C. R. (1982). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā: The Indian Journal of Statistics, Series A, 44(1), 1–22.
Rousseau, R. (2018). The repeat rate: From Hirschman to Stirling. Under submission.
Rousseau, R., Van Hecke, P., Nijssen, D., & Bogaert, J. (1999). The relationship between diversity profiles, evenness and species richness based on partial ordering. Environmental and Ecological Statistics, 6(2), 211–223.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Simpson, E. H. (1949). Measurement of diversity. Nature, 163(4148), 688.
Stirling, A. (1998). On the economics and analysis of diversity. SPRU Electronic Working Paper Series No. 28. http://www.sussex.ac.uk/Units/spru/publications/imprint/sewps/sewp28/sewp28.pdf.
Stirling, A. (2007). A general framework for analysing diversity in science, technology and society. Journal of the Royal Society, Interface, 4(15), 707–719.
Theil, H. (1972). Statistical decomposition analysis. Amsterdam: North-Holland.
Uzzi, B., Mukherjee, S., Stringer, M., & Jones, B. (2013). Atypical combinations and scientific impact. Science, 342(6157), 468–472.
Wagner, C. S., Roessner, J. D., Bobb, K., Klein, J. T., Boyack, K. W., Keyton, J., et al. (2011). Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature. Journal of Informetrics, 5(1), 14–26.
Zhang, L., Rousseau, R., & Glänzel, W. (2016). Diversity of references as an indicator for interdisciplinarity of journals: Taking similarity between subject fields into account. Journal of the Association for Information Science and Technology, 67(5), 1257–1265. https://doi.org/10.1002/asi.23487.
Zhou, Q., Rousseau, R., Yang, L., Yue, T., & Yang, G. (2012). A general framework for describing diversity within systems and similarity between systems with applications in informetrics. Scientometrics, 93(3), 787–812.
Acknowledgement
I thank Ronald Rousseau for comments and stimulating discussions about previous versions of this communication.
Appendix
The program div.exe can be retrieved at http://www.leydesdorff.net/software/diverse.
Input files are:
- Matrix.csv contains the data to be analyzed. Div.exe analyzes column vectors. The file needs to be in .csv (comma-separated values) style and saved as MS-DOS. The file should not contain a header with variable labels, but only numerical information. For example:
  0,2,0,0,0
  2,1,0,0,5
  0,0,0,0,0
  0,0,0,0,0
  27,0,0,27
  0,0,0,0,0
  0,0,0,0,0
  0,0,0,0,0
  0,0,0,0,0
  0,0,8,5,0
In the case under study in this paper, twenty cities are compared in terms of 654 classes of patents. The matrix has twenty columns and 654 rows.
- Sim.csv contains a symmetrical similarity matrix (e.g., cosine values) in csv-format without a header. For example:
  1.0000,0.6270,0.3146,0.1280,0.1564
  0.6270,1.0000,0.1319,0.0777,0.2190
  0.3146,0.1319,1.0000,0.4214,0.1322
  0.1280,0.0777,0.4214,1.0000,0.0865
  0.1564,0.2190,0.1322,0.0865,1.0000
In the case under study in this paper, the comparison is in terms of 654 classes. The cosine matrix is a symmetrical (1-mode) matrix of 654 * 654 cells with ones on the main diagonal. This file can be retrieved at https://www.leydesdorff.net/cpc_cos/portfolio/cos_cpc.dbf. (Save the file from https://www.leydesdorff.net/cpc_cos/portfolio/ using the right mouse button.)
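Before running an analysis, the two input files can be checked with a short script. The sketch below is not part of div.exe; it assumes header-free, comma-separated numeric input as described above and that the similarity matrix has one row per category (row) of the data matrix. The function names and the small in-memory example are hypothetical.

```python
import csv
import io


def read_numeric_csv(text):
    """Parse header-free CSV text into rows of floats; reject ragged rows."""
    rows = [[float(cell) for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("rows differ in length")
    return rows


def column_vectors(rows):
    """Transpose the matrix so that each unit of analysis is one vector."""
    return [list(column) for column in zip(*rows)]


def check_similarity(sim, n_categories, tol=1e-9):
    """Require a square, symmetrical matrix with ones on the main diagonal."""
    if len(sim) != n_categories:
        raise ValueError("similarity matrix does not match the categories")
    for i, row in enumerate(sim):
        if len(row) != n_categories:
            raise ValueError("similarity matrix is not square")
        if abs(row[i] - 1.0) > tol:
            raise ValueError("main diagonal should contain ones")
        for j in range(i):
            if abs(row[j] - sim[j][i]) > tol:
                raise ValueError("similarity matrix is not symmetrical")


# Hypothetical in-memory stand-ins for the two input files.
matrix_text = "0,2,0\n2,1,5\n0,0,8\n"
sim_text = ("1.0000,0.6270,0.3146\n"
            "0.6270,1.0000,0.1319\n"
            "0.3146,0.1319,1.0000\n")

rows = read_numeric_csv(matrix_text)
check_similarity(read_numeric_csv(sim_text), n_categories=len(rows))
print(column_vectors(rows))
```

Reading from files instead of strings only requires passing the file contents, for example `read_numeric_csv(open(path).read())`, where `path` is a placeholder for the location of the file.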
The output file diverse.dbf contains the following information for each vector:
- The first column contains the number of the column vector of matrix.csv analyzed;
- Rao-Stirling diversity;
- Diversity as defined in this study;
- Gini;
- Simpson;
- Shannon;
- Hmax;
- Variety;
- Total number of cases;
- Number of cases with a value larger than zero.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Leydesdorff, L. Diversity and interdisciplinarity: how can one distinguish and recombine disparity, variety, and balance?. Scientometrics 116, 2113–2121 (2018). https://doi.org/10.1007/s11192-018-2810-y
DOI: https://doi.org/10.1007/s11192-018-2810-y