Abstract
The investigation of the dynamics of national disciplinary profiles is at the forefront of quantitative investigations of science. We propose a new approach to investigate the complex interactions among scientific disciplinary profiles. The approach is based on recent pseudo-likelihood techniques introduced in the framework of machine learning and complex systems. We infer, in a Bayesian framework, the network topology and the related interdependencies among national disciplinary profiles. We analyse data extracted from the InCites database on the national scientific production, at the disciplinary level, of the most productive countries in the world over the period 1992–2016.




Notes
InCites is a product of Clarivate Analytics. Further information is available at https://clarivate.com/products/incites/.
The elaborations reported in this paper are based on indicators exported on 2018-02-26 from the InCites dataset updated on 2018-02-10, which includes Web of Science content indexed through 2017-12-31.
The analysed countries are: Argentina (ARG), Australia (AUS), Austria (AUT), Belgium (BEL), Brazil (BRA), Bulgaria (BGR), Canada (CAN), Chile (CHL), China Mainland (CHN), Colombia (COL), Croatia (HRV), Denmark (DNK), Egypt (EGY), Finland (FIN), France (FRA), Germany (DEU), Greece (GRC), Hong Kong (HKG), Hungary (HUN), India (IND), Iran (IRN), Ireland (IRL), Israel (ISR), Italy (ITA), Japan (JPN), Malaysia (MYS), Mexico (MEX), Netherlands (NLD), New Zealand (NZL), Norway (NOR), Pakistan (PAK), Poland (POL), Portugal (PRT), Romania (ROU), Russia (RUS), Saudi Arabia (SAU), Singapore (SGP), Slovenia (SVN), South Africa (ZAF), South Korea (KOR), Spain (ESP), Sweden (SWE), Switzerland (CHE), Taiwan (TWN), Thailand (THA), Turkey (TUR), Ukraine (UKR), United Kingdom (GBR), USA (USA).
The proportionality constant for \(Z_i(\{ J\})\) and \(<\mathbf{s }_i\cdot \mathbf{s }_j>_{i,\{J\}}\) is the same.
References
Antonelli, C., Franzoni, C., & Geuna, A. (2011). The organization, economics, and policy of scientific research: What we do know and what we don't know - an agenda for research. Industrial and Corporate Change, 20(1), 201–213.
Albert, R., & Barabási, A. L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47–97.
Aurell, E., & Ekeberg, M. (2012). Inverse Ising inference using all the data. Physical Review Letters, 108(9), 090201.
Azoulay, P., Graff Zivin, J. S., & Manso, G. (2011). Incentives and creativity: Evidence from the academic life sciences. The Rand Journal of Economics, 42(3), 527–554.
Banerjee, O., El Ghaoui, L., & d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Barabási, A. L. (2016). Network science. Cambridge: Cambridge University Press.
Barber, D. (2012). Bayesian reasoning and machine learning. Cambridge: Cambridge University Press.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B, 48(3), 259–279.
Bongioanni, I., Daraio, C., & Ruocco, G. (2014). A quantitative measure to compare the disciplinary profiles of research systems and their evolution over time. Journal of Informetrics, 8(3), 710–727.
Bongioanni, I., Daraio, C., Moed, H. F., & Ruocco, G. (2015). Comparing the disciplinary profiles of national and regional research systems by extensive and intensive measures. In A. A. Salah, Y. Tonta, A. A. Akdag Salah, C. Sugimoto, & U. Al (Eds.), Proceedings of ISSI 2015: 15th International Society of Scientometrics and Informetrics Conference, Istanbul, Turkey, 29 June to 3 July, 2015 (pp. 684–696). Bogaziçi University Printhouse.
Bornmann, L., & Leydesdorff, L. (2013). Macro-indicators of citation impacts of six prolific countries: InCites data and the statistical significance of trends. PLoS One, 8(2), e56768.
Brush, S. G. (1967). History of the Lenz–Ising Model. Reviews of Modern Physics, 39, 883–893.
Daraio, C., Fabbri, F., Gavazzi, G., Izzo, M. G., Leuzzi, L., Quaglia, G., et al. (2017). Assessing the interdependencies between scientific disciplinary profiles at the country level: A pseudo-likelihood approach. In Proceedings of ISSI 2017: The 16th international conference on scientometrics and informetrics (pp. 1448–1459). China: Wuhan University.
Decelle, A., Ricci-Tersenghi, F., & Zhang, P. (2016). Data quality for the inverse Ising problem. Journal of Physics A: Mathematical and Theoretical, 49, 384001.
de Solla Price, D. J. (1965). Networks of scientific papers. Science, 149, 510–515.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gingras, Y., & Khelfaoui, M. (2017). Assessing the effect of the United States’ “citation advantage” on other countries’ scientific impact as measured in the Web of Science (WoS) database. Scientometrics, 114, 517–532.
Glänzel, W. (2000). Science in Scandinavia: A bibliometric approach. Scientometrics, 48, 121–150.
Glänzel, W., Debackere, K., & Meyer, M. (2008). Triad or tetrad? On global changes in a dynamic world. Scientometrics, 74, 71–88.
Glänzel, W., & Schlemmer, B. (2007). National research profiles in a changing Europe (1983–2003). An exploratory study of sectoral characteristics in the Triple Helix. Scientometrics, 70(2), 267–275.
Glänzel, W., Leta, J., & Thijs, B. (2006). Science in Brazil. Part 1: A macro-level comparative study. Scientometrics, 67(1), 67–86.
Guns, R. (2014). Link prediction. In Measuring scholarly impact: Methods and practice (pp. 35–55). New York: Springer.
Greig, D. M., Porteous, B. T., & Seheult, A. H. (1989). Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society B, 51, 271–279.
Hu, X. J., & Rousseau, R. (2009). Comparative study of the difference in research performance in biomedical fields among selected Western and Asian countries. Scientometrics, 81(2), 475–491.
Hyvarinen, A. (2006). Consistency of pseudolikelihood estimation of fully visible Boltzmann machines. Neural Computation, 18, 2283–2292.
Judge, G. G., & Mittelhammer, R. C. (2011). An information theoretic approach to econometrics. Cambridge: Cambridge University Press.
Krapivsky, P. L., Redner, S., & Ben-Naim E. (2010). A kinetic view of statistical physics. Cambridge: Cambridge University Press.
King, D. A. (2004). The scientific impact of nations. Nature, 430(6997), 311–316.
Latour, B. (2005). Reassembling the social: An introduction to actor-network-theory. Oxford: Oxford University Press.
Leydesdorff, L., & Zhou, P. (2005). Are the contributions of China and Korea upsetting the world system of science? Scientometrics, 63(3), 617–630.
Li, N. (2017). Evolutionary patterns of national disciplinary profiles in research: 1996–2015. Scientometrics, 111(1), 493–520.
Marruzzo, A., Tyagi, P., Antenucci, F., Pagnani, A., & Leuzzi, L. (2017). Inverse problem for multi-body interaction of nonlinear waves. Scientific Reports, 7(1), 3463.
Mezard, M., & Montanari, A. (2009). Information, physics, and computation. Oxford: Oxford University Press.
Neal, R. M. (1993). Probabilistic Inference Using Markov Chain Monte Carlo. Technical Report CRG-T3-93-1. Department of Computer Science, University of Toronto.
Nederhof, A. J. (1988). The validity and reliability of evaluation of scholarly performance. In A. F. J. Van Raan (Ed.), Handbook of quantitative studies of science and technology, chapter 7 (pp. 193–228). London: Elsevier Science Pub Co.
Newman, M. E. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167–256.
Nguyen, H. C., Zecchina, R., & Berg, J. (2017). Inverse statistical problems: From the inverse Ising problem to data science. arXiv:1702.01522v2.
Radosevic, S., & Yoruk, E. (2014). Are there global shifts in the world science base? Analysing the catching up and falling behind of world regions. Scientometrics, 101(3), 1897–1924.
Ravikumar, P., Wainwright, M. J., & Lafferty, J. D. (2010). High-dimensional Ising model selection using \(\ell_1\)-regularized logistic regression. The Annals of Statistics, 38(3), 1287–1319.
Schubert, A., Glänzel, W., & Braun, T. (1989). Scientometric datafiles. A comprehensive set of indicators on 2649 journals and 96 countries in all major science fields 1981–1985. Scientometrics, 16(1–6), 3–478.
Shen, Z., Yang, L., Pei, J., Li, M., Wu, C., Bao, J., et al. (2016). Interrelations among scientific fields and their relative influences revealed by an input output analysis. Journal of Informetrics, 10(1), 82–97.
Shi, F., Foster, J. G., & Evans, J. A. (2015). Weaving the fabric of science: Dynamic network models of science's unfolding structure. Social Networks, 43, 73–85.
Tian, Y., Wen, C., & Hong, S. (2008). Global scientific production on GIS research by bibliometric analysis from 1997 to 2006. Journal of Informetrics, 2, 65–74.
Tyagi, P., Marruzzo, A., Pagnani, A., Antenucci, F., & Leuzzi, L. (2016). Regularization and decimation pseudolikelihood approaches to statistical inference in XY spin models. Physical Review B, 94(2), 024203.
West, J. D., & Vilhena, D. A. (2014). A network approach to scholarly evaluation. In B. Cronin, & C. R. Sugimoto (Eds.), Beyond bibliometrics (pp. 151–166). MIT Press.
Wong, C. Y. (2013). On a path to creative destruction: Science, technology and science-based technological trajectories of Japan and South Korea. Scientometrics, 96, 323–336.
Wong, C. Y., & Goh, K. L. (2012). The pathway of development: science and technology of NIEs and selected Asian emerging economies. Scientometrics, 92, 523–548.
Yang, L. Y., Yue, T., Ding, J. L., & Han, T. (2012). A comparison of disciplinary structure in science between the G7 and the BRIC countries by bibliometric methods. Scientometrics, 93, 497–516.
Zhou, P., & Leydesdorff, L. (2006). The emergence of China as a leading nation in science. Research Policy, 35(1), 83–104.
Acknowledgements
The present study is an extended version of an article presented at the 16th International Conference on Scientometrics and Informetrics, Wuhan (China), 16–20 October 2017 (Daraio et al. 2017). This work was supported by the projects Sapienza Awards No. 6H15XNFS and No. PH11715C8239C105.
Appendix
In this appendix we discuss the methodology used in this work to obtain the parameters that maximize the Log-Likelihood function introduced in the paper. First, we discuss the general grounds for the validity of the method. Second, we deal with its application to the specific case.
Given the data set \(\{\textbf{s} ^{\mu},\mu=1,2,\ldots ,M \}\), assuming that the observations are independent, and once the generative model has been defined, the Log-Likelihood function \(l(\{J\})\) becomes
where \(\mu =1,\ldots ,M\) is the label for a set of data. The inference problem consists in determining the set of parameters \(\{ J\}\) which maximizes the function in Eq. 5.
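For independent samples drawn from a Boltzmann-type model \(p(\mathbf{s}|\{J\}) = e^{-H(\{\mathbf{s}\}|\{J\})}/Z(\{J\})\), with the cost function \(H\) and the partition function \(Z\) introduced in this appendix, the Log-Likelihood function has the standard form sketched below. The \(1/M\) normalization is an assumption made here for illustration; it rescales the function without changing the maximizing set \(\{J\}\).

```latex
l(\{J\}) = \frac{1}{M}\sum_{\mu=1}^{M} \log p(\mathbf{s}^{\mu}|\{J\})
         = -\frac{1}{M}\sum_{\mu=1}^{M} H(\{\mathbf{s}^{\mu}\}|\{J\}) - \log Z(\{J\})
```

The second term is what makes the direct maximization expensive, since its gradient contains \(\frac{1}{Z(\{ J\})}\frac{\partial }{\partial J_{ij}}Z(\{ J\})\).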
We consider here the expression of the cost function (Hamiltonian) for a multicomponent variable \(\mathbf{s }_i=(s_i^1,\ldots , s_i^{\gamma },\ldots ,s_i^D)\), \(H(\{ \mathbf{s }\}|\{ J\})\), given by
with \(J_{ij}=J_{ji}\). The symbol “\(\cdot\)” in Eq. 6 stands for a scalar product. The presence of a scalar product ensures that orthogonal or quasi-orthogonal vectors (i.e. countries which have a number of publications, however large, but in different fields) will have a small weight in the cost function. The sum extends over all pairs of nodes (i, j) with \(i \ne j\). The partition function \(Z(\{ J\})\) is
where the sum extends over all possible configurations in the phase space of the set of variables \(\{ \mathbf{s }\}\).
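The role of the scalar product can be illustrated with a minimal Python sketch. It assumes the pairwise form \(H=-\frac{1}{2}\sum _{i\ne j} J_{ij}\, \mathbf{s }_i\cdot \mathbf{s }_j\), with the prefactor and sign taken by analogy with the two-discipline Hamiltonian given at the end of this appendix; the function name `hamiltonian` and the toy values are hypothetical.

```python
import numpy as np

def hamiltonian(s, J):
    """Assumed cost function H = -1/2 * sum_{i != j} J_ij * (s_i . s_j).

    s: array of shape (N, D), one D-component profile per node (country).
    J: symmetric array of shape (N, N); the diagonal is ignored.
    """
    s = np.asarray(s, dtype=float)
    J = np.asarray(J, dtype=float)
    gram = s @ s.T                           # gram[i, j] = s_i . s_j
    off_diag = ~np.eye(len(s), dtype=bool)   # keep only pairs with i != j
    return -0.5 * np.sum(J[off_diag] * gram[off_diag])

# Two countries publishing in disjoint fields (orthogonal profiles)
orthogonal = np.array([[1.0, 0.0], [0.0, 1.0]])
# Two countries with identical profiles
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])
J = np.array([[0.0, 0.8], [0.8, 0.0]])

h_orthogonal = hamiltonian(orthogonal, J)   # scalar products vanish
h_aligned = hamiltonian(aligned, J)         # coupling contributes fully
```

With orthogonal profiles the coupling does not contribute to the cost function at all, whereas aligned profiles contribute with the full weight of \(J_{ij}\).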
The calculation of the above partition function is too demanding from a computational point of view already for a small number of variables. For this reason, we resort to the pseudo-likelihood approximation (Aurell and Ekeberg 2012, Tyagi et al. 2016, Marruzzo et al. 2017). It consists in maximizing a Pseudo-Log-Likelihood function based on the local conditional Log-Likelihood function at each node (see Eq. 10) in place of the Log-Likelihood function. It is possible to show that the estimation of the parameters obtained by a Pseudo-Log-Likelihood maximization is consistent with the one obtained by the maximization of the Log-Likelihood function, that is the two functions are maximized by the same set of parameters. The hypothesis under which this statement holds, i.e. the strict concavity of the Pseudo-Log-Likelihood function with respect to the elements of the set of parameters, is not too strict (see Hyvarinen 2006). Furthermore it is possible to show that under such a hypothesis the Pseudo-Log-Likelihood maximization is exact (i.e. equivalent to the Log-Likelihood maximization) in the case of infinite sampling (Aurell and Ekeberg 2012). An important advantage of the Pseudo-Log-Likelihood function is that it is possible to maximize it in polynomial time.
According to the Pseudo-Log-Likelihood approach, we consider the likelihood built on the local conditional probability of each variable i, one at a time. Instead of working with Eq. (5) directly, the cost function (Eq. 6) is first rewritten as
where \(\mathbf{s }_{\backslash i}\) indicates the set of all input-variables except the ith. The functions \(\mathbf{A }_i(\{J\})=\frac{1}{2} \sum _{j }^{1,N} J_{ij} \mathbf{s }_j\) and \(\mathbf{B }_{i,k}(\{J\})=\frac{1}{2} \sum _{j \ne i}^{1,N} J_{ij} \mathbf{s }_j\) have been introduced in Eq. 8. The cost functions \(H_i (\mathbf{s }_i | \{ \mathbf{s }_{\backslash i} \},\{ J\})\) and \(H_{\backslash i} (\{ \mathbf{s }_{\backslash i}\}|\{ J\})\) are implicitly defined in the same equation. Analogously we can rewrite the partition function as
The local conditional probability at the ith node is
and the local partition function is \(Z_i(\{ J\})=\sum _{\{\mathbf{s }_i\}}e^{-H_i (s_i | \{ \mathbf{s }_{\backslash i} \},\{ J\})}\). By defining \(l'( \mathbf{s }_i|\{ \mathbf{s }_{\backslash _i}\}|\{ J\})=\log [p(\mathbf{s }_i | \{\mathbf{s }_{\backslash i} \},\{ J\})]\), the Pseudo-Log-Likelihood function is defined as
The gradient of the Pseudo-Log-Likelihood function with respect to the parameter \(J_{ij}\) is given by
where \(<\ \ >_{i,\{J\}}\) stands for the ensemble average calculated over the probability distribution \(p(\mathbf{s }_i | \{\mathbf{s }_{\backslash i} \},\{ J\})\). Looking now at the gradient of the Log-Likelihood function, \(l(\{J\})\), we observe that it is possible to rephrase the term \(\frac{1}{Z(\{ J\})}\frac{\partial }{\partial J_{ij}}Z(\{ J\})\) as
Finally we obtain
By comparing Eqs. 12 and 14 it is possible to infer that in the limit \(M \rightarrow \infty\), i) both the gradients go to zero for the set of parameters \(\{ J\}\) generating the observed data, ii) \(\frac{\partial }{\partial J_{ij}} \lambda (\{ J\}) \rightarrow \frac{\partial }{\partial J_{ij}} l(\{ J\})\). This finally establishes the consistency of the maximum Pseudo-Log-Likelihood estimator. We observe, furthermore, its coincidence with the maximum Log-Likelihood estimator in the limit \(M \rightarrow \infty\).
The gradient of the Pseudo-Log-Likelihood function can be calculated exactly, which facilitates the numerical solution of the inference problem. The explicit expression of \(\frac{\partial }{\partial J_{ij}} \lambda (\{ J\})\) is reported in the following.
To deal with a smaller number of parameters at a time, instead of maximizing the full Pseudo-Log-Likelihood function, given by the sum of the single-node Pseudo-Log-Likelihood functions (Eq. 11), we maximize each single-node Pseudo-Log-Likelihood function separately. Since the couplings should be symmetric, the final estimate of the \(J_{ij}\) parameter is obtained by taking the average \((J_{ij}+J_{ji})/2\).
With a standard Pseudo-Log-Likelihood maximization some couplings can be substantially overestimated. To avoid this drawback we used an \(l_2\) regularizer (Ravikumar et al. 2010), i.e. in place of maximizing the \(\lambda (\{ J\})\) function we maximize the function \(\lambda (\{ J\})-l_2(\sum _{i,j}J_{ij}^2)^{1/2}\), where \(l_2\) is a suitably chosen constant.
The maximization of the single-node Pseudo-Log-Likelihood functions has been performed by means of the MATLAB function fminunc, selecting its trust-region optimization algorithm.
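The node-by-node procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in this work: the conditional density of node i is taken proportional to \(e^{\mathbf{s }_i\cdot \mathbf{h }_i}\) on \([-1,1]^D\) with local field \(\mathbf{h }_i=\sum _{j\ne i}J_{ij}\mathbf{s }_j\), the regularizer is applied to the couplings of one node at a time, SciPy's L-BFGS-B with finite-difference gradients replaces the MATLAB trust-region routine, and the data are synthetic placeholders. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def log_local_partition(h):
    """log(2*sinh(h)/h) per component, stable near h = 0 and for large |h|."""
    a = np.abs(h)
    small = a < 1e-4
    safe = np.where(small, 1.0, a)           # avoid log(0) in the unused branch
    exact = safe + np.log1p(-np.exp(-2.0 * safe)) - np.log(safe)
    return np.where(small, np.log(2.0) + a**2 / 6.0, exact)

def neg_node_objective(row, i, data, l2):
    """Negative of (single-node pseudo-log-likelihood - l2 * ||row||_2).

    row:  the N-1 couplings J_ij, j != i, of node i.
    data: array of shape (M, N, D) with entries in [-1, 1].
    """
    others = np.delete(data, i, axis=1)               # (M, N-1, D)
    field = np.einsum("j,mjd->md", row, others)       # local field per sample
    loglik = np.sum(data[:, i, :] * field - log_local_partition(field), axis=1)
    # Assumption: a tiny epsilon keeps the norm differentiable at row = 0.
    penalty = l2 * np.sqrt(np.sum(row**2) + 1e-12)
    return -(loglik.mean() - penalty)

def infer_couplings(data, l2=0.01):
    """Fit each node separately, then symmetrize with (J_ij + J_ji) / 2."""
    n_nodes = data.shape[1]
    J = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        res = minimize(neg_node_objective, np.zeros(n_nodes - 1),
                       args=(i, data, l2), method="L-BFGS-B")
        J[i, np.arange(n_nodes) != i] = res.x
    return 0.5 * (J + J.T)

rng = np.random.default_rng(0)
toy = rng.uniform(-1.0, 1.0, size=(50, 4, 3))   # synthetic placeholder data
J_hat = infer_couplings(toy)
value_at_zero = neg_node_objective(np.zeros(3), 0, toy, 0.0)
```

At zero couplings the conditional density is uniform, so the single-node objective equals \(-D\log 2\); this gives a quick sanity check of the implementation before running the optimizer on real data.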
In the following we first rephrase the expression of the Log-Likelihood function by isolating the contribution of the ith node and compare it with the expression of the Pseudo-Log-Likelihood function, to gain a deeper understanding of the differences between them. We then calculate the gradient of the Pseudo-Log-Likelihood function with respect to \(J_{ij}\). The sum \(\sum _{\{ \mathbf{s }_i\}}e^{- \mathbf{s }_i \cdot \mathbf{A }_i(\{J\})}\) in Eq. 9 has been calculated by assuming that the values of the ith input variable can vary continuously in the interval \([-\,1,1]\), obtaining
The proportionality constant in Eq. 15, equal to the inverse of the total number of all possible \(\mathbf{s }_i\) configurations, does not influence the following derivations and will not be explicitly considered. Similarly, it is possible to write the function \(Z_{\backslash i}(\{ J\})=\sum _{\{ \mathbf{s }_{\backslash i}\}}e^{-H_{\backslash i} (\{ \mathbf{s }_{\backslash i}\}|\{ J\})}\), by exploiting the function \(\mathbf{B }_{i,k}(\{ J\})\) defined above, obtaining
By iterating this procedure to the remaining variables it is finally possible to write the partition function as the product
The Log-Likelihood function becomes
The Pseudo-Log-Likelihood function, defined in Eqs. 10 and 11, now takes the form
The difference between the Log-Likelihood function and the Pseudo-Log-Likelihood function becomes apparent when comparing Eqs. 18 and 19.
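Because each component of \(\mathbf{s }_i\) is integrated over \([-1,1]\), both the single-component local partition function, \(2\sinh (A)/A\), and the conditional average entering the gradient have closed forms. The sketch below checks them against numerical quadrature. It assumes the weight \(e^{-sA}\), as written for the sum in Eq. 9; with the opposite sign convention the partition function is unchanged and the average changes sign. The function names are hypothetical.

```python
import math
from scipy.integrate import quad

def partition_closed_form(a):
    """Single-component local partition function 2*sinh(a)/a (limit 2 at a = 0)."""
    return 2.0 if abs(a) < 1e-8 else 2.0 * math.sinh(a) / a

def mean_closed_form(a):
    """Average of s under the weight exp(-s*a): 1/a - coth(a), about -a/3 near 0."""
    return -a / 3.0 if abs(a) < 1e-6 else 1.0 / a - 1.0 / math.tanh(a)

def partition_numeric(a):
    """Integral of exp(-s*a) over s in [-1, 1], by quadrature."""
    return quad(lambda s: math.exp(-s * a), -1.0, 1.0)[0]

def mean_numeric(a):
    """Ratio of the s-weighted integral to the partition function."""
    weighted = quad(lambda s: s * math.exp(-s * a), -1.0, 1.0)[0]
    return weighted / partition_numeric(a)

test_values = (-2.0, 0.0, 0.5, 3.0)
max_partition_error = max(
    abs(partition_closed_form(a) - partition_numeric(a)) for a in test_values
)
max_mean_error = max(
    abs(mean_closed_form(a) - mean_numeric(a)) for a in test_values
)
```

For a D-component variable the local partition function is the product of D such factors, one per component, which is why no sampling is needed to evaluate the Pseudo-Log-Likelihood function or its gradient.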
We can now explicitly calculate the gradient of the Pseudo-Log-Likelihood with respect to the set of parameters \(J_{ij}\). From Eq. 12, we need to calculate the quantity \(<\mathbf{s }_i^{\mu }\cdot \mathbf{s }_j^{\mu }>_{i,\{ J\}}\). It reads (for the sake of clarity the index \(\mu\) is omitted)
The expression of \(Z_i(\{ J\})\) is reported in Eq. 15. It is possible to rewrite it, for a given index \(\gamma\), as \(Z_i\propto \frac{2\ \ \sinh (A_i^{\gamma })}{A_i^{\gamma }} \prod _{\alpha \ne \gamma }^{1,D} \frac{2\ \ \sinh (A_i^{\alpha })}{A_i^{\alpha }}\). By inserting this latter expression in Eq. 20, we obtain
and finally (see Note 4)
When we are dealing with interrelations between two different disciplines, labeled \(\gamma\) and \(\delta\), the Hamiltonian of the system, in place of Eq. (6), is \(H=-\frac{1}{2}\sum _{i,j} J_{ij}^{\gamma \delta }(s_i^{\gamma }s_j^{\delta }+s_i^{\delta }s_j^{\gamma })\). In this case, Eqs. (15), (21) and (22) should be changed accordingly. This, however, does not introduce any further complications. For example Eq. (A11) becomes
where \(A_i^{\gamma }(\{ J^{\gamma \delta }\})=\frac{1}{2}\sum _{j}^{1,N}J_{ij}^{\gamma \delta }s_j^{\gamma }\).
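The two-discipline Hamiltonian given above can be evaluated with a few lines of Python. This is a minimal sketch: the function name and the toy values are hypothetical, and the sum runs over all pairs (i, j) as written, so self-couplings are excluded only if the diagonal of J is set to zero.

```python
import numpy as np

def cross_discipline_hamiltonian(s_gamma, s_delta, J):
    """H = -1/2 * sum_{i,j} J_ij * (s_i^gamma * s_j^delta + s_i^delta * s_j^gamma).

    s_gamma, s_delta: length-N arrays with the two disciplinary components.
    J: array of shape (N, N) with the cross-discipline couplings.
    """
    s_gamma = np.asarray(s_gamma, dtype=float)
    s_delta = np.asarray(s_delta, dtype=float)
    cross = np.outer(s_gamma, s_delta)      # cross[i, j] = s_i^gamma * s_j^delta
    return -0.5 * np.sum(np.asarray(J, dtype=float) * (cross + cross.T))

s_gamma = [1.0, 0.5]
s_delta = [0.2, -1.0]
J = [[0.0, 0.4], [0.4, 0.0]]

h = cross_discipline_hamiltonian(s_gamma, s_delta, J)
h_swapped = cross_discipline_hamiltonian(s_delta, s_gamma, J)
```

By construction the expression is symmetric under the exchange of the two disciplines, so `h` and `h_swapped` coincide.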
Cite this article
Daraio, C., Fabbri, F., Gavazzi, G. et al. Assessing the interdependencies between scientific disciplinary profiles. Scientometrics 116, 1785–1803 (2018). https://doi.org/10.1007/s11192-018-2816-5
DOI: https://doi.org/10.1007/s11192-018-2816-5