Two Decades of Research Assessment in Italy. Addressing the Criticisms

Italy is the single largest country in Continental Europe to have adopted a regular and mandatory research assessment approach, involving all researchers at universities and Public Research Organizations (PROs), with impact on performance-based funding. With more than 180,000 products, evaluated by more than 14,000 referees, the 2004–2010 exercise carried out by a newly created Agency (ANVUR) was one of the largest ever carried out. It has adopted a peculiar mixed-methodology approach, using peer review in Social Sciences and Humanities (SSH) and bibliometrics in STEM disciplines. The approach has raised a number of conceptual and technical issues. In parallel a major reform of academic recruitment has introduced quantitative indicators as threshold values for candidates to the National Scientific Habilitation. This procedure has been made possible by a massive exercise of classification and rating of journals. The paper addresses the most important criticisms raised against these research assessment initiatives and checks their arguments against empirical evidence. The paper also addresses the controversial issue of unintended and negative consequences of research assessment. The final section offers some policy highlights.


Introduction
In a companion paper I have described the Italian experience of research assessment in the last two decades. Research assessment in Italy has been similar in size to the RAE/REF in the UK, but has been the most comprehensive in scope and impact. It has included a formal assessment exercise (VQR), a journal rating exercise, and the production of quantitative indicators used for the National Scientific Habilitation. Within a few years, it has affected the funding and reputation of universities, the internal allocation of resources across disciplines and departments, and the careers of researchers. Given its pervasiveness and impact, it is not surprising that it has attracted a lot of criticism.
In this paper I will try to address systematically the most important criticisms that have been published in refereed journals. I will deliberately ignore the national press (including newspapers) and the web media, with only a few exceptions. I will mainly use papers in English, for the benefit of readers. For a rich account of the debate in Italian see the introduction in Fassari and Valentini (2020), with more than 200 references. I hope to be able to do full justice to the authors and to the counterarguments offered by ANVUR.
In reviewing the criticism I will take into account an important document produced by an independent expert panel upon request of ANVUR (Group of experts, 2019). 1 This report includes a number of recommendations.
The discussion will be somewhat technical, that is, related mainly to methodology, choice of indicators, choice of algorithms and aggregation formulas. Readers more interested in high level discussions about the purposes, objectives and impact of research assessment might benefit from reading my companion paper before this one.
For any single piece of research it is very unlikely that we have found the optimal combination. When individual items are evaluated, it is common experience that the agreement among referees is very large at the opposite tails of the distribution of evaluations (i.e. for extremely brilliant and for poor products), while it is much lower in the centre of the distribution. Consequently we cannot expect a strong correlation between scores assigned by referees and bibliometric indicators at the level of individual publications. In this perspective, I find too ambitious the statement put forward by Bertocchi et al. (2015), according to which peer review and bibliometrics can be used interchangeably. This would require a high level of Cohen's kappa coefficient. But we do not need this level of agreement. What we need is to be sure that the error we make by using both methodologies, for reasons of necessity, is acceptable on aggregate with respect to the ideal situation in which the products were evaluated entirely with just one of the methodologies. What we mean by acceptable must be defined ex ante. This point has been made in a compelling way by Traag and Waltman (2019), who review the large literature on peer review-bibliometric agreement.
This perspective is taken with respect to the Italian experience in a very recent paper by Traag, Malgarini and Sarlo (2020). They exploit the fact that 10% of all journal articles submitted to the VQR 2011-2014 were also sent for peer review to two independent referees. They found that the degree of agreement between bibliometrics and peer review is actually larger than the degree of agreement between two referees. This holds for citation indicators as well as for journal-level indicators. At the aggregate level (not the individual level) the error introduced by combining peer review and bibliometrics is therefore not larger than it would be when using any other known combination of methods. I find this argument appropriate.
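To make the notion of agreement more concrete, the following is a minimal sketch of how Cohen's kappa can be computed for two sets of categorical ratings, for instance the class assigned by a referee and the class implied by bibliometric indicators. The function name and the six example ratings are purely illustrative assumptions and are not VQR data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_a)
    # Observed agreement: share of items given the same class by both raters.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal frequencies of each rater.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one and the same class throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical merit classes for six articles (illustration only).
peer_review = ["A", "B", "B", "C", "D", "A"]
bibliometric = ["A", "B", "C", "C", "D", "B"]
kappa = cohen_kappa(peer_review, bibliometric)
print(round(kappa, 3))
```

In this toy case the two methods agree on four items out of six, but kappa is lower than the raw agreement rate because part of that agreement would be expected by chance.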

Use of journal indicator
As described in Ancaiani et al. (2015) the expert groups that adopted the bibliometric methodology combined two indicators: the normalized number of citations and a journal-level indicator. A classical objection to the use of journal indicators in research assessment is that these indicators capture only the mean of the distribution of citations, while it is well known that citations are distributed in a very skewed way (Radner, 1998; Todorov and Glänzel, 1988; Glänzel and Moed, 2002; Leydesdorff, 2008). In addition the impact factor is not transparent (Seglen, 1997; Mingers and Leydesdorff, 2015), the time frame is too short and the correlation between impact factors and average citation impact of articles became weaker over time (Lozano et al. 2012). Consequently, most authors warn against the use of journal indicators for the evaluation of individual researchers (Moed and van Leeuwen, 1996; van Leeuwen and Moed, 2005; Jarwal et al. 2009; Marx and Bornmann, 2013; for a recent and updated review see Larivière and Sugimoto, 2019). Based on these arguments, the use of journal indicators in the bibliometric evaluation of VQR has been criticised in general terms, although there is no published article addressing the potential distortions of its use.
In order to examine whether the inclusion of journal-level indicators has produced serious distortions it is important to check whether (i) it is combined with other indicators; (ii) it is used for individual evaluation.
First of all, journal indicators are used jointly with normalized citations. These two indicators are assumed as (imperfect) signals of the quality of research, as reflected in the practice of the relevant research community. The final score is constructed automatically if the signals are strongly correlated, while it is assigned by experts (peer review) if the signals show little correlation.
Second, while the unit of analysis is the individual product submitted by universities, upon initiative of individual researchers, it is clear that the unit of evaluation is not the individual researcher at all. It is an aggregate of researchers: the scientific field (Settore Scientifico Disciplinare, or SSD in the Italian administrative language), with a minimum of 4 researchers in order to preserve confidentiality, the department, the university or PRO.
Following the recommendations of the literature, journal-level indicators have not been used for the evaluation of individual researchers. In the procedure for the assessment of individual researchers (National Scientific Habilitation) the bibliometric indicators used by ANVUR in STEM fields do not include journal impact factors. Candidates are admitted to the qualitative evaluation by a national committee of full professors if they exceed a threshold in a combination of the number of indexed publications, the count of citations, and the h-index. Journal indicators are not mentioned at all.
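As an illustration of how such article-level thresholds work without any journal indicator, here is a minimal sketch that computes the three indicators from a list of citation counts and checks them against thresholds. The threshold values, the strict "greater than" comparison and the `min_passed` parameter are assumptions made for the example; they are not the official ANVUR values or rules.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def passes_thresholds(citations, thresholds, min_passed=2):
    """Count how many of the three indicators exceed their threshold.

    `thresholds` and `min_passed` are hypothetical inputs for illustration.
    """
    indicators = {
        "publications": len(citations),
        "citations": sum(citations),
        "h_index": h_index(citations),
    }
    passed = sum(indicators[name] > thresholds[name] for name in indicators)
    return passed >= min_passed, indicators


# Hypothetical candidate: one citation count per indexed publication.
candidate = [25, 8, 5, 3, 3, 0]
hypothetical_thresholds = {"publications": 5, "citations": 50, "h_index": 2}
admitted, values = passes_thresholds(candidate, hypothetical_thresholds)
print(admitted, values)
```

Note that every input is a property of the candidate's own articles: nothing in the calculation depends on where the articles were published.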
Why did ANVUR include journal-level indicators, given their controversial nature? It was considered that journal indicators were useful for very recent articles, for which the citation window would be too short to formulate a fair judgment. In addition, maintaining journal impact factors as elements of the VQR evaluation was intended as a way to suggest to young researchers the importance of publishing in good international journals, rather than splitting papers into lower-level ones. 3 Having said that, I recognize that there is a large international debate on the advantages and disadvantages of using journal-based indicators. On this point it is important to keep the debate open and balance the arguments carefully, without taking strong a priori positions. A couple of recent papers reopened the debate, bringing evidence in favour of the use of journal-based indicators (Traag, 2019; Waltman and Traag, 2019).
The independent Expert group advised ANVUR to eliminate journal-level indicators from the bibliometric methodology (Expert group, 2019). The arguments are as follows.
3.7 The bibliometric algorithm that combines these indicators is very mechanical. The choice of databases is left to the submitter, and the relative weight of journal impact factors and article cites is estimated within database journal classifications (which do not coincide with VQR assessment clusters, and may or may not be unambiguously meaningful) and item publication date. These features enhance impartiality, but make the results somewhat opaque and hard to interpret: these mechanical criteria are quite complex, with different weights and several parameters so that the final formula may be perceived as arbitrary.
4.3.10 For sub-disciplines, or sub-GEVs where citation numbers are appropriate, IRAS1 could be simplified and probably optimised by using solely citation numbers, where ranking into the top percentiles will be defined objectively for each sub-discipline or sub-GEV. The use of straight citation counts thus avoids the use of Journal Impact Factors, which are a marker of the average impact of the journal and does not add to the impact of specific publications.

Aggregation algorithm
In STEM fields the class of merit is assigned by combining percentiles of the world distribution of citations in the same Subject Category and percentiles of journal indicators.
The combination of multiple indicators creates an obvious technical problem of aggregation. As a starting point, it would be important to state that aggregation may take place according to a variety of criteria, none of which can claim universal validity. It is a matter of design of a procedure according to stated principles, implementation, and ex post examination of statistical properties.
The main approach of ANVUR, as already stated, was to leave the expert panels free to assign different weights to citations and journal indicators in defining the final score for those articles for which there was no correlation between the indicators. As described more in detail in Ancaiani et al. (2015) and Anfossi et al. (2016) and summarized in , experts may give more weight to journal-level indicators or to normalized citations in the formula that combines the two metrics in order to assign a class of merit, hence the final score for the item. This choice reflects the overall practice in the scientific community and is justified in the final Report of each GEV.
A serious difficulty that ANVUR had to face was that the Ministerial decree opening the first VQR gave detailed indication of the classes of merit (see Table 1 of the companion paper for details) and of the quantiles associated to each of them. Following this prescription, the square space defined by the percentiles of the two indicators was partitioned into a series of squares whose boundaries were defined by the percentiles of the final classes of merit. These quantiles were not uniform (e.g. quartiles), but variable in their range: in particular, a large region of the distribution (below the median) was to be evaluated as Limited. Under these circumstances, it is easy to demonstrate that small changes in one of the indicators would result in large changes in the final classification. This objection was raised shortly after the first VQR (2004-2010).
The procedure has been the object of criticism (Abramo and D'Angelo, 2015; Franceschini and Maisano, 2017). Since the initial procedure (2004-2010) forced the assignment of discrete classes to each of the criteria, the variability of the final score was large for borderline cases.
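The borderline problem can be illustrated with a minimal sketch. It assumes percentiles measured from the bottom of the distribution (100 = best), illustrative class boundaries in which everything below the median is Limited, and a simplified rule in which the class is assigned automatically only when the two indicators fall in the same class. The boundaries, labels and function names are assumptions for illustration, not the official VQR matrix.

```python
# Illustrative cut-offs only: the wide "Limited" band below the median follows
# the text; the remaining boundaries and labels are assumptions.
CLASS_BOUNDS = [(80.0, "A"), (60.0, "B"), (50.0, "C"), (0.0, "Limited")]


def merit_class(percentile):
    """Map a percentile (0 = worst, 100 = best) to a discrete class."""
    if not 0.0 <= percentile <= 100.0:
        raise ValueError("percentile must lie in [0, 100]")
    for lower_bound, label in CLASS_BOUNDS:
        if percentile >= lower_bound:
            return label
    raise AssertionError("unreachable: the last bound is 0")


def discrete_outcome(citation_pct, journal_pct):
    """Automatic class if both indicators agree, otherwise peer review."""
    citation_class = merit_class(citation_pct)
    journal_class = merit_class(journal_pct)
    return citation_class if citation_class == journal_class else "peer review"


# A one-point change in the citation percentile changes the outcome.
print(discrete_outcome(50.5, 55.0))  # both indicators in class C
print(discrete_outcome(49.5, 55.0))  # indicators now disagree
print(discrete_outcome(49.5, 49.0))  # both below the median
```

Two articles that are practically indistinguishable on both indicators can thus end up with different treatments simply because one of them sits on the other side of a class boundary.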
ANVUR addressed this issue in the second VQR (2011-2014) by modifying the aggregation algorithm so that the final score is a continuous, not discrete, number. The modification is illustrated in Anfossi et al. (2016). 4
My interpretation of the debate is as follows. I find a logical contradiction in some of the criticisms. On the one hand, it is regularly argued that bibliometric evaluation should make use of multiple indicators, and avoid the reliance on single indicators. I find this position well grounded in the literature and methodologically sound (AUBR, 2010; Setti, 2013; Moed, 2007). On the other hand, however, once one starts to aggregate multiple indicators into a single evaluation, the argument is raised that any transformation would create unacceptable distortions, whatever the order of magnitude of the error. In particular, the arguments of Franceschini and Maisano (2017a) made reference to a paper by Thompson (1993). In this paper it is argued that if percentile ranks are not based on equal scales (which is not the case in our context), then percentiles should never be added or averaged, if one wants to avoid large distortions. In the case of VQR, percentile ranks from citations and from journal indicators are neither added nor simply averaged, but combined in a weighted way, controlling for the resulting error. They are taken as signals of quality, divided into classes of intensity of the signal. The final score is the result of a broad convergence (same class of merit, or intensity of the signal), or the result of a weighted appreciation of the informativeness of the signal (slightly different classes of merit for the two indicators). By modulating the parameters of the aggregation algorithm (see Anfossi et al. 2016 for discussion) it is possible to minimize the errors of classification. It is a matter of measurement, not of admissibility of the procedure. If the signals are contradictory, peer review is adopted.
3 [...] of impact in those cases where the paper is too young, field normalizations are not taken into account or if autocitations strongly affect the final evaluation" (Anfossi et al. 2016, 673).
4 In a nutshell, the overall square space defined by the range of the percentiles for the two indicators has been divided into oblique strips whose slope is a function of the weights assigned to them by the expert panel. Franceschini and Maisano (2017a; 2017b) criticized this solution, arguing that any non-linear transformation of the two indicators leads to a distortion of the statistical properties of the underlying indicators. A reply to this criticism calculated the proportion of articles that might be misclassified by using the new algorithm (see Benedetto and Setti, 2017 for technical details), concluding that the effect can be estimated at 0.05% of the articles that could be submitted by Italian authors, that is, negligible.
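The idea of oblique strips can be sketched as a weighted combination of the two percentiles, where the weight chosen by the expert panel determines the slope of the class boundaries. The following is a simplified illustration and not the algorithm of Anfossi et al. (2016): the cut-offs, the `citation_weight` parameter and the `max_gap` rule for sending contradictory signals to peer review are all assumptions.

```python
# Illustrative score cut-offs (assumptions, not the official VQR values).
SCORE_BOUNDS = [(80.0, "A"), (60.0, "B"), (50.0, "C"), (0.0, "Limited")]


def weighted_score(citation_pct, journal_pct, citation_weight):
    """Weighted combination of two percentiles (0 = worst, 100 = best)."""
    if not 0.0 <= citation_weight <= 1.0:
        raise ValueError("citation_weight must lie in [0, 1]")
    for value in (citation_pct, journal_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("percentiles must lie in [0, 100]")
    return citation_weight * citation_pct + (1.0 - citation_weight) * journal_pct


def strip_outcome(citation_pct, journal_pct, citation_weight=0.5, max_gap=40.0):
    """Return (class, score); strongly contradictory signals go to peer review."""
    score = weighted_score(citation_pct, journal_pct, citation_weight)
    # Hypothetical rule: a large gap between the signals means they contradict.
    if abs(citation_pct - journal_pct) > max_gap:
        return "peer review", None
    for lower_bound, label in SCORE_BOUNDS:
        if score >= lower_bound:
            return label, score
    raise AssertionError("unreachable: the last bound is 0")


# A one-point change in the citation percentile moves the score only slightly.
print(strip_outcome(50.5, 55.0, citation_weight=0.6))
print(strip_outcome(49.5, 55.0, citation_weight=0.6))
print(strip_outcome(95.0, 20.0))  # contradictory signals
```

With a citation weight w, each class boundary is a line of constant score in the (citation, journal) plane with slope -w / (1 - w), so the panel's choice of weight tilts the strips. Because the score is continuous, a small change in one indicator produces only a small change in the score.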
In using multiple indicators a balance must be defined between sophistication and transparency. The combination of two bibliometric indicators enlarges the basis for judgment. While for individual articles there is some risk of misclassification, the error is very small at the aggregate level. The issue of aggregation would be automatically solved if the use of journal-level indicators were eliminated in the next VQR 2015-2019.
The independent Expert group (2019) suggested eliminating the classes of merit and using instead five quality criteria, each with 8 points available for evaluation. They also recommended using one method only within any GEV, to be indicated in the Call. In the case of bibliometrics, they recommended using only one database, again to be indicated in the Call.

Selection of products vs full production of scholars
A few authors argued that it would be better to evaluate all products in a given time window, rather than a sample of self-selected publications (Abramo, D'Angelo and Di Costa, 2014). The criticism was based on two arguments. First, submitting only 3 or 2 items per capita results in a compression of the variability of performance of researchers. Second, researchers may be unable to select their best papers.
The two arguments have a grain of truth but do not stand up to further reflection. It is certainly true that highly productive researchers publish much more than 3 or 2 papers. However, evaluating all products is not feasible. The first obstacle is obvious: bibliometric evaluation can be done reliably only for STEM disciplines. It would be difficult to defend the idea that scholars in STEM are evaluated on their full production, while scholars in SSH are evaluated only for a subsample.
The second obstacle is that for non-indexed items there is no authoritative source of classification that would make it possible to carry out peer review only on scientific items. The cost of submitting to peer review the entire production of scholars in SSH would be huge. As a matter of fact, the only country that has submitted all products to evaluation is Australia, which decided to evaluate only STEM fields.
As to the second objection, it is true that scholars may make mistakes in their submissions. But, again, several considerations make this issue irrelevant. First, the Italian VQR is by design and legislative mandate an exercise in which all researchers must be involved. The early experience of VTR, in which individual researchers had no voice in choosing the products, was not considered positive. With some hindsight, making individual researchers responsible for the selection of products has been an important step for the legitimation of the exercise. Compare this with the recurring controversies in the UK about the filtering role of departments and universities in the decision about who will be subject to evaluation (Sayer, 2015).
Second, researchers may learn about their best papers. Once the main evaluation criteria have been established, researchers learn rapidly which products are to be submitted. As a matter of fact, several researchers and even a university developed software programs (available as Open Source) to advise scholars about their submission choices.
Finally, in all fields in which there are several co-authors, there is a major decision, to be made at university level, in order to avoid double-counting (i.e. the same co-authored paper may be submitted by all affiliations involved, but not twice within the same affiliation). This means that there will be some negotiation between researchers and their Research offices, with the latter adopting an objective function defined in terms of maximization of the overall institutional score.
Summing up, the criticism is not realistic in the context of VQR. It remains true, however, that a major goal for research assessment would be the construction of an official repository of all publications of all researchers. As a matter of fact, ANVUR fully recognized the importance of submitting to assessment the entire scientific production of researchers. Since 2009 Italian legislation has included a provision for the full publication of metadata on all products of researchers affiliated to universities, in a National Register of publications. 5 This infrastructure was in the mandate of the Ministry of University and Research (MIUR), since the law delegated its implementation to the Ministry. The infrastructure could benefit from the legacy of the so-called loginmiur. As described in the companion paper, this is a platform managed by a large IT consortium of universities (CINECA), in which all researchers deposit the metadata of all their publications. The platform is routinely used by the Ministry for administrative procedures (e.g. submission of proposals) but it is not open to the public. Immediately after its creation, ANVUR approved a document 6 in which it proposed a framework for the creation of the Register based on loginmiur. It also started a series of technical meetings to accelerate the disclosure of all information in the platform, as the first step towards a validated Register. It turned out that the official obstacle to publication was the need to define the privacy profile of researchers, after the recent introduction of strict legislation. As a matter of fact, this legislative provision is still waiting for implementation, after 11 years.
In the same line of engagement, ANVUR developed a proposal for an experimental facility for collecting metadata in SSH fields, covering non-indexed journals, particularly in the national language. A task force 7 developed a technical feasibility proposal, which was shared with the main academic publishers and the largest digital platform for academic journals (Torrossa). The proposal was aimed at establishing a pilot digital platform that would have aggregated and made retrievable the tables of contents of a large number of journals in the national language and would have extracted the metadata automatically, using machine learning techniques. In this way it would have been possible, after a few years of experimentation, to track the entire journal production of Italian researchers in those fields not covered by bibliometric databases. The proposal included a provision for experimental research on citation extraction from non-indexed sources, for which technical tests had been done with promising results using text mining techniques.
It is useful to remark that this initiative was taken after the feasibility study proposed by Henk Moed (Moed et al. 2009; see also Hicks and Wang, 2009), but much earlier than similar ideas were popularized at European level, particularly by the COST initiative ENRESSH. 8 As a matter of fact, large bibliographic databases in SSH are available mainly in small European countries (Sile et al. 2018), so that the Italian initiative would have been the first one for a large European country. Even today no comparable open repositories are available in large European countries. CRIS systems are still not widespread.
The proposal was then discussed in a large conference, held in Rome in 2013 with the editors of academic journals and the scientific societies in all SSH fields. To the surprise of the ANVUR organizers, there was a well-organized and strongly negative reaction by scientific societies in legal disciplines. The argument was that the proposal was a hidden attempt to introduce bibliometrics into the SSH fields. After collecting metadata, the argument went, ANVUR might have been in a position to start bibliometric analysis against the opinion of scholars. There were even arguments about the legal obstacles to collecting data on the publications of scholars.
Summing up, the proposal of submitting all publications of scholars to evaluation is valuable in principle, but it must overcome several serious obstacles to become practical. It is a good argument in general, while it cannot be used to dismiss the VQR experience.

Cost of the exercise
Another issue that was raised after the start of the VQR was the total cost of the exercise. Geuna and Piolatto (2015) estimated the total cost of the first VQR at approximately 10 million euro (excluding the costs for universities), a level which is in line with the UK experience. In the UK, however, the research assessment is responsible for a larger share of performance-based funding than in Italy. The authors therefore recommended increasing the share of government funding allocated according to merit criteria, in order to justify the expenditure for the assessment procedure.
A careful analysis of the costs of VQR, compared with the REF, has been carried out in the context of the 2018 biannual Report on the State of the research. 9 It estimates a cost eight times smaller than that of the REF.
Another estimate, which circulated widely in the media, was proposed by Sirilli (2012). 10 This author assumed that the shadow cost of referee work for the peer review exercise was represented by the average pay at the European Commission level. The resulting estimate is an astonishing total of 300 million euro. This paper is a nice example of deeply flawed reasoning. On the one hand, it is hard to believe that the opportunity cost for all researchers, in all fields, for all working days of the year, is equivalent to the daily fee at the European Commission. If this were true, then most academic activities would be immediately halted, since they do not provide this level of income at all.
On the other hand, the reasoning violates simple economic principles. All referees of the VQR were asked whether they were prepared to carry out peer review in exchange for a payment of 30 euro. Very few declined. More than 14,000 referees accepted and carried out the work. If all these people considered that their opportunity cost was 450 euro per day, then the vast majority of them would have declined altogether an offer of 30 euro per paper. If most of them accepted it is because they considered the payment adequate. In economic terms, if they were free to accept the offer, their acceptance is a clear indication that they consider the benefits they receive as comparable to their effort. Their revealed preferences, economists would say, mean that the value of their effort is equal to the price received (plus maybe unobserved intrinsic value).
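The orders of magnitude in this argument can be checked with simple arithmetic. The sketch below uses the two figures quoted in the text (30 euro per paper and 450 euro per day) and assumes an eight-hour working day, which is not stated in the source, to compute how much time per paper the fee would cover if the daily rate were the true opportunity cost.

```python
FEE_PER_PAPER_EUR = 30.0     # fee offered to VQR referees (from the text)
DAILY_RATE_EUR = 450.0       # daily opportunity cost discussed in the text
WORKING_HOURS_PER_DAY = 8.0  # assumption, not stated in the source


def break_even_minutes(fee, daily_rate, hours_per_day):
    """Minutes of work per paper that the fee covers at the given daily rate."""
    if daily_rate <= 0 or hours_per_day <= 0:
        raise ValueError("daily_rate and hours_per_day must be positive")
    return fee / daily_rate * hours_per_day * 60.0


minutes = break_even_minutes(FEE_PER_PAPER_EUR, DAILY_RATE_EUR, WORKING_HOURS_PER_DAY)
print(f"{minutes:.0f} minutes per paper")
```

Under these assumptions the fee covers roughly half an hour per paper. A referee who valued time at the daily rate and expected a careful review to take longer than that would be expected to decline.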
Needless to say, the collaboration of referees was not mainly the result of economic calculations, but most likely the consequence of a feeling of academic obligation and institutional compliance. What does it mean for the estimate of the cost of VQR? According to Sirilli the time of referees is taken away from research and must be evaluated separately as an additional cost. If this argument is true, why are researchers all over the world doing peer review for free? Are they detracting from their research duties? The argument does not stand. Peer review is an integral part of the academic work. Referees learn a lot in doing peer review and have the opportunity to have an impact on the direction of science. The overall cost estimation is therefore grounded on wrong economic assumptions.

Journal rating
As stated in the companion paper, in order to produce the thresholds of indicators for the admission of candidates to the National Scientific Habilitation, ANVUR had to classify journals into scientific and non-scientific, and to create a list of A-class journals, to be used for SSH disciplines. Most of these journals were in the national language and were not indexed in the citation indexes Web of Science or Scopus.
As is well known, journal rating is a controversial issue in the literature. While most authors agree on the notion that scientific journals are different from non-scientific ones, the ranking into merit classes is contested. According to some authors, journal rating is a source of conformism that depresses original and unorthodox research (Willmott, 2011; Alvesson and Sandberg, 2013; Mingers and Willmott, 2013; Mingers and Yang, 2017). According to Rafols et al. (2012) journal rating inhibits interdisciplinary research. The Italian experience came after the demise of the journal ranking experiences in Australia and France (Pontille and Torny, 2010). At the same time there were promising experiences in Spain (Giménez-Toledo et al. 2007; 2013) and in several other countries, and the expert-based rating of journals was recommended in the literature as a suitable alternative to bibliometrics (Nederhof and Zwaan, 1991; Nederhof, Luwel and Moed, 2001; Hicks and Wang, 2011).
On the practical side, the ranking was anyway mandated by the Ministerial decree with a fixed deadline and there was no time to enter into an extensive consultation. ANVUR adopted an expert-based classification, in which the opinions of learned societies were a necessary input. The overall approach was reputational, not indicator-based. It was felt that the overall scientific community had a certain agreement on those journals that are essential to the advancement of the disciplines. The expert panel could make use of referees' opinion and use their judgments to support a decision. All learned societies were asked to produce a list of A-rated journals. In the absence of a mandatory upper limit on the total number of journals or percentage of the total in the A class, many learned societies suggested a very large proportion of the total. The short time window made it difficult for some learned societies to contribute. As a matter of fact, some of these opinions were missing while others could not be accepted. The expert panel convened by ANVUR took the responsibility to draft the final decision. Several mistakes were soon discovered.
The publication of journal lists generated a wave of discussion in the academic community. It soon appeared that there was a need to strengthen the dialogue with learned societies and to make the procedure of journal classification open to appeal and renewal. In 2013 a new regulation was issued, under which editors of journals could submit new candidatures and appeal against exclusion from class A, providing new evidence about the reputation of the journal in the scientific community. It was stated that the submission of candidatures would be an annual procedure. The process started and several new expert panels were nominated, replacing the initial one. A massive process of examination of new submissions started again in 2013 and was kept open at regular intervals.
The issue became even more sensitive in 2016, given that the new procedures for the Habilitation made the thresholds much more relaxed, but asked candidates to exceed two out of three thresholds without exceptions. While in the 2012 procedure the members of the committee could choose to depart from the median values by publishing a motivation before the examination of candidates (art. 6, Ministerial Decree n.76/2012), this provision was eliminated in the new procedure. As discussed in more detail in , in practice all candidates had a keen interest in the boundaries of the A-class, given that small changes could easily reverse the admissibility status. A number of legal actions were taken. This provision forced a process of administrative bureaucratization. An immediate consequence was that the new expert panels, to be nominated in 2017, could not be chosen by ANVUR but were selected after an open call procedure for self-candidature. All steps were subject to the strict requirements of administrative law and were designed with an eye to avoiding legal actions.
In evaluating the practice of journal rating an important test is whether the rating of a journal is a good predictor of the quality of the articles that are published in it. This is the classical problem of journal-based indicators, but here we entirely miss the citation-based indicators at article level. An interesting opportunity was offered by the parallel deployment of VQR and Habilitation. As stated above, the VQR 2004-2010 started in 2011 and was published in 2013. During 2012, therefore, members of the GEV in the non-bibliometric fields were requested to carry out peer review on journal articles submitted by all researchers, irrespective of their academic rank. Exactly in the same period the expert panel was drafting the list of A-class journals for the Habilitation. Within the GEV activities a procedure for journal rating was implemented with the aim to provide external referees with additional information, following the informed peer review approach. A small number of top journals were selected by the GEVs, after consultation with learned societies. These lists were also made available to the expert panels in charge of journal ratings within the Habilitation procedure.
It was then possible to examine to what extent the score assigned to individual articles was correlated with the merit class of the journal. Bonaccorsi et al. (2015) and Ferrara and Bonaccorsi (2016) provided evidence that, controlling for other factors (language of the journal, disciplinary field, academic rank of the author), the probability of receiving an Excellent score was twice as large for articles published in A-rated journals as it was for papers in other journals. This argument was criticized by Baccini (2016), according to whom a regression model is not appropriate given that the variables are not independent. Referees who knew about the rating assigned by the GEV were influenced in evaluating individual articles, so that the evaluation at article level is not independent of the evaluation at journal level. Bonaccorsi et al. (2018) re-examined the issue and offered further arguments. First, the set of A-class journals under the Habilitation was much larger than the small set used for the VQR procedure, three times larger. Second, not all journals rated in A class for the VQR were also rated for the Habilitation. Third, referees under the VQR were instructed to formulate their judgment only after careful reading of the article, using the information on the A-class only as supporting information. Under the VQR researchers submitted a small selection of their best articles, while for the Habilitation they submitted all their publications (or a large selection). This means that the proportion of articles from top journals was by definition larger for the VQR. Referees had the task of evaluating whether articles published in top journals were indeed excellent, or only good, or even only adequate. Experienced referees know very well that within top journals one can find by definition articles of better average quality, but also, more or less occasionally, articles of lower quality or even poor ones.
If one were to believe that referees just rate mechanically as excellent all articles published in A-rated journals (whatever the size of the A class) then it would be meaningless to use peer review. Summing up, Bonaccorsi et al. (2018) maintained that between the variables at journal and individual level there was sufficient independence to warrant the regression approach and rejected the argument by Baccini (2016).
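The "twice as large" comparison discussed above can be illustrated with a minimal sketch. The counts below are invented purely for illustration and are not taken from the VQR data. The function computes a raw, unadjusted ratio of shares, whereas the estimates reported in the cited studies come from regression models that control for language, field and academic rank.

```python
def relative_probability(excellent_a, total_a, excellent_other, total_other):
    """Ratio between the share of Excellent scores in A-rated journals
    and the share of Excellent scores in all other journals."""
    share_a = excellent_a / total_a
    share_other = excellent_other / total_other
    return share_a / share_other


# Hypothetical counts (assumption, for illustration only):
# 60 of 200 articles in A-rated journals scored Excellent,
# 120 of 800 articles in other journals scored Excellent.
ratio = relative_probability(60, 200, 120, 800)
print(round(ratio, 2))
```

A ratio of this kind says nothing by itself about the independence problem raised by Baccini (2016): if referees were influenced by the journal rating, the same ratio would be observed.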
Journal rating is still in place. In a publication landscape with more than 15,000 journals in SSH, as witnessed by the initial loginmiur journal list, it was important to introduce some criteria for quality. When the exercise started, very few journals in SSH had a formal practice of ex ante peer review. After a few years many more journals have adopted the peer review procedure. The competition for entering the A-class motivated many journals to improve editorial policies, boards, and selection criteria. An argument often raised against journal rating is that it might induce conformism and orthodoxy, preventing the birth of new journals or excluding journals outside the mainstream. After several years of experience with procedures for the admission of new journals and for the revision of rejection decisions, I can add a skeptical note on this argument. Judging from the lists of journals published on the ANVUR website, the entry of new journals in the last few years seems to have been continuous. More research is clearly needed for a more balanced assessment.

Academic promotion
As largely discussed in the companion paper, one of the main effects of the introduction of research assessment has been the utilization of quantitative indicators as admission thresholds for the habilitation of candidates for academic promotion. In that paper I have illustrated the context of the reform, characterized by a long tradition of lack of transparency, in which promotions by seniority, irrespective of research performance, have traditionally been widespread and the transparency of promotion criteria has been modest, if any. I refer to the companion paper for references.
The 2010 reform has drastically changed the overall landscape of academic careers, introducing a new system called Abilitazione Scientifica Nazionale (ASN), or National Scientific Habilitation, and placing a heavy pressure on the research record. The introduction of thresholds based on the median value of the distribution of indicators was criticized with harsh comments, within a more general argument against metrics. After a few years it is possible to examine whether the introduction of indicators has damaged the academic system and, more generally, what has been the overall impact. In doing so several authors have exploited an unusual level of administrative transparency, insofar as all the documentation of promotion procedures is publicly available (CV and list of publications of candidates; selection criteria adopted by the committee; individual judgments).
I examine here the impact of the new procedure in terms of mitigation of some of the problems that have afflicted the Italian academic promotion system in the past and that have motivated the 2010 reform, that is promotion by: (i) academic connection; (ii) seniority; (iii) non-publication criteria; (iv) gender.

Promotion by connection
In an ideal world, the existence of personal connections between candidates and the members of evaluation committees should not enter into the promotion decision. In reality, a large literature shows that this is not the case, with connections creating an advantage for some of the candidates.
Exploiting the feature of the Habilitation procedure according to which the composition of evaluation committees and the list of candidates are published on the ASN website, Bagues, Sylos Labini and Zinovyeva (2019) have studied the role of connections in academic promotion. A connection is defined as the presence in the committee of a co-author, a colleague (same university) or a PhD advisor of the candidate. Given the tradition of favoritism largely discussed in the companion paper (which however was also found in other countries such as France and Spain), the expectation of the study was to find a large connection premium, that is, a significantly larger probability of promotion for connected candidates. The conclusions are surprising: "We find that connected candidates are 4.6 p.p. (13%) more likely to qualify. Instead, Zinovyeva and Bagues (2015) find that in the Spanish system of national qualification evaluations, where evaluation reports are not publicized, the (exogenous) presence of a connection in the committee increases candidates' chances of qualifying by around 50%. Similarly, the work of Perotti (2002) suggests that the impact of connections was significantly higher in the evaluation system that was in place previously in Italy" (Bagues, Sylos Labini and Zinovyeva, 2019, 96). In other words, the Habilitation system has apparently curbed one of the most lasting attitudes of Italian academia.
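The two figures in the quotation (4.6 percentage points and 13%) jointly imply a baseline qualification rate for non-connected candidates. The short sketch below only makes that arithmetic explicit. The function name is hypothetical, and since the 13% is a rounded figure the implied baseline is approximate.

```python
def implied_baseline(abs_gain_pp, rel_gain_pct):
    """Baseline rate (in %) implied by a gain stated both in
    percentage points and as a relative percentage increase."""
    return abs_gain_pp / (rel_gain_pct / 100.0)


baseline = implied_baseline(4.6, 13.0)   # rate for non-connected candidates
connected = baseline + 4.6               # rate for connected candidates
print(f"{baseline:.1f} {connected:.1f}")  # prints "35.4 40.0"
```

On these figures, roughly 35 out of 100 non-connected candidates qualify, against roughly 40 out of 100 connected ones.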

Promotion by seniority
Another distortion that was frequently denounced is the promotion of academic staff who are no longer active in research, only on the basis of seniority. Marini (2017) examined the results of the Habilitation procedures in physics, engineering, economics and law in order to verify whether the seniority of candidates (i.e. the number of years in the current rank) still plays a large role, as in the past. His conclusions are clear: "Succinctly stated, these (…) findings reveal a system which tends to favor early careers and good publication records regardless of years of service the individual has notched up in his/her current rank" (Marini, 2017, 202). More precisely, in the case of full professors, "despite some disciplinary differences, generally one or two indicators out of three clearly act as the main determinant of the decisions to award or not to award eligibility to apply for a full professorship. Hence, performance, especially in terms of quality and strategic scientific publications, is the key factor in pushing on and climbing the academic career ladder to the top. Furthermore, younger scholars in each position can bypass their older peers, even with the same indicators of productivity" (Marini, 2017, 202).
It seems that the new system has reduced the role of seniority and has placed a premium on the ability of young candidates to produce good research in their early years. Taking into account the long tradition of promotions by seniority, this seems to be an important result.

Promotion by non-publication criteria
The new system places large emphasis on transparency of promotion criteria based on publications and on the publicity of the CV and the list of publications. This places a severe constraint on the adoption of non-publication criteria, which in the past were used with more discretionary power. Poggi et al. (2019) have examined the entire collection of CVs of candidates to the 2012 ASN procedure (n = 59,149 candidates, with 1,910,873 papers). Using several machine learning techniques and a dedicated ontology, they identify as many as 291 predictors, or attributes of candidates described in CVs that the members of the committee may have used to make the promotion decision. Their model outperforms the state of the art in correctly predicting the final decision. Among the top 15 predictors we find a small set that describes the career of candidates (affiliation, age, maximum number of years with affiliation to the same university, years since the first publication), but the bulk of criteria are publication-related, describing the articles, the journals and journal categories, and the Impact Factor. Interestingly, the number of citations does not appear among the top criteria. The continuity of publications, on the contrary, as measured by the number of years without any publication, does appear among the top criteria, confirming the attention of committees to being active in research as a condition for promotion.

Promotion by gender
The notion that academic promotion is affected by gender bias has been examined by a large literature, which I cannot review here. I am interested in understanding whether a more transparent system such as ASN has mitigated the large gender discrimination that has long afflicted the Italian academic system (and that of other countries as well). A few papers have addressed this issue.
De Paola and Scoppa (2015) have shown that promotion committees that include a woman are more likely to promote women, controlling for research productivity. Bagues, Sylos Labini and Zinovyeva (2017) find a more complex effect, in which women in the promotion committee give higher scores to female candidates, but having a woman in the committee makes the male members more severe against women. According to Marini and Meschitti (2018) gender discrimination is still large, as men were around 24% more likely to be promoted to full professor in the 2013-2016 period, for equal scientific production. This effect, however, is not due to the ASN procedure, but to the downstream decentralized process of promotion decided by departments among competing candidates, all having the habilitation. In other words, "evidence tells there is less gender discrimination at ASN level, and substantially more gender discrimination at the promotion level" (Marini and Meschitti, 2018, 1002). This effect is confirmed by Filandri and Pasqua (2019), who study the probability of career advancement (in both ranks of associate and full professor) in the period 2012-2016 for those professors who were accredited in 2012 or in 2013. They find that "on average, female assistant professors have a probability of advancement to associate professorships which is 8 percentage points lower than their male colleagues. This difference increases to 17 percentage points when we consider associate professors' probability of becoming full professors" (Filandri and Pasqua, 2019, 12). In other words, while the ASN procedure has reduced the gender discrimination effect, the effect is reproduced at decentralized level, for which the degree of transparency is lower.
Summing up, it seems that the introduction of indicators has reduced the weight of non-academic factors such as seniority and personal connections, has increased the importance of research productivity along the entire career, and has mitigated the gender discrimination at national level. The discrimination did not disappear, however, in the second stage of the procedure at departmental level.

Behavioral impact
There is little doubt that, given the pervasiveness and the impact of research assessment at institutional level, it has influenced the behavior of researchers. This is even more so given the joint introduction of evaluation at university level and of indicators in the recruitment process. A trickle-down effect is certainly in place (Aagaard, 2015).
According to Moed (2007) and Mingers and Leydesdorff (2015), in the application of bibliometric indicators for research evaluation it is important to put the issue of behavioral impact on the agenda. After being created, indicators take on a life of their own. Users of indicators throw away the instructions for use and select quick-and-dirty information for their convenience (van Raan, 2005). In turn, the very existence of quantitative indicators has large implications for the behavior of researchers. Indicators create incentives and disincentives, recommend some behaviors and discourage others (Burrows, 2012; Dahler Larsen, 2014). A recent literature has drawn attention to the unintended consequences (Weingart, 2005) on the behavior of researchers. Examples of negative impact include goal displacement (i.e. aiming at meeting indicators, not producing valuable knowledge), selection of communication channels that are not consistent with the scientific community (e.g. publishing in English journals instead of writing books in national language), or lack of integrity (e.g. engaging in gift or coercive authorship, gaming with citations, and the like) (Laudel and Gläser, 2006; Van Dalen and Henkens, 2012; Hammarfelt and de Rijcke, 2015; De Rijcke et al. 2016; Muller and De Rijcke, 2017). Following this critical literature a few authors have argued that the introduction of research assessment in Italy has produced negative behavioral consequences. Baccini, De Nicolao and Petrovich (2019) show that after the introduction of the research assessment the number of citations by Italian authors to other Italian authors increased significantly and much more than in other countries. They interpret this evidence as a perverse effect of evaluation, leading researchers to inflate artificially their citations in order to favour colleagues (and, implicitly, expecting reciprocity).
Overall, I find that this paper falls short of providing compelling evidence and rigorous counterfactual reasoning. It is not at all demonstrated that the increase in citations to Italian colleagues is due to perverse gaming with indicators. First, according to their data the process started around 2010, i.e. before or shortly after the introduction of research assessment. Thus causality assumptions are weak. Second, making gift citations to other colleagues may be risky if the citing and the cited authors are in competition in promotion procedures. It is not clear why Italian researchers should be generous with their competitors. Third, it is likely that the overall internationalization and quality of Italian researchers increased in the period (perhaps as an effect of research assessment?), so that part of the citations are genuine recognition of merit, not gaming. The paper suggests a strong causality effect, without demonstrating it.
A more controlled counterfactual approach is taken by Seeber et al. (2019) in showing the increase in self-citations in four disciplines as a consequence of the introduction of bibliometric indicators in the National Habilitation. Their results are credible and point to gaming with citations when candidates are at the borders of classes of merit. At the same time, one might argue that the convenience of gaming depends very much on the expected outcome. In particular, it is easier to game with citations than with journal factors, and it is easier to game when the thresholds are low. In other words, when the admissibility threshold to become associate professor is, say, an h-index of 4, it is easier to organize a clique of reciprocating citers in a few years and overcome the threshold. If the h-index is, say, 10 or 12, the game is much more difficult. Therefore this evidence is less dramatic than is often said. It points to the need to design incentives so as to anticipate gaming behavior, not necessarily to avoid indicators altogether. As an example, it is easy to introduce a control for self-citations and to eliminate them in any consideration of indicators.
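The threshold argument and the control for self-citations can be illustrated with a minimal sketch. The candidate profile below is hypothetical, and the code uses the standard definition of the h-index. It is not meant to reproduce the procedure actually used for the Habilitation indicators.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical candidate: (total citations, of which self-citations) per paper.
papers = [(9, 2), (6, 1), (5, 2), (4, 1), (4, 3), (2, 0)]

with_self = h_index([total for total, _ in papers])
without_self = h_index([total - own for total, own in papers])

threshold = 4  # the low threshold used as an example in the text
print(with_self >= threshold, without_self >= threshold)  # prints "True False"
```

In this invented profile a handful of self-citations is enough to move the candidate across a threshold of 4, and removing them reverses the outcome. With a threshold of 10 or 12 the same handful of citations would make no difference.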
More generally, it should be remarked that this literature is largely based on case studies, university-level field observation, and conceptualizations. As De Rijcke and co-authors are led to conclude after their extensive survey, "many studies are of tentative and theoretical nature, prophesizing on potential effects rather than documenting actual consequences" (De Rijcke et al, 2016, 6). I share the comment by Sivertsen (2017), who noted that this caution is completely lost in the reception of the Metric tide report (Wilsdon et al. 2015) and in the subsequent debate.
Much more research is needed before concluding that the behavioral impact of research assessment is detrimental to the scientific enterprise.

Deterioration of pluralism
Among the negative consequences of research assessment the reduction of epistemic pluralism is often mentioned (Viola, 2017). Given the role played in research assessment by citation indicators, there is inevitably a premium on research topics that are more popular and journals that have larger impact factors. In turn, this creates a disadvantage for minority positions.
A case in point is represented by economics, in which minority positions are represented by non-mainstream scholars, working in various non-neoclassical traditions (Marxist, Sraffian, Austrian, Post-Keynesian, or institutional economics). Corsi, D'Ippoliti and Zacchia (2018; 2019) have repeatedly criticized the research assessment procedures carried out in Italy as a source of discrimination against heterodox economists. In a Research Policy paper they examine the habilitation decisions in the field of Economics and find a number of factors that led to negative outcomes. Among them are the number of books and the number of articles or chapters in books (two of the three threshold indicators) and the number of articles in heterodox journals. Promotion committees gave the habilitation almost exclusively on the basis of the number of articles published in A-rated journals, or a list of top 454 journals, irrespective of the fact that heterodox candidates had produced more articles in non-top journals, chapters and books than their mainstream colleagues (Corsi, D'Ippoliti and Zacchia, 2019). They use this evidence to join the opinion, held by several heterodox scholars in the UK after the RAE/REF experience, according to which research assessment is responsible for the elimination of dissenting views.
I believe that research assessment should be neutral with respect to epistemic differences in disciplines (Bonaccorsi, 2018b). In practical terms, scholars from mainstream and minority positions should be systematically involved in expert panels and committees. This is what happened in the first VQR 2004-2010, in which the GEV in Economics was chaired by a leading mainstream economist (Tullio Jappelli) but included, among others, the leaders of heterodox economics in the institutional (Neri Salvadori) and evolutionary (Giovanni Dosi) traditions. Interestingly, while after the 2000-2003 VTR there was a famous minority document from Luigi Pasinetti (a leading authority in structural and Austrian tradition) contesting the evaluation, in the 2004-2010 GEV all judgments and scores were approved with unanimous vote. When addressing the literature that criticizes the research assessment as a threat to pluralism, I find some arguments problematic in the causal assumptions. Scientific disciplines have a dual dynamics, one of epistemic type in which competing theories and paradigms challenge each other to address relevant scientific issues, and one of sociological and institutional type, in which the reproduction of scholarship is at stake. It is not clear whether the relative dynamics between mainstream and heterodox positions depends causally on research assessment. In order to substantiate this argument one should at least show that in countries in which there is no research assessment heterodox positions survive better or grow more than in countries subject to research assessment.

Polarisation of the higher education system
The publication of the results of VQR 2004-2010 showed a large gap between universities located in the Southern regions and those located in the North and Centre. Southern universities are placed, with limited exceptions, in the bottom part of the ranking used to allocate performance-based funding. Although the financial implications have been mitigated by the Ministry by placing an upper limit on the penalization, the impact has been almost immediate.
This has opened a large debate on the risk that the gap could be widened and made irreversible. Southern universities, which operate in less privileged areas and are subject to large student-staff ratios due to demographic pressure, the argument goes, will receive fewer and fewer resources and will never be able to improve their position. Several authors have therefore argued that research assessment in the Italian context is an instrument for perpetuating and widening spatial and social inequalities (Viesti, 2016; Grisorio and Prota, 2020). This issue is part of a more general argument according to which research assessment is invariably associated with the deepening of inequalities (Warren et al. 2020).
This argument deserves close attention. The premise of research assessment is that it can produce behavioral changes that improve the position of those below the average. Allocating resources according to the quality of research should create appropriate incentives for improving, either by placing more effort in research or by recruiting academic staff with a good research record. If, on the contrary, financial constraints or resource endowments make the improvement unlikely for poor performers, the gaps become irreversible. The importance of these dynamic effects must be examined empirically, however. A crucial issue here is that the introduction of research assessment and performance-based funding has taken place in Italy in parallel with significant cuts in the government budget, particularly after the 2008 crisis. This is a major government and political mistake, insofar as it has created the deeply held belief that performance-based funding is nothing else than a technical instrument to reduce the resources to the higher education system. Under these conditions, even marginal reductions in funding to universities in Southern regions may be serious.
From this perspective, it can be said that research assessment is held responsible for others' faults. From an empirical perspective, however, the claim that research assessment impoverishes weaker universities is not supported. Following the methodology introduced by Buckle et al. (2020) in New Zealand, Checchi et al. (2020) found that the polarisation effect is not confirmed by the empirical evidence. Comparing the research quality of universities between the VQR 2004-2010 and the VQR 2011-2014, after making scores comparable, they find a remarkable process of convergence towards the mean, or reduction of inequalities between Southern universities and universities located in North and Centre.
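What "convergence towards the mean" means in practice can be shown with a generic sketch. The scores below are invented, and the two checks (a negative slope of the change on the initial level, and a shrinking dispersion) are standard textbook diagnostics. They are offered as an assumption about how such a test can be framed, not as a description of the specification used in the cited studies.

```python
from statistics import mean, pstdev


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


# Hypothetical normalized quality scores of five universities
# in two consecutive exercises (invented values).
first = [0.2, 0.4, 0.6, 0.8, 1.0]
second = [0.4, 0.5, 0.6, 0.7, 0.8]

change = [s - f for f, s in zip(first, second)]
beta = ols_slope(first, change)        # negative: low scorers improve more
dispersion_before = pstdev(first)
dispersion_after = pstdev(second)      # smaller: scores are closer together

print(beta < 0, dispersion_after < dispersion_before)  # prints "True True"
```

Polarisation would show the opposite pattern: a positive slope and a growing dispersion between the two exercises.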
More empirical work must be done, however, to examine the unintended consequences on spatial and social inequalities, particularly under a regime of fiscal discipline.

Policy highlights and conclusions
In a relatively short time frame the Italian research and higher education systems have been subject to a pervasive introduction of evaluation. There is probably no other sector of the public administration in which all personnel is subject to evaluation in such a systematic way.
It is no surprise that the introduction of research assessment has generated a huge debate. In this paper I have distilled the most relevant criticism, from a technical and substantive perspective.
I find some of the criticism well founded and constructive. For example, it is clear that the use of journal-based indicators and the need to build up a weighting scheme are problematic from the perspective of the state of the art of evaluative informetrics (Moed, 2017). Nevertheless, they are used quite frequently in evaluation studies, according to a recent meta-analytic survey (Jappe, 2020).
I believe the methodological choices used in research assessment should be the object of open discussion. Once they are adopted, it is good policy to keep them stable for a certain number of years, in order to align the behaviours of researchers and ensure comparability of the exercises over the years. This implies some stickiness and the ability to justify the choices under the fire of criticisms.
At the same time, things may change. To give an example, if journal-level indicators are eliminated in the future, that will not imply any loss of legitimacy of the previous choices. Pragmatically, it would be useful to compare the results with an appropriate sampling approach in order to derive policy implications.
On the contrary, the criticism against the dual methodology (peer review and bibliometrics) and the argument in favour of the assessment of the entire production of scholars, instead of a submitted sample, are not realistic.
Using peer review for SSH (with a few exceptions) and bibliometrics for STEM is a methodological choice that can be defended, after appropriate normalization, in a large scale exercise. Given the impossibility to submit all products to peer review due to budget constraints, the choice to ask researchers to submit a selection of products of their choice is also good practice. Even if the submission is done not by individuals but by universities, the involvement of all researchers creates more engagement.
At the same time, the arguments that research assessment promotes research misconduct, reduces epistemic pluralism and interdisciplinarity, does not address gender discrimination, and deepens regional inequalities are conceptually and methodologically weak.
As stated above, the premise of research assessment is that those below the average may find incentives, opportunities and guidelines to improve their research performance. The overall system should monitor closely whether this happens, or whether dynamic self-reinforcing mechanisms widen inequalities and make them irreversible. To the best of my knowledge, the argument that research assessment is damaging the scientific system and overall society is not empirically well grounded.
Research assessment is by nature and purpose a social experimentation, ultimately rooted in the scientific method, hence open to criticism. Even harsh criticism. At the same time, there is no perfect research assessment. The argument "better no assessment than imperfect assessment" is ideological. We should pursue improvement, not perfection.
I hope the paper has offered a dispassionate discussion.