Developing a Method for Evaluating Global University Rankings

Describes a method to provide an independent, community-sourced set of best practice criteria with which to assess global university rankings and to identify the extent to which a sample of six rankings, Academic Ranking of World Universities (ARWU), CWTS Leiden, QS World University Rankings (QS WUR), Times Higher Education World University Rankings (THE WUR), U-Multirank, and US News & World Report Best Global Universities, met those criteria. The criteria fell into four categories: good governance, transparency, measure what matters, and rigour. The relative strengths and weaknesses of each ranking were compared. Overall, the rankings assessed fell short of all criteria, with greatest strengths in the area of transparency and greatest weaknesses in the area of measuring what matters to the communities they were ranking. The ranking that most closely met the criteria was CWTS Leiden. Scoring poorly across all the criteria were the THE WUR and US News rankings. Suggestions for developing the ranker rating method are described.


INTRODUCTION
Global university rankings are now an established part of the global higher education landscape. Students use them to help select where to study, faculty use them to select where to work, universities use them to market themselves, funders use them to select who to fund, and governments use them to set their own ambitions. While members of the international research management community are not always the ones in their institutions who deal directly with the global university ranking agencies, they are one of the groups that feel their effect most strongly. This might be through their university's exclusion from accessing studentship funding sources based on its ranking position; through requests to collect, validate, and optimise the data submitted; or through calls to implement strategies that may lead to better ranking outcomes. At the same time as having to work within an environment influenced by university rankings, the research management community are acutely aware of, and concerned about, the perceived invalidity of the approaches those rankings use.
For this reason, the International Network of Research Management Societies (INORMS) Research Evaluation Working Group (2021) decided to dedicate one of their work-packages to developing a tool by which the relative strengths and weaknesses of global university rankings might be surfaced, and used to influence behavioural change both by ranking agencies and those who rely upon them for decision-making.

LITERATURE REVIEW
The practice of ranking universities goes back to the early twentieth century when informal lists of US universities were occasionally published. The first formal and significant ranking, however, was the US News America's Best Colleges published from 1983 (Meredith, 2004), which was followed by national rankings in the UK, Canada and other countries. These national rankings have been subject to a great deal of discussion, including critical assessments of their methodologies by Dichev (2001), Bastedo and Bowman (2010) and Bowden (2000).
The first international, but not global, ranking was published by Asiaweek in 1999 and 2000 (Asiaweek, 2000), and this was followed by the first global ranking published in 2003 by Shanghai Jiao Tong University, the Academic Ranking of World Universities (ARWU). These developments are described in several sources including Usher and Savino (2007), van Raan (2007), Holmes (2010), and Pagell (2014). Since then, a large and varied literature on international rankings has emerged.
The year 2004 saw the appearance of the Ranking Web of Universities, or Webometrics rankings, which at first measured only web activity, and the World University Rankings published by Times Higher Education Supplement (THES) and the QS graduate recruitment firm. Critical accounts of the THES-QS world rankings and their successors can be found in Holmes ( & 2015. Other global, regional, subject, and specialist rankings have followed, a list of which was compiled by the Inventory of International Rankings published by the International Ranking Expert Group (IREG) (2021).
In recent years, media and academic interest has shifted to global rankings which have been severely criticised on several grounds within the international research community. General critiques are offered by Usher (2014;, , Pagell (2019), Bekhradnia (2016), Lee and Ong (2017), and Gadd (2020). Marginson and van der Wende (2007) have argued that they privilege the western model of research-intensive universities with an emphasis on the natural sciences, particularly in English speaking countries. Similarly, it has been noted that rankings favour universities that specialise in certain subjects (Bornmann, De Moya, Anegón & Mutz, 2013).
Other texts claim that rankings promote global elitism by encouraging a shift of investment towards highly ranked universities at the expense of others (Munch, 2014) and intensifying inequality within and between institutions (Cantwell & Taylor, 2013). According to Amsler and Bolsmann (2012) they are part of an exclusionary neoliberal global agenda.
Others have presented evidence that rankings encourage universities to forget their social role or to lose interest in local or regional problems. Lee, Vance, Stensaker, and Ghosh (2020), for example, report that world-class universities are likely to neglect the university third mission and those that are unranked are more concerned with the local economy and its problems. Stack (2016) has written of the pressures that rankings place on universities in the struggle for resources in a competitive and mediatised global society.
A number of writers have discussed methodological and technical issues. Turner (2005) showed that there are problems with arbitrary weightings and the conflation of inputs and outputs. Others have focussed on the validity of reputational surveys. Safón (2013), for example, has argued that the main factor measured by global rankings is institutional reputation while a study by Van Dyke (2008) noted that academics were likely to rate their own institutions higher than others, a finding that has implications for the validity of the THE reputation survey. An analysis by Safón and Docampo (2020) indicates that reputational bias influences publication data in the Shanghai rankings while Ioannidis et al. (2007) have criticised those rankings and the THES-QS world rankings for measurement error and limited construct validity. Daraio, Bonaccorsi, and Simar (2015) have focussed on problems of monodimensionality, statistical robustness, institutional size, and the inclusion of measures of output and input. Florian (2007) has argued that the Shanghai Rankings cannot be reproduced exactly and therefore are methodologically deficient.
The impact of rankings on university policy has been discussed by Docampo, Egret and Cram (2015), who suggest that they have prompted structural changes, especially in France, that may not be entirely beneficial. Cremonini et al. (2014) claim that the use of rankings to promote world class university programmes may reduce the public benefits of higher education policies.
Aguillo, Bar-Ilan, Levene and Ortega (2010) have compared rankings with different methodologies and noted that while there was a significant similarity between the various rankings, it was greater when European universities were considered. A study by Buela-Casal et al. (2007) noted significant similarities in various rankings, although not on all indicators. Moed (2017) observed that of the five key university rankings, only 35 institutions appeared in the top 100 of all of them. Piro and Sivertsen (2016) have analysed the reasons why different rankings produce different results. Bookstein, Seidler, Fieder and Winckler (2010) have observed fluctuations in the indicator scores in the THE rankings, while Vernon, Balas and Momani (2018) have cast doubt on the ability of rankings to measure and improve research. A longitudinal analysis of the university rankings by Selten et al. (2020) suggests that the indicators used do not capture the concepts they claim to measure.
Although the critiques of global rankings are wide-ranging, they have not had much influence on higher education leaders. University administrators have sometimes expressed reservations about rankings but in general have been willing to participate, occasionally breaking ranks if their institutions fall too much (Hazelkorn, 2008; 2011). Musselin (2018) explains this phenomenon in terms of university leaders utilising their ranking position as management and legitimisation devices in an increasingly competitive environment.
Some scholars believe that the defects of global rankings far outweigh any possible benefits. Adler and Harzing (2009) have gone so far as to propose a moratorium on ranking and recommend that scholars should "innovate and design more reliable and valid ways to assess scholarly contributions that truly promote the advancement of relevant 21st century knowledge, and likewise recognize those individuals and institutions that best fulfil the university's fundamental purpose." It must be noted that the academic literature does include attempts to justify the role and methodology of rankings. Sowter (2008), Baty (2013) and Wu and Liu (2017), who are representatives of ranking agencies, have described the rationale behind the various methodologies.
There are others who find some merit in the rankings. Wildavsky (2010) sees the rankings as instrumental in the development of a new academic and scientific global marketplace leading to the spread of new knowledge. Boudard and Westerheijden (2017) describe how the shock of the first Shanghai rankings for continental European universities led to far-reaching structural changes. Rodionov, Fersman and Kushneva (2016) outline how in Russia the rankings are considered an important element in improving international visibility and status. In Taiwan, Shreeve (2020) has suggested that governmental ambitions encouraged by rankings may be beneficial for the institutions concerned by providing a focus for improvement. It must be said, however, that some of these studies also observe that the characterisation of a 'top' university as defined by the rankings, and the pursuit of a better ranking position based on developing such characteristics, might not always be locally relevant or ultimately beneficial.
There have been attempts to construct rankings that avoid the various problems and defects that have been identified. Waltman et al. (2012) describe the development of the Leiden Ranking which introduced innovations designed to meet some of the criticism that had been levelled, such as fractional counting and stability intervals. Since then, the Ranking has included data about open access publications and gender equity in publication. U-Multirank was developed with support from the European Commission to provide a user-driven, participatory and multidimensional ranking providing features neglected by the dominant rankings (Van Vught and Ziegele, 2012).
Another attempt to reform international rankings was the production of the Berlin Principles on Ranking of Higher Education Institutions in 2006 (IHEP, 2006). These were produced by the International Rankings Expert Group (IREG), which consisted of both rankers and academics, and was founded by the UNESCO European Centre for Higher Education (UNESCO-CEPES) and the Institute for Higher Education Policy. The principles covered the purposes of rankings, the design and weighting of indicators, the collecting and processing of data, and the presentation of results (IHEP, 2006). The principles now underpin the criteria used to offer the 'IREG Seal of Approval' to rankings. However, Barron (2017) has questioned the extent to which these rankings meet the Berlin principles and expressed concerns that the principles seek to legitimise ranking practices by attempting to align them with academic values.
The Centre for Science and Technology Studies (CWTS) in Leiden, home of the Leiden Ranking and birthplace of the Leiden Manifesto on the responsible use of research metrics (Hicks, Wouters, Waltman, de Rijcke, and Rafols, 2015), also subsequently developed ten principles for the responsible design, interpretation, and use of university rankings (Waltman, Wouters and van Eck, 2017).
The existence of such principles is a welcome attempt to provide some best practice guidance for the design and use of university rankings. However, the fact that they were influenced and/or developed by university rankers themselves could be seen to affect their neutrality. It is also concerning that the only body currently providing any assessment of university rankers is one where rankers occupy five out of eleven seats on the Executive Committee (IREG, 2021).
It was against this background that the INORMS Research Evaluation Working Group sought to both provide an independent, community-sourced set of best practice criteria against which to assess the global university rankings and then to identify the extent to which a sample of rankings met those criteria.

METHODS
In parallel with the INORMS REWG's work to rate the global university rankings, the group also developed a framework for responsible research evaluation, called SCOPE (Himanen and Gadd, 2019). This is a five-stage process by which evaluations can be designed to consistently adhere to best practice in research assessment. In order to develop a responsible approach to evaluating the global university rankings, the SCOPE framework was adopted as follows.

START WITH WHAT YOU VALUE
The 'S' of SCOPE states that prior to any evaluation attempt there needs to be a clear articulation of what is valued about the entity under evaluation from the perspective of the evaluator and the evaluated. To this end the group undertook a literature search to develop a draft set of best practice criteria for fair and responsible university rankings. These were circulated to the international research evaluation community for comment via the INORMS REWG, LIS-Bibliometrics, and INORMS member organisation circulation lists such as that of the UK Association of Research Managers and Administrators (ARMA) Research Evaluation
It could be argued that some of the criteria are challenging to meet, especially for some of the commercial ranking agencies, for example, around conflicts of interest. However, it was felt to be important to remain true to the values of the community even where they were aspirational. As noted by Gadd and Holmes (2020) on the publication of the ratings, "just because something is difficult to achieve, doesn't mean we shouldn't aspire to it". The benefit of taking a value-led approach, such as that promoted by SCOPE, is that the evaluation is driven by what the community cares about, rather than by what might be possible or practical. Indeed, it could equally be argued that if it is not possible to rank organisations in accordance with the best practice principles developed by the communities being ranked, perhaps it is the rankings that should change, not the principles.

CONTEXT CONSIDERATIONS
The 'C' of the SCOPE framework states the importance of considering the evaluative context (why and who you are evaluating) prior to the evaluation. The purpose of this evaluation was to highlight the extent to which various ranking agencies adhered to the best practice expectations of the wider research evaluation community and to expose their relative strengths and weaknesses, with the ultimate purpose of incentivising them to address any deficiencies. By clarifying the context, the group moved away from early thoughts of 'ranking the rankings', recognising that this might only lead to self-promotion by the top-most ranked, rather than behaviour change.
As the work was undertaken by volunteers, it was not possible to assess all the global university rankings, so it was decided to test the model on six of the largest and most influential university rankings to provide a proof-of-concept. This group was selected by consulting with members of the INORMS REWG as to the most frequently used rankings in their region, and is not a reflection on their quality. The final list included ARWU, CWTS Leiden, QS World University Rankings (QS WUR), Times Higher Education World University Rankings (THE WUR), U-Multirank, and US News & World Report Best Global Universities. Although many of these ranking agencies produce more than one ranking, it was decided to focus on their flagship 'overall' global ranking product for this evaluation, as these are the rankings most commonly used and cited.

OPTIONS FOR EVALUATING
Having established our values and context, the SCOPE framework's 'O' (options for evaluating) was then considered. To run the assessment, the criteria collected at the 'values' stage were translated into assessable indicators that were felt to be suitable proxies for the criteria being assessed. As the group were seeking to assess qualities rather than quantities, it was felt to be important to give assessors the opportunity to provide qualitative feedback in the form of free-text comments, as well as scores on a three-point scale according to whether the ranker fully met (2 marks), partially met (1 mark), or failed to meet (0 marks) the set criteria.
To ensure transparency and mitigate bias, twelve international experts were identified and invited by members of the INORMS REWG to provide a review of one ranking agency. Due to the pandemic, only eight were able to provide a rating.
INORMS REWG members also undertook evaluations, and, in line with the SCOPE principle of 'evaluating with the evaluated,' each ranker was also invited to provide a self-assessment in line with the community criteria. Between one and four reviews were received for each ranking. Only one ranking agency, CWTS Leiden, accepted the offer to self-assess, providing free-text comments only.
The reviews were then forwarded to a senior expert reviewer, Richard Holmes, author of the University Ranking Watch blog. He was able to combine the feedback from our international experts with his own detailed knowledge of the rankings, supplemented by intelligence sourced from conferences and online communications, to enable a robust, expert assessment. In cases where a question was interpreted differently by reviewers, he used his judgement to decide on the most appropriate interpretation and score.

PROBE DEEPLY
The 'P' of the SCOPE framework represents 'probe' and requires that any evaluative approach is examined for discriminatory effects, gaming potential and unintended consequences. We observed some criteria where rankings might be disadvantaged for good practice, for example where a ranking did not use surveys and so could not score. This led us to introduce a 'Not Applicable' category to ensure they would not be penalised.
It was also thought to be important that we did not replicate the rankings' practice of placing multi-faceted entities on a single scale labelled 'top'. Not only would this fail to express the relative strengths and weaknesses of each ranking, but it would give one ranking agency 'boasting rights' which would run counter to what we were trying to achieve.

EVALUATE
The 'E' of SCOPE invites assessors both to evaluate and to evaluate their evaluation. The ranker assessment generated many learning points, discussed in Section 5 below, which fed into recommendations for the revision of the ranking assessment tool.

FINDINGS
The full set of attributed ranking reviews and the final calibrated review have been made openly available (INORMS, 2020). Intra-class Correlation Coefficients were calculated for each set of reviews (Table 1), which indicate moderate to good inter-rater reliability (Koo and Li, 2016). Some reflections as to how these might be improved, including clearer definitions, are provided in Section 5.
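As an illustration of how such a coefficient might be computed, the sketch below implements one common form, the two-way random-effects, absolute-agreement, single-rater ICC (often written ICC(2,1)), in plain Python. The text does not state which ICC form was used, so this choice, the function names, and the example marks are assumptions made for illustration only. The interpretation bands follow the thresholds given by Koo and Li (2016). This form also assumes that every reviewer scored every indicator, which may not hold where a ranking received a single review.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per assessed indicator and one
    column per reviewer. Every reviewer must have scored every indicator.
    The choice of ICC(2,1) is an assumption; the source does not name a form.
    """
    n = len(ratings)                 # number of indicators (subjects)
    k = len(ratings[0]) if n else 0  # number of reviewers (raters)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when there is no variance in the ratings")
    return (ms_rows - ms_error) / denominator


def interpret_icc(value):
    """Reliability bands as described by Koo and Li (2016)."""
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value <= 0.9:
        return "good"
    return "excellent"


# Hypothetical marks (0, 1 or 2) from three reviewers on five indicators.
# These numbers are invented for illustration and are not taken from the study.
example_marks = [
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 1, 2],
    [0, 0, 0],
]
coefficient = icc_2_1(example_marks)
print(round(coefficient, 2), interpret_icc(coefficient))
```

Other ICC forms (for example, consistency rather than absolute agreement, or average-rater rather than single-rater) would give different values from the same marks, so any reported coefficient should state which form was used.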

GOOD GOVERNANCE
The five key expectations of rankers with regards to good governance were that they engaged with the ranked, were self-improving, declared conflicts of interest, were open to correction and dealt with gaming. The full criteria and indicators are listed in Table 2.

Engaged with the ranked
One of the SCOPE principles is to evaluate with the evaluated, and the community felt that having continued engagement with both the faculty and leadership of organisations that they ranked was an important activity. The rankings tended to score well here with most having advisory boards, and all engaging in some form of outreach activity.

Self-improving
One of the biggest concerns about the rankings is their methodological imperfections. This question sought to highlight that ongoing improvement was an essential activity for ranking agencies. Again, all rankers either fully or partially met this criterion.

Declare any conflicts of interest
There was a belief amongst the community that ranking agencies should remain independent in order to fairly rank universities. As such, where there were conflicts of interest, i.e., where rankers sold their data or provided consultancy services to institutions with the ability to pay for it, this should be declared. No ranker fully met these expectations, and all received at least one zero in this section.

Open to correction
The community felt that an important aspect of good governance was that any errors drawn to ranking agencies' attention should be corrected and clearly indicated as such. Where data was drawn entirely from third parties it was felt that this criterion was not applicable. In all other cases, HEIs were given some opportunity to check the data prior to the ranking being compiled. In most cases there was some line of communication by which HEIs could notify ranking agencies of errors, but only CWTS achieved full marks for clearly listing corrected errors.

Deal with gaming
The rewards associated with a high ranking position are such that 'gaming' is a regular feature (Calderon, 2020). The community were concerned that ranking agencies recognised this and took steps to address gaming where it was drawn to their attention. Where third-party data was used, again, this was not thought to be an applicable criterion. The other ranking agencies all made some effort in this space, with full or partial compliance.

TRANSPARENCY
Transparency was very important to the community with many respondents making reference to the 'black box' nature of many rankings' approaches. The five expectations of rankers here were that they had transparent aims, methods, data sources, open data and financial transparency. The full criteria and indicators are listed in Table 3.

Transparent aims
All rankers were either fully or partially transparent about the aims of their ranking and its target groups. Of course, transparency about their aims is not the same as successfully meeting them.

Transparent methods
The requirement of transparent methods was particularly important to the research management community, as many are asked to reverse engineer their institution's ranking position and make predictions about future performance. Whilst most rankers fully met expectations around publishing their methods and indicators, in only one case (ARWU) was it thought to be possible for a third-party with access to the data to be able to replicate the results.

Transparent data availability
Questions around data availability required rankers to describe both their sources and their parameters in detail, with a specific question regarding the ability to correct data. Again, all rankers fully or partially met these criteria.

Open data
In addition to data being fully described, it was felt to be important that this was also openly available for the community to scrutinise and work with. Only ARWU received full marks on this, with other rankings making some data available.

Financially transparent
As with the declaration of conflicts of interest, the community were keen that ranking agencies were financially transparent, revealing sources of income. Only U-Multirank fully met this criterion, with four out of the remaining five failing to meet it.

MEASURE WHAT MATTERS
The five expectations of rankers here were that they drove good behaviour, measured against mission, measured one thing at a time (no composite indicators), tailored results to different audiences and gave no unfair advantage to universities with particular characteristics. The full criteria and indicators are listed in Table 4.

Drive good behaviour
Given the widely acknowledged limitations of university rankings, the community felt it was important that ranking agencies themselves did their best to highlight these limitations on their products. CWTS Leiden and U-Multirank clearly did so; ARWU and US News did not; and QS and THE made some reference to them, which was felt to be undermined by their repeated reference to their rankings being 'trusted' or 'excellent' sources.

Measure against mission
Whilst universities largely seek to offer teaching and research in some form, their missions and other characteristics, such as size and wealth, are hugely varied. The community felt it was important that rankers provided a facility by which institutions could be compared to others with similar characteristics rather than grouping all together on a single scale. Only U-Multirank and CWTS Leiden avoided offering one single over-arching ranking that sought to identify the 'top' universities, with only U-Multirank providing a facility by which institutions could be compared with those sharing their mission. All provided subject-based comparisons and most provided some qualitative data on the organisations being ranked.

One thing at a time
A criterion related to that specifying that rankers should avoid using a single scale of excellence was that they should avoid composite metrics that use pre-set weightings regardless of whether institutions weight their focus in the same way. Again, only CWTS Leiden and U-Multirank avoided composite metrics, thus achieving full marks.

Tailored to different audiences
Recognising that rankings are used by different audiences for different purposes, the community felt it important that the ranking data collected was delivered in different formats according to the interests of these different audiences. No ranking fully met expectations here (although through the avoidance of composite indicators, this was not thought to be an applicable question for CWTS Leiden). Most others scored poorly.

No unfair advantage
Whilst living in an 'unfair' world, it was still felt to be important to avoid offering an unfair advantage to universities with particular characteristics (size, discipline, geography and English language-use) as far as possible. While it was felt that all rankings made some effort in this space, none scored full marks.

RIGOUR
The five expectations of rankers in this section were around rigorous methods, no 'sloppy' surveys, validity, sensitivity and honesty about uncertainty. The full criteria and indicators are listed in Table 5.

Rigorous methods
A common complaint regarding the use of rankings by scientific organisations is that they use methods that those organisations would not consider valid in their own practices. Two questions around field normalisation and the handling of outliers yielded mixed results with some efforts around both but very few exemplars of best practice.

No 'sloppy' surveys
While the community were not against survey methodologies per se, there was a strong sense that the rankers' use of surveys was problematic, with questionable practices employed around samples and question choice. ARWU and CWTS Leiden avoided using surveys altogether, thus achieving full marks on this criterion. Among the rankings that did use surveys, sample sizes tended to be large, but they were rarely random and not always thought to be representative. There was no evidence of reliability testing on questions, nor that the questions were entirely valid.

Validity
In any evaluation approach it is important that the indicators used are a valid enough proxy for the quality being measured. Only CWTS Leiden and U-Multirank were thought to fully meet this requirement, with the other rankers making some efforts but falling short of expectations.

Sensitivity
Gingras (2014) has noted the importance of avoiding monotonic indicators in evaluation approaches where a 'good' score will vary according to the mission or disciplinary mix of the organisation (e.g., staff:student ratios). He also highlighted the importance of evaluation outcomes varying only in accordance with real (and often slow-moving) changes within those organisations. Only CWTS Leiden scored full marks on these indicators, with others demonstrating less, or no, compliance.

Honest about uncertainty
In any evaluative data there are always going to be levels of uncertainty around the confidence with which the results can be relied on. The community were keen that confidence levels made visible the relatively small differences between those organisations at different ranking positions. Again, only CWTS Leiden clearly expressed the limitations around their methodologies and provided stability indicators for their rankings. Others made some efforts with regards to the former but failed to score on the latter.

Table 5 (excerpt). Rigour criteria D4 and D5 and their indicators
D4 Sensitivity. Indicators are sensitive to the nature of the characteristic they claim to measure.
D4.1 Does the ranking AVOID include monotonic indicators for which a good value will depend on the mission of the university, e.g., staff-student ratio; international-non-international staff ratio.
D4.2 Are ranking results relatively stable over time? E.g., are improvements in rank likely to reflect true improvements in University performance?
D5 Honest about uncertainty. The types of uncertainty inherent in the methodologies used, and of the data being presented should be described, and where possible, clearly indicated using error bars, confidence intervals or other techniques, without giving a false sense of precision.
D5.1 Does the ranking website provide any commentary on the limitations and uncertainties inherent within their methodologies?
D5.2 Does the ranking provide error bars or confidence intervals around the indicators provided?

Gadd et al. Scholarly Assessment Reports DOI: 10.29024/sar.31

SUMMARY
Figure 1 illustrates the relative strengths and weaknesses of each global university ranking by dividing their actual scores for each section by the total possible score they could have achieved, having removed all 'not applicable' criteria. It shows that in terms of good governance, QS achieved the highest scores, closely followed by U-Multirank and CWTS Leiden. In terms of transparency, ARWU and CWTS Leiden performed the best, again with U-Multirank close behind. The strongest performer on 'measure what matters' was U-Multirank with CWTS Leiden closely following. Finally, in terms of rigour, CWTS Leiden was significantly stronger than all the other rankings. Scoring consistently poorly across all the community-developed criteria were the THE WUR and US News rankings. Figures 2-7 provide a more granular picture for each of the rankings assessed by looking at their scores on each of the twenty criteria. Where a criterion was not applicable it was removed from the chart and indicated as such.
Whilst the rankings that score better on these indicators may feel pleased with their performance, it is important to note that the community expectations are set at 100% adherence to the criteria. The closest any of these ranking agencies came to that was CWTS Leiden, which scored 100% on nine of the eighteen criteria deemed to be applicable to it. Indeed, when you look at the average scores for all six rankers across the four categories (Figure 8) you can see that overall, the ranking sector falls considerably short of all criteria, with the greatest strengths in terms of transparency and the greatest weaknesses in terms of measuring what matters to the communities they are ranking.
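The normalisation described in this summary can be sketched as follows: marks on the three-point scale are summed and divided by the maximum achievable once 'not applicable' items are removed. The function names and the example marks are hypothetical, and treating the sector-level figure as a simple mean of the per-ranking proportions is an assumption, since the text does not specify how the averages were calculated.

```python
MAX_MARK = 2  # 'fully met' on the three-point scale (0, 1 or 2 marks)


def section_score(marks):
    """Proportion of available marks achieved in one section.

    `marks` is a list of 0, 1 or 2, with None standing for 'not applicable'.
    Not-applicable items are dropped from both numerator and denominator.
    Returns None if nothing in the section was applicable.
    """
    applicable = [m for m in marks if m is not None]
    if not applicable:
        return None
    return sum(applicable) / (MAX_MARK * len(applicable))


def sector_average(scores):
    """Simple mean of the available section scores across rankings.

    Assumption: an unweighted mean of proportions; the source does not say
    how the sector-level averages were derived.
    """
    available = [s for s in scores if s is not None]
    if not available:
        return None
    return sum(available) / len(available)


# Hypothetical marks for one section; not taken from the published reviews.
example_section = {
    "Ranking A": [2, 1, None, 0, 2],
    "Ranking B": [1, 1, 1, None, None],
}
per_ranking = {name: section_score(marks) for name, marks in example_section.items()}
print(per_ranking)
print(sector_average(per_ranking.values()))
```

Dropping not-applicable items from the denominator is what prevents a ranking from being penalised for, say, not running a survey at all.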

REVISED RANKING ASSESSMENT TOOL
The process of piloting the ranker rating tool surfaced many helpful learning points, including feedback from one of the ranking agencies under assessment, CWTS Leiden, which it would be useful to incorporate into any future iteration. These are outlined below.

START WITH WHAT YOU VALUE
By putting out a 'straw person' list of draft criteria for fair and responsible university rankings and inviting free-form feedback, useful input was received. However, all the criteria were given equal weight in the resulting assessment tool. This may not reflect community expectations, with some criteria being of paramount importance and other criteria holding less importance. Future iterations of the assessment tool may wish to revisit both the chosen criteria and the resulting weightings via some kind of survey instrument. A survey may have the benefit of reaching a wider audience through the relative ease of completion and may enable a more nuanced assessment of ranking agencies.
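To make the effect of weighting concrete, the sketch below compares equal weights, as used in the pilot, with a set of unequal weights of the kind a survey might produce. Everything here is hypothetical: the function `weighted_score`, the scores, and the weights are invented for illustration, and only the criterion labels (A1, B1, C5, D1) come from the assessment tool.

```python
from typing import Dict, Optional


def weighted_score(
    scores: Dict[str, Optional[float]],
    weights: Dict[str, float],
) -> Optional[float]:
    """Weighted percentage across applicable criteria.

    `scores` maps a criterion ID to the fraction of the criterion met
    (0.0 to 1.0), or None if the criterion is not applicable.
    `weights` maps the same IDs to non-negative importance weights.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for criterion, score in scores.items():
        if score is None:
            continue  # drop 'not applicable' criteria together with their weight
        weight = weights[criterion]
        total_weight += weight
        weighted_sum += weight * score
    if total_weight == 0:
        return None
    return 100.0 * weighted_sum / total_weight


# Made-up scores and weights, for illustration only.
scores = {"A1": 1.0, "B1": 0.5, "C5": None, "D1": 0.0}
equal_weights = {"A1": 1.0, "B1": 1.0, "C5": 1.0, "D1": 1.0}
survey_weights = {"A1": 3.0, "B1": 1.0, "C5": 2.0, "D1": 1.0}

print(weighted_score(scores, equal_weights))   # equal weighting
print(weighted_score(scores, survey_weights))  # hypothetical survey weights
```

With these invented numbers the same set of ratings yields a noticeably different overall score once one criterion is weighted more heavily, which is why the choice of weights would itself need community input.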

CONTEXT CONSIDERATIONS
The selection of ranking agencies and the focus on their flagship ranking for this pilot was a pragmatic choice due to time constraints and the need to recruit reviewers. However, a more complete assessment of the increasing number of global rankings would ideally be extended both to additional rankers and to additional rankings (e.g., subject rankings).
Due to the impact of COVID-19 on workloads and the resulting availability of expert reviewers, some rankers only received one expert review and one senior expert reviewer calibration in this exercise. In an ideal world each ranking would receive a minimum of two expert reviews plus calibration. Even better, in line with the principle of evaluating with the evaluated, each ranking agency would submit a self-assessment to fill in any gaps where information is not publicly available or not known to the expert reviewers. If such assessments grow in popularity and visibility, it may be that more ranking agencies become willing to provide a self-assessment to make the case for their activities.

OPTIONS FOR EVALUATING
It was noted by reviewers that multi-part questions such as D1.2 "Does the ranking clearly state how it handles outliers and is this fair?" were difficult to assess. In future, such questions should be split into two. The other over-arching recommendation is that a more granular scoring system, perhaps across a five-point scale, would allow for fairer assessment. In the current exercise the use of 'partially meets' covered a whole range of engagement with the stated criterion, from slightly short of perfection to a little better than fail.
There were also some issues with particular questions, as outlined below.
A1 Engage with the ranked. Rankers should score less well if their engagement activity was simply marketing and promotion.
B1 Transparent aims. Rankers should score less well if they make claims to identify the 'best' or 'top' institutions.
B2 Transparent methods. Rankers should score less well if their methods were not transparent about their normalisation mechanisms.
B3 Transparent data availability. Remove question B3.3 (Does the agency provide clear opportunities for errors to be corrected?) because it is very similar to A4.3 (Are corrected errors clearly indicated as such?).
B4 Open data. Reward ranking agencies for the use of open standards for making data openly available.
C5 No unfair advantage. It was felt that such requirements were impossible to meet even by the best-intentioned of rankers due to the inequalities inherent in society. It was therefore proposed to retain the heading 'no unfair advantage' as an important principle, but to amend the subcriteria to reward those that seek to reduce disadvantage along these lines.
D1 Rigorous methods. On criterion D1.1, CWTS Leiden have argued that normalisation may not always be appropriate and that, where a ranking can justify its normalisation decisions, it should therefore receive full marks. For example, indicators of gender balance should not be normalised by field, but by representation in the global population.

On criterion D5.2 (Does the ranking provide error bars or confidence intervals around the indicators provided?) it was pointed out that error bars may in fact introduce a false sense of certainty about the level of uncertainty due to the challenges of properly quantifying error.
Other questions that might be useful to include would relate to the user-friendliness of the ranking web page, perhaps under C4 'Tailored for different audiences', and the number of universities included in the ranking, perhaps under C5 'No unfair advantage'.

CONCLUSIONS
Global ranking agencies have a significant influence on the strategic and operational activities of universities worldwide, and yet they are unappointed and unaccountable. As a research management community we believe that there is a strong argument for providing an open and transparent assessment of the relative strengths and weaknesses of the global university rankings to make them more accountable to the higher education communities being assessed. We believe that the approach described in this report, as refined, offers a fair and transparent tool for running such assessments.
The findings of this exercise highlight that those rankings that are closest to the universities being assessed, the CWTS Leiden Ranking run by a university research group, and U-Multirank run by a consortium of European Universities, tended to better meet the community's expectations of fairness and responsibility. Unfortunately, those rankings that are more highly relied upon by decision-makers, such as the Times Higher Education World University Ranking and the US News and World Report ranking, tended to score less well. However, all rankings fell short in some way and this work highlights where they might focus their attention.
One of the challenges of short-term project-based work such as this is long-term sustainability and influence, options for which are now being explored by the INORMS REWG. As well as drawing this work to the attention of ranking agencies, it needs to also reach those relying on rankings data for decision-making. This is also one of the next steps for the group. Overall, this work has been warmly welcomed by the HE community, and by some of the rankers assessed. We hope that the next iteration of this tool, revised in line with our recommendations, will play a formative role in improving the design of university rankings and limiting their unhelpful impacts on the HE community.

ADDITIONAL FILE
The additional file for this article can be found as follows:
• INORMS REWG Ranker Ratings Data. Qualitative and quantitative ranker ratings.