Towards creating responsible metadata consumers…
The first commitment of the Barcelona Declaration states that ‘We will make openness the default for the research information that we use and produce’, but who ‘we’ are is critical in understanding all of our roles and responsibilities in the research ecosystem. Funders, publishers, infrastructure providers, institutions and researchers all have different ways of interacting with data in their contexts as producers, consumers and aggregators of data.
The Barcelona Declaration is perhaps the first document to begin to frame community responsibility with regard to consuming open metadata. Yet it is just that, a beginning: we believe that understanding in granular detail what should be expected from each part of our ecosystem is critical in making Barcelona actionable and in driving us forward into a more open metadata landscape. Indeed, open metadata is only important if we commit to using it in our practice, allowing it to shape the way that we interact across the research world.
A commitment to consume, however, still requires us to pay attention to the type of open metadata that we use, the contexts in which we apply it, and the expectations that we place on others when doing so. Without explicitly articulating our roles both as creators and as consumers of research metadata, we risk creating an open, yet untrusted research landscape.
Not all metadata are the same
There is a fundamental asymmetry between production and consumption (and also aggregation). Whilst the responsibilities associated with creating metadata are relatively easy to articulate, the responsibilities around consuming and aggregating metadata are less well thought through because, until now, they have been the less pressing issue. (Indeed, Barcelona makes it clear that we have reached a milestone in that we now need to consider this issue.) We argue that responsibilities around consumption are contextual in nature, depending on the provenance of the metadata itself, and work needs to be put into articulating these responsibilities for each participant and use case. In the context of the recent Barcelona Declaration, then, it is useful to explore some of the different ways metadata can be created and then to explore what responsibilities could result for consumers.
Within the Barcelona Declaration there are (at least) three different sorts of metadata records that are implicitly referred to:
Open metadata records
Open metadata records are those that have been created from inception with open research principles in mind. For example, a publication created under these principles will have an ORCiD associated with each researcher and a ROR ID associated with each affiliation. Within the body of the publication (and its metadata), funding organisations will be linked to their Open Funder Registry ID (or ROR ID), and the grant itself will be linked to an open, persistently identified grant record (for example via the Crossref grant linking system). The publication itself (along with a rich metadata representation) will be associated with a DOI, and all references that resolve to a DOI will also be openly available. When we speak about open here, we have in mind a CC0 licence for these data. Within the paper itself we might expect to see other links such as a link to a data repository, along with other trust markers that establish the provenance of the paper and situate it within the norms of good research practice. We might have similar expectations for grants, datasets, research software code, and other research objects.
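To make the description above concrete, here is a minimal sketch of how such a record might be modelled and checked. The class names, field names and the `missing_open_elements` helper are our own illustrative assumptions, not part of any standard schema, and every identifier in the example is a placeholder rather than a real ORCiD, ROR ID or DOI.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    name: str
    orcid: Optional[str] = None            # researcher identifier
    affiliation_ror: Optional[str] = None  # ROR ID of the affiliation


@dataclass
class Publication:
    doi: Optional[str]
    licence: Optional[str]                 # licence of the metadata, e.g. "CC0"
    authors: List[Author] = field(default_factory=list)
    funder_ids: List[str] = field(default_factory=list)  # Open Funder Registry or ROR IDs
    grant_ids: List[str] = field(default_factory=list)   # persistently identified grants


def missing_open_elements(pub: Publication) -> List[str]:
    """List the elements described in the text that this record lacks."""
    gaps = []
    if not pub.doi:
        gaps.append("publication DOI")
    if pub.licence != "CC0":
        gaps.append("CC0 licence")
    for author in pub.authors:
        if not author.orcid:
            gaps.append(f"ORCiD for {author.name}")
        if not author.affiliation_ror:
            gaps.append(f"ROR ID for affiliation of {author.name}")
    if not pub.funder_ids:
        gaps.append("funder identifier")
    if not pub.grant_ids:
        gaps.append("grant identifier")
    return gaps


# All identifiers below are placeholders for illustration only.
example = Publication(
    doi="10.5555/placeholder",
    licence="CC0",
    authors=[
        Author("A. Author", orcid="0000-0000-0000-0001",
               affiliation_ror="https://ror.org/placeholder"),
        Author("B. Author", orcid="0000-0000-0000-0002"),
    ],
    funder_ids=["funder-id-placeholder"],
)

print(missing_open_elements(example))
```

A check like this only tests for the presence of identifiers. It says nothing about how they were assigned, which is the distinction that the next two categories of record turn on. Unfunded work would also legitimately lack funder and grant identifiers, so a real check would need to allow for that.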
Algorithmically enhanced records
Algorithmically enhanced records are metadata records that have had elements derived from algorithmic processing that was not part of the original record. The algorithm may not be open, the approach used may not be known and the probability that the metadata is correct may also not be known. (This is something of a hidden variable in many analyses today – it is generally assumed that data in an article may have statistical variances but that metadata describing an article does not.) Many publication records that have been created over time do not meet our current requirements for metadata openness. Either the technology (or identifier infrastructure) did not exist at the time that they were created, or good metadata practices have yet to take hold within the context in which the record was created. For records such as these, algorithms are used to enhance the record with identifiers. Prominent examples include algorithms that are used to identify institutional affiliations, but also to reconstruct researcher identities. Algorithms can also be used to enhance the description of a record by adding links to external research classifications that would never have existed in the original metadata.
This type of data is likely to become more and more commonplace as LLMs and other AI systems become more easily and cheaply available. Hence, it is likely that for some years to come metadata will have inbuilt, statistically generated inaccuracies, which may be ignored by the community at large if they can be proven to be negligible in key analyses.
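Whether an inaccuracy is negligible depends on the scale at which the metadata is used. The sketch below works through the arithmetic for a hypothetical algorithm whose assertions are correct 98% of the time. That figure, the record counts and the assumption that errors are independent are all ours, chosen purely for illustration.

```python
def expected_errors(n_records: int, precision: float) -> float:
    """Expected number of incorrect assertions among n_records."""
    return n_records * (1.0 - precision)


def prob_at_least_one_error(n_records: int, precision: float) -> float:
    """Probability that at least one of n_records assertions is wrong,
    assuming errors are independent (a simplification)."""
    return 1.0 - precision ** n_records


PRECISION = 0.98  # hypothetical share of generated assertions that are correct

# A large aggregate analysis: the error rate stays at 2% of records.
aggregate_errors = expected_errors(50_000, PRECISION)

# A single researcher profile with 20 algorithmically assigned papers.
profile_risk = prob_at_least_one_error(20, PRECISION)

print(f"Expected wrong assertions in 50,000 records: {aggregate_errors:.0f}")
print(f"Chance a 20-paper profile contains an error: {profile_risk:.0%}")
```

Under these assumptions, about 1,000 wrong assertions in 50,000 may well be tolerable in an aggregate trend analysis, yet roughly one in three individual 20-paper profiles would contain at least one error. The same metadata can be fit for one purpose and unfit for another.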
Institutionally enhanced metadata records
Institutionally enhanced metadata records are those enhanced through university processes for the purposes of institutional and government reporting. These records, harvested from multiple sources or manually curated, may have additional metadata associated with them. An author on a paper might be associated with an institutional ID, and new research classifications might be added along with links to datasets. These institutional records might be made public through institutional profiles or syndicated to larger state or national initiatives.
What are our responsibilities when using and reusing research metadata?
The text of the Barcelona Declaration treats all three types of metadata that we have defined above as being on an equal footing: to be shared under a CC0 licence, allowing an unrestricted ability to reuse. Issues of licence aside, the way we reuse metadata should be informed by the provenance of the created information.
When considering how to implement the objectives of the Barcelona Declaration then, it is worth thinking carefully about a general approach to the responsibilities associated with reuse. As with the Barcelona Declaration, we propose these as a beginning and a discussion rather than an absolute. Refining these responsibilities will take community discussion.
Here are three responsibilities that we think would be useful to begin the conversation:
Responsibility 1. The purpose for which a piece of metadata is intended to be used must place a limit on both the scope (types of interpretation) and range (geographical, subject or temporal extent) under which it can be responsibly used
Beyond considerations of openness, the context of the data that is being propagated needs to be considered. Metadata is generated for a purpose, and that purpose defines the accuracy and care with which the metadata is created. It also defines the limits and responsibilities for maintaining its accuracy.
For institutions, the Barcelona Declaration explicitly identifies Current Research Information Systems (CRIS systems) as one mechanism to make research information open. It is required that all relevant research information can be exported and made open, using standard protocols and identifiers where available. This requirement builds on a movement, initially gaining traction around 2010 with the VIVO and Harvard Catalyst profiles projects funded by the NIH. The key use case for these public profiles has been expertise finding, whether at the institution, state, or national level. The key insight of this movement is that information collected for internal reporting and administrative purposes could also be used to create public profiles – a single source of information efficiently driving multiple uses. In some cases the approach of CRIS aggregated information has been taken further to create state-based portals such as the Ohio Innovation Exchange, or national open research analytics platforms such as Research Portal Denmark. Although successful, the nature of the provenance of these records means that there are practical limitations to the way the information can be reused beyond these applications.
Implicit in the name of a CRIS is a key limitation. CRISs are used to maintain/modify/aggregate information about ‘current’ researchers. There is (for an institution) no implied duty of care for the maintenance of public information about past staff. Indeed, from the perspective of expertise finding it may be inconvenient to have these profiles remain discoverable in the same way.
Metadata within CRIS systems are also often collected for a politically aligned purpose such as the demonstration of value to voters (which is often presented as a national purpose in the form of government reporting), and this can lead to unbalanced metadata records when used in a broader context. For instance, publications recorded for the purposes of national reporting might very accurately record the researcher affiliations within a country, but will be significantly less accurate on international affiliations, on which the reporting exercise has little bearing.
Records can become unbalanced in other ways too: research can be classified to reflect the goals of the individual reporting exercises (a point that we wrote about in detail in our article on FoR Classification) in terms of the classifications that are applied, the time and effort with which those classifications are maintained, and the scope of research classified. If there is a purpose to reusing this classification metadata in a different context, the provenance under which it was recorded must be maintained and understood.
A potential interpretation of the Barcelona Declaration could be that all metadata must be curated with the understanding that it will be used and consumed within the broader research community in perpetuity. If this is the intended interpretation, then we should be realistic about the extra effort that this requires, both in the curation itself and in the structures that should be put around the codification and documentation of data curation approaches. This interpretation also immediately raises several practical questions: Does the storing and passing on of a metadata record imply a responsibility to keep it up to date forevermore? What inequalities would this interpretation place on the broader research community? Specifically, does this interpretation advantage the “metadata rich” (those with the infrastructure to invest in improving records) and disadvantage the “metadata poor” (those who have poor embedded mechanisms or post hoc mechanisms for the curation of metadata)? This concern is not hypothetical, as the current lack of visibility of African research has hindered efforts to comprehensively understand, evaluate and build upon research from African nations.
There are of course already remedies to address many of the persistence challenges associated with making institutional metadata open. One mechanism is to transfer the responsibility for the metadata from the institution to the individual researcher via their ORCiD. Within this workflow, researchers remain responsible for maintaining a public record of their outputs, and institutions can maintain responsibility for asserting when a researcher worked for them. Coupled with a national push to publish research in open access journals and repositories, the Barcelona Declaration complements the approach taken by national persistent identifier strategies as they move towards PID-optimised research cycles.
Responsibility 2. Machine-generated metadata should not be propagated beyond the systems for which it was created, without human curation or verification
Machine-generated metadata, such as the association of an institutional identifier to an address expressed as a string, research classifications, or algorithmically determined researcher IDs, is generated within precision and recall tolerances. These tolerances are set by system providers, and are aligned with the requirements of their users. Individual statements, however, are not guaranteed against any particular record. What is more, algorithmically generated data can be regenerated as methods improve, potentially invalidating records from previous runs. This notion defines a hitherto overlooked metadata provenance. Without accompanying provenance, metadata can be considered to have ‘escaped’ from its originating system and runs the risk of being ‘orphaned’, with no ability to be updated or appropriately contextualised. To move an algorithmically generated metadata record out of the context of the system for which it was created is to take ownership of the provenance and the statements that can result from its use.
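One way to picture this responsibility is as an export rule: a statement only leaves its originating system if it carries provenance and has been made or verified by a person. The following is a minimal sketch under that assumption. The `Assertion` fields and the `exportable` function are hypothetical names of our own, not an existing schema or API, and the identifiers are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Assertion:
    subject: str                          # e.g. a DOI
    predicate: str                        # e.g. "affiliation"
    value: str                            # e.g. a ROR ID
    source_system: str                    # system in which the statement was generated
    method: str                           # "human" or "algorithm"
    method_version: Optional[str] = None  # which run of the algorithm produced it
    verified_by: Optional[str] = None     # who curated it, if anyone


def exportable(assertions: Iterable[Assertion]) -> List[Assertion]:
    """Keep statements that were made by a person or verified by one."""
    return [
        a for a in assertions
        if a.method == "human" or a.verified_by is not None
    ]


# Placeholder identifiers, for illustration only.
statements = [
    Assertion("10.5555/p1", "affiliation", "ror:placeholder-a",
              source_system="platform-x", method="algorithm", method_version="run-7"),
    Assertion("10.5555/p2", "affiliation", "ror:placeholder-a",
              source_system="platform-x", method="algorithm", method_version="run-7",
              verified_by="library-curation-team"),
    Assertion("10.5555/p3", "affiliation", "ror:placeholder-b",
              source_system="publisher-deposit", method="human"),
]

for a in exportable(statements):
    print(a.subject, a.method, a.verified_by)
```

Recording `method_version` alongside each statement is what would allow a downstream consumer to recognise that an assertion came from a superseded run, rather than leaving it orphaned.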
Whilst not so much of a problem for publications (an updated version of the record can always be requested using the DOI), this is particularly problematic for algorithmically generated researcher IDs, as (in the case of an identifier that refers to more than one person) improved algorithms could radically change the identity of the researcher that the identifier refers to. In the case of a researcher record that is split because it was really two researchers, the existing researcher ID could end up pointing to a different researcher.
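A toy example may help to show how a split invalidates copies held elsewhere. Everything here is invented for illustration: the researcher IDs, the paper labels, and the rule that the original ID stays with one of the two people after the split.

```python
from typing import Dict, List, Set

# Run 1: the algorithm wrongly merges a chemist and a historian under one ID.
run_1: Dict[str, Set[str]] = {
    "R001": {"paper-chem-1", "paper-chem-2", "paper-hist-1"},
}

# A downstream consumer copies "paper -> researcher ID" without provenance.
exported: Dict[str, str] = {
    paper: researcher_id
    for researcher_id, papers in run_1.items()
    for paper in papers
}

# Run 2: an improved algorithm splits the cluster. In this sketch the
# historian keeps R001 and the chemist is given a new ID.
run_2: Dict[str, Set[str]] = {
    "R001": {"paper-hist-1"},
    "R002": {"paper-chem-1", "paper-chem-2"},
}


def stale_assertions(copied: Dict[str, str],
                     current: Dict[str, Set[str]]) -> List[str]:
    """Papers whose copied researcher ID no longer matches the source system."""
    owner = {paper: rid for rid, papers in current.items() for paper in papers}
    return sorted(p for p, rid in copied.items() if owner.get(p) != rid)


print(stale_assertions(exported, run_2))
```

In this sketch the consumer's copy still attributes both chemistry papers to R001, which now identifies the historian. Nothing in the exported data signals that anything has changed.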
The Barcelona Declaration is right to focus on data sharing practices using standard protocols and identifiers where available. But here too, care must be taken to assess where metadata has come from, as many algorithms associate a persistent identifier with a metadata record. For instance, if an ORCiD is used instead of an internal researcher ID to refer to a researcher, but the set of assertions that are produced has been algorithmically generated, then communicating these assertions outside of the system in which they were generated breaks the model of trust established by ORCiD.
Responsibility 3. Ranking platforms should be independent of the data aggregations from which they are drawn
A key use case enabled by algorithmically generated metadata is comparative research performance assessment, often encoded in rankings systems. At first glance, this responsibility may appear to be incompatible with responsibility 2 – if metadata should be strongly coupled to its provenance and context, why should it be divorced from the ranking use case? We regard this issue as being similar to a separation of evaluation bodies and those being evaluated. Because of the different choices that different scientometric platforms make with regard to precision and recall, the same ranking methodology can lead to different results when implemented over different scientometric platforms. However, rankings systems are often entangled with single systems, providing perverse incentives for institutions to engage (both in terms of investment and data quality feedback) with one dataset over another.
One benefit of the Barcelona Declaration's focus on persistent identifiers is that information assessment models can (and should) be constructed without reference to individual scientometric datasets. By decoupling data aggregations from the rankings themselves, we allow new data aggregation services to emerge without locking in single sources of truth. In this way scientometric data sources should be treated like Large Language Models (LLMs) – extraordinarily useful, but with an ability to swap out one for another. Perhaps we need to add another R (replaceable) to FAIR data principles for scientometric datasets.
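In software terms, this decoupling amounts to writing the ranking method against an interface expressed only in persistent identifiers, so that any data aggregation can be plugged in. The sketch below assumes a deliberately trivial methodology (counting publications per institution) and two invented sources. The `PublicationSource` interface, the class names and all identifiers are our own illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List, Protocol, Tuple


class PublicationSource(Protocol):
    """Any aggregation that can report (DOI, institution ID) pairs."""

    def affiliations(self) -> Iterable[Tuple[str, str]]:
        ...


class StaticSource:
    """A stand-in for a scientometric platform, backed by a fixed list."""

    def __init__(self, pairs: Iterable[Tuple[str, str]]) -> None:
        self._pairs = list(pairs)

    def affiliations(self) -> Iterable[Tuple[str, str]]:
        return iter(self._pairs)


def rank_by_output(source: PublicationSource) -> List[Tuple[str, int]]:
    """Rank institutions by publication count; ties are broken by identifier."""
    counts = Counter(inst for _doi, inst in set(source.affiliations()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical platform favouring recall: more affiliations are matched.
source_a = StaticSource([
    ("doi-1", "ror:inst-a"), ("doi-2", "ror:inst-a"),
    ("doi-3", "ror:inst-b"), ("doi-4", "ror:inst-b"), ("doi-5", "ror:inst-b"),
])

# Hypothetical platform favouring precision: uncertain matches are dropped.
source_b = StaticSource([
    ("doi-1", "ror:inst-a"), ("doi-2", "ror:inst-a"),
    ("doi-3", "ror:inst-b"),
])

print(rank_by_output(source_a))
print(rank_by_output(source_b))
```

The ranking function is identical in both calls, yet the order of the two institutions reverses between the sources. Keeping the method separate from the data makes that sensitivity visible, and lets a source be replaced without rewriting the methodology.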
The decoupling of data from ranking also has another effect: it discourages investment in the data quality of a single system, and focuses effort instead on either improving data at the source (for instance Crossref) or improving independent disambiguation algorithms (such as those offered by the Research Organization Registry).
To develop an independent rankings infrastructure will require not only agreement to use the persistent identification infrastructure that we have, but also a commitment to develop systems that refer to external classification systems.
Can we go further? Building on a commitment to independent rankings infrastructure, for instance, is it reasonable to expect a common query language for scientometric research and analysis across scientometric systems?
The beginning of a conversation…
Finally, from the exploration above, we hope that we have made the case that our responsibilities as metadata consumers go beyond simple considerations of licence or platform. With the current state of the art in research infrastructure, our experiences of how to facilitate open data are not embedded in metadata and do not travel with it. How we use metadata places unclear expectations on others, and affects perceptions of trust in our analysis or in the research information system more generally. As the Barcelona Declaration moves from declaration to implementation, perhaps even blending with evolving national persistent identifier strategies, we hope that these considerations form part of the continuing conversation.