September 30, 2015
I am pleased to present a guest post written by Martin Donnelly from the Digital Curation Centre, University of Edinburgh*.
As part of a shared commitment to transparency and public engagement in the research endeavour, the European Commission has recently begun an open data pilot as part of the Horizon 2020 (H2020) framework programme. This covers around 20% of the entire programme, and includes the ‘Innovation Actions’ and ‘Research and Innovation Actions’. Therefore, much of the H2020 work carried out in universities and other research institutions will be covered by the pilot, and it’s important to understand not only the specific requirements, but also a little about the context in which this change has been brought about. The following is an attempt to explain both of these things in terms familiar to scholars from the social sciences, and related disciplines.
Open culture and its effect on research
Open data and Open Access to publications (OA) combine towards something which is often called “Open Science”, or Open Research to take a less science-centric view. Open Science is situated within a context of ever-greater transparency, accessibility and accountability, wherein stakeholders in the research process – from fellow researchers and funders to interested members of the general public – increasingly expect to be able to access and reuse the outputs of taxpayer-funded research with as few barriers as possible (whether delays in access or costs of access) in their way. The impetus for openness in research comes from two directions: ground-up and top-down. From the grassroots, OA first emerged from the High Energy Physics scholarly research community, who saw benefit in not waiting for traditional publication before sharing research findings (and, subsequently, data and software code). From the other direction, governments and other funders see openness as a catalyst for increasing public and commercial engagement with research, bringing about both societal and commercial benefit.
The main goals of these developments are to lower barriers to accessing the outputs of publicly funded research (or ‘science’ for short), to speed up the research process, and to strengthen the quality, integrity, impact and longevity of the scholarly record, as well as providing better return on investment.
Research Data Management is “the active management and appraisal of data over the lifecycle of scholarly and scientific interest”. Constituent activities include:
- Planning and describing data-related work before it takes place
- Documenting your data so that others can find and understand it
- Storing it safely during the project
- Depositing it in a trusted archive at the end of the project
- Linking publications to the datasets that underpin them
RDM is increasingly considered to be a part of good research practice, rather than a “nice-to-have” optional extra, and the impetus for this comes from both the top down (i.e. the introduction of funder and institutional data policies and mandates) and the bottom up (e.g. disciplinary norms which are dictated by fellow scholars). In fact, Nicola Janz, a political scientist at the University of Cambridge, has a recent blog post in which she questions whether withholding data unnecessarily is just bad practice, or strays into the realm of scientific misconduct.
What we talk about when we talk about data…
Most IRs [institutional repositories] have grown out of initiatives to archive and share digital texts, and they have limited experience dealing with quantitative or qualitative data. Social science data, in particular, such as surveys and vital statistics stored in statistical analysis software, pose special problems for selection, preservation, and documentation.
– Jared Lyle, George Alter, Ann Green, in Ray (2014) Research Data Management: Practical Strategies for Information Professionals
Definitions of data vary from domain to domain, and much of the useful literature around research data management takes a very science-centric, fact-oriented approach, with a focus on reproducibility. This is not always possible in the humanities and social sciences, and such disciplines bring their own particular problems to the table. Social science is a particularly tricky subject area when it comes to data. It involves complex, if not always very large, datasets, which often relate to living people, who have rights protected under law (e.g. the Data Protection Act in the UK). Furthermore, social science data can take many different forms (from objective, fact-oriented survey data, to more qualitative interview transcripts, to often very subjective and sensitive ethnographic field notes), which adds further complexity to its management in a single archive.
That said, data also has considerable strengths in the social sciences. Qualitative and quantitative data is well understood in the social sciences, and there’s nothing new about data re-use; it’s an integral part of the culture, and always has been. However, it’s often more fraught than scientific data re-use in other areas. For starters, scholars in some social science disciplines don’t always think of their sources or influences as ‘data’, value and referencing systems may be quite different, and research/production methods are not always rigorously methodical or linear.
These issues are recorded and given life by Asher and Jahnke (2013), who describe the feelings of ethnographers towards their ‘data’.
That’s the thing with fieldnotes, you never show them to anyone… as long as it works for you… it’s not like you have to show it to someone else and have them make sense of it, which is kind of a shame because it would be nice to have data in formats where people could, you know, sort of, archive that information. (2-16-120211)
In his study of fieldnote practices, Jean Jackson observes, “Many respondents point out that the highly personal nature of fieldnotes influences the extent of one’s willingness to share them: ‘Fieldnotes can reveal how worthless your work was, the lacunae, your linguistic incompetence, your not being made a blood brother, your childish temper.’” Anthropologist Simon Ottenberg describes his fieldnotes similarly, “[W]hen I was younger, I would have felt uncomfortable at the thought of someone else using my notes, whether I was alive or dead—they are so much a private thing, so much an aspect of personal field experience, so much a private language, so much part of my ego, my childhood [as an anthropologist], and my personal maturity.” This attachment and ambivalence may make researchers reluctant to turn over control of their notes to an archives—especially one that has the purpose of making materials available publically on the internet. This professional practice not only makes it very difficult to substantiate or verify ethnographers’ data, suggesting a need for ethnographers to develop more proactive and sensitive data-sharing procedures, but also creates a situation where ethnographic materials are saved, but with inadequate plans for preservation. Zeitlyn notes, “Paradoxically most anthropologists want neither to destroy their field material nor archive it.”
– Andrew Asher and Lori M. Jahnke (2013) “Curating the Ethnographic Moment” Archive Journal, Issue 3, [link]
So the sharing and reuse of social science data may be particularly problematic, but the good news is that the social sciences are already very well served by data archives, not just in terms of a physical place to store data, but in terms of experienced staff with skills in anonymising and aggregating sensitive datasets in order to comply with consent agreements and legal requirements. Indeed, many of the very first dedicated data archives in the world served the social sciences. This was out of necessity, as social science data tends not to be reproducible – you can’t rerun a census, and if the data is lost then it’s gone forever. Humanities and social sciences data is in many cases unique and non-replicable. It may also be sensitive, and need different levels of protection from the wrong eyes, or inappropriate re-users. Data from surveys or medical trials will relate to living human beings, who have rights which are protected by law. Breaching these rights is likely to be more serious than breaching commercial confidentiality agreements: the potential penalty for the latter will only be financial, whereas for the former it could lead to criminal charges.
In his concluding chapter to Ray (2014), CNI Director Clifford Lynch outlines the issue of sensitive data, and describes potential strategies for dealing with it:
One cornerstone concept in protecting human subjects is informed consent; this includes ensuring that potential subjects understand what data is being collected about them, how long it will be retained, who gets to use it, and an understanding of the specific uses to which it will be put (including the risks of those uses). Even if the potential subjects were willing to sign very general release forms that would facilitate sharing and reuse of data, the use of such consent forms would likely be rejected by the local IRB; at best, some specific and constrained kinds of data reuse, such as a meta-analysis, might be included in an acceptable consent agreement.
Another very problematic area here is the anonymization of data involving human subjects. For some kinds of reuse, an anonymized version of a data collection, which breaks the links between data and the individuals that provided it, is sufficient (though, of course, many other reuse scenarios will require the full data). But researchers in many fields and many contexts, from genomics to information science (query logs), have discovered that it is incredibly difficult to irrevocably anonymize data, particularly if data from multiple sources are merged together. So now we see researchers who want to reuse data being asked to certify that they will not attempt to deanonymize it; even more problematically, there may be some attempt to “qualify” the potential reusers and reuses as “legitimate” in some fashion, which quickly runs contrary to the goals of promoting broad and creative reuses, and engaging industry and the broad general public, not just the research community, in the reuse of data (and particularly data produced with public funding).
– Clifford Lynch, in Ray (2014)
One final issue to note about data in the social sciences is the position they occupy as an interesting crossover point between higher education and the government / public sector. The notion of reusing data that you did not yourself create or collect is entirely natural to social scientists, in a way that it may not yet be to other types of researchers. In the UK, considerable investment has been made in, for example, the Administrative Data Research Network, enabling government departments to produce data in such a way that it lends itself more readily to research purposes.
So each academic subject area or domain has its own particular challenges, and strengths and weaknesses, when it comes to producing, managing and reusing data, but the benefits of both Open Access and open data are widely accepted. In June 2013, the G8 science ministers issued a joint statement addressing, among other things, the need for an increasingly global research infrastructure, open scientific research data, and increasing access to the peer-reviewed, published results of scientific research. The government of another major economy, namely China, has also begun to make clear its support for Open Access and Open Data.
The H2020 data management pilot – background and specifics
The European Commission Horizon 2020 open data pilot covers data (and metadata) needed to validate scientific results, which should be deposited in a dedicated data repository, whether subject-based/thematic, institutional or centralised. The pilot follows on from, and is influenced by, national policy development in EU member states and other countries, notably the USA, where a data management planning requirement for all grant applications was introduced by the National Science Foundation in 2011. The H2020 pilot takes a different tack to other national policies, however, in that it does not require a DMP to be submitted at the application stage, but instead requires three iterations (versions) to be produced and submitted at intervals during the lifetime of the project (6 months in, midway through, and end-project), covering issues of: data types; standards used; sharing/making available; curation and preservation of data.
So far as possible, projects must then take measures to enable third parties to access, mine, exploit, reproduce and disseminate (free of charge for any user) this research data. The EC suggests attaching a Creative Commons licence (CC-BY or CC0) to the data deposited. At the same time, projects should provide information via the chosen repository about tools and instruments at the disposal of the beneficiaries and necessary for validating the results, for instance specialised software or software code, algorithms, analysis protocols, etc. Where possible, they should provide the tools and instruments themselves.
Finally, as this is a pilot, opt-outs are possible, either total or partial. Projects may opt out of the Pilot at any stage, for a variety of reasons, e.g.
- if participation in the Pilot on Open Research Data is incompatible with the Horizon 2020 obligation to protect results if they can reasonably be expected to be commercially or industrially exploited;
- confidentiality (e.g. security issues, protection of personal data);
- if participation in the Pilot on Open Research Data would jeopardise the achievement of the main aim of the action;
- if the project will not generate / collect any research data;
- if there are other legitimate reasons to not take part in the Pilot (to be declared at proposal stage).
(N.B. A much more detailed description and scope of the Open Research Data Pilot requirements is provided on the Participants’ Portal.)
Conclusion and looking forward
It is important to emphasise that the Horizon 2020 data management pilot is just that, a pilot. There are no right and wrong answers at this stage – indeed, no project will be rejected on the basis of an inadequate data management plan because the Commission does not require the first version until six months into the project! – but rather the policymakers will keep an eye on what is submitted to them, and develop policy accordingly. This follows the pattern established in Open Access publishing by the EC in FP7, and data-related policy developments in the UK (notably the Economic and Social Research Council) and the National Science Foundation in the US. If I were forced at gunpoint to make a prediction, I would suggest two things: one, that data sharing will become mandatory (with appropriate exemptions) in FP9, and two, that social science archives will continue to be very well-positioned to advise researchers in a variety of disciplines on the best ways to manage and share their data appropriately, and in keeping with the new requirements and expectations.
I should stress that researchers and support staff are not being left to deal with this alone. The EU has funded a number of data infrastructure projects, such as EUDAT, and is investing in data management planning assistance as a key part of this work. Open Access Infrastructure for Research in Europe (OpenAIRE) will also become an entry point for linking publications to data.
Technical infrastructure, of course, is only one of three things that need to be in place to facilitate good quality data management, the other two being skills and resources. The EU has been clear that data management costs may be included within H2020 bids, and it funds the FOSTER project to carry out in-person and web-based training and awareness raising, providing a portal to training resources for scientists and other interested parties. FOSTER aims to set in place sustainable mechanisms for EU researchers to integrate Open Science in their daily workflow, supporting researchers to optimise their research visibility and impact and meeting the demands of twenty-first century research.
 – In fact, the term ‘research object’ is gaining in currency, incorporating data (numeric, written, audiovisual…), software code, workflows and methodologies, slides, logs, lab books, sketchbooks, notebooks, etc. – basically anything that underpins or enriches the (written) outputs of research.
 – The EU suggests the Registry of Research Data Repositories (www.re3data.org) for researchers seeking to identify an appropriate repository.
* – Martin Donnelly is Senior Institutional Support Officer at the Digital Curation Centre, University of Edinburgh. His work focuses on data management planning and policy. He wrote a book chapter on data management planning in 2012 (Pryor ed., Managing Research Data, London: Facet), and was the co-author of the DCC’s original “Checklist for a Data Management Plan”. He also conceived and project managed the first three iterations of the DCC’s DMPonline tool. Visit his personal web page at the DCC, and his Google Scholar and Twitter profiles, to learn more.
Image credit: Ainsley Seago. doi:10.1371/journal.pbio.1001779.g001 (cropped) licensed under the terms of CC-BY 4.0