Selecting

One of the first things to do when preparing your data for preservation and sharing is to select the data that you are going to keep.

Decisions on what data to keep are left to the discretion of researchers, taking into account

  • all relevant SHU policies
  • requirements and contractual arrangements from the relevant research funder(s), sponsor(s) and contractual partners
  • guidelines and requirements from the repository or archive where you intend to deposit your data
  • guidance available within the relevant subject domains (good practice within your discipline)

Decisions about what data to keep should ideally be considered at the planning stage, ie when you write your data management plan and obtain ethical approval, and in any case well before the end of your project.

Guidelines for selecting data

According to guidance from the Digital Curation Centre, at least three questions should be considered when deciding which primary research data to keep

1. What is the purpose that the data could fulfil?
Datasets can be defined by the purpose of keeping them

  • verification — the dataset supports research outputs such as journal articles, PhD theses, and patent applications
  • further analysis — the dataset is of long-term value and could be useful to you and other researchers at some point in the future

The minimum requirements for keeping datasets depend on this purpose.

Verification

The preserved dataset(s) should allow full scrutiny of the research output; this should usually consist of the dataset that was used to reach the conclusions in the research output, and any additional data that is required to replicate the reported study findings in their entirety. This is known as the ‘replication standard’.

Further analysis

The preserved dataset(s) would normally consist of the raw primary data that was collected or created, possibly after any noise has been removed, and always under the condition that these data are fully documented in such a way that they are usable by other researchers within relevant subject domain(s).

2. What data must be kept (or destroyed) because of policies and regulations?
Policies and regulations may require you to keep certain data

  • SHU’s Research Data Management Policy. This policy specifically states that all research data that underpin publications and patent applications must be kept, as well as data that can be considered of long-term value
  • SHU’s University Records Retention Schedule. This schedule states that primary research data must be kept for a minimum of 10 years after the end of the research project, and that ethics records such as consent forms must also be kept
  • your research funder’s or other sponsor’s Research Data Management requirements
  • any contractual or legal reasons to keep certain data, eg data that is used in a patent application; data that falls under contractual terms and conditions when working with external partners such as NHS, governmental bodies and SMEs; data that underpins evaluative reports that could be legally challenged
  • any requirements or restrictions imposed by the repository or archive where you intend to deposit the data; these could be a national archive (eg UK Data Archive), a generic data sharing platform (eg Dryad, figshare), a publisher who may want to include the data as supplementary material to your research article, or the SHU Research Data Archive (SHURDA)
  • ethics approval for your research project
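
As a rough illustration of the retention schedule point above, the following Python sketch computes the earliest date on which the 10-year minimum retention period has elapsed. The function name and the handling of a 29 February end date are assumptions made for this example; funder, contractual or legal requirements may call for a longer period, so the result is a lower bound rather than a disposal date.

```python
from datetime import date

# Minimum period named in the retention schedule bullet above.
MIN_RETENTION_YEARS = 10


def earliest_review_date(project_end: date, years: int = MIN_RETENTION_YEARS) -> date:
    """Return the first date on which the minimum retention period has elapsed.

    Hypothetical helper: it only applies the minimum period and does not
    know about funder, contractual or legal requirements.
    """
    try:
        return project_end.replace(year=project_end.year + years)
    except ValueError:
        # project_end is 29 February and the target year is not a leap year;
        # falling back to 28 February is an assumption of this sketch.
        return project_end.replace(year=project_end.year + years, day=28)


if __name__ == "__main__":
    print(earliest_review_date(date(2020, 3, 31)))
```

Data should not be disposed of on this date automatically; the other policies and agreements listed above still need to be checked.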

The General Data Protection Regulation (GDPR) and Freedom of Information Act (FOI) may require you to keep and/or destroy certain data.

General Data Protection Regulation

The GDPR states that non-anonymised personal and sensitive personal data ‘processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes’. This means that non-anonymised data can usually only be kept beyond the duration of the research project if the conditions for the research exemption in the GDPR are met, ie

  • the data must not be used to ‘support measures or decisions with respect to particular individuals’
  • the data must not be ‘processed in such a way that substantial damage or substantial distress is, or is likely to be, caused to any data subject’

In addition, you should be able to show that the data is of long-term academic interest to the researcher or the academic community and that the data will be protected against unauthorised access. The University has published extensive data protection guidance.

Freedom of Information

The University may be required to disclose some research data to third parties via a Freedom of Information (FOI) or Environmental Information Regulations (EIR) request. Once the request has been received, it is a criminal offence to delete the data or datasets that have been requested (a so-called ‘shredding offence’). There may be reasons not to provide the requested data; these could be the same as the constraints on data sharing you may have mentioned in your data management plan. Furthermore, on 1 October 2014 a new exemption for research was added to the FOI Act, which states that research data may be exempt from FOI requests if

  • the data will be used for a future publication
  • and disclosure of the data before the date of publication would, or would be likely to, prejudice
    • the research programme
    • the interests of any individual participating in the programme
    • the interests of the authority which holds the data and/or the authority which expects to publish it

This and other exemptions need to be considered on a case-by-case basis.

The University has published extensive Freedom of Information guidance. The Information Commissioner has published guidance on FOI and research data. Jisc have also published an overview of how FOI relates to research data.

3. What data should be kept because it is of long-term value?
Here is a short checklist that may help you to determine whether your data may be of long-term value

  • Is the data of good enough quality in terms of completeness, sample size, accuracy, validity, reliability or any other criterion relevant in your subject domain?
  • Is the data sufficiently documented to allow re-use by your peers?
  • Is there likely to be a demand for your data?
  • Is it difficult to replicate your data?
  • Are the barriers to re-using your data sufficiently low for the intended or likely audience of your research data? For example, does it require proprietary hardware or software, and if so, how widely used are these in your field of study?

If your data can be considered of long-term value for any of the reasons above, as a general rule your data should be kept.

The following classification of research data, based on the 2008 report Stewardship of digital research data – principles and guidelines from the Research Information Network (archived), may help you to determine your data’s long-term value.

data | long-term value | examples
--- | --- | ---
observational data | these data are captured in real-time and usually cannot be reproduced; they are primary candidates for archiving | observations of ocean temperature on a specific date, medical scans and images, SEM images, interviews, and surveys
experimental data | these data are captured from laboratory equipment and are usually reproducible, but reproduction may be costly or too complex because of all the experimental variables | gene sequences, chromatograms, microassays
computational or simulation data | these data are generated by computational or simulation models; when complete information about the computer model and its execution (eg hardware, software, input data) is preserved, the output can in theory be reproduced; the model and its associated metadata may be more important than the output from the model | climate, mathematical and economic models
derived or compiled data | these data result from processing or combining ‘raw’ data and are often reproducible, but reproduction may be costly | text and data mining, compiled databases

Case studies

1. Questionnaires

A researcher collects information via paper questionnaires with both open-ended and closed questions. Informed consent is captured on paper forms. The information in the questionnaires is digitally recorded in an Excel spreadsheet and the quantitative data is analysed in SPSS.

Paper consent forms
The individual researcher is responsible for keeping the paper consent forms, as stated in the University’s records retention schedule. These consent forms do not need to be shared.

Paper questionnaires
The paper questionnaires contain the raw primary data. If these data are digitally recorded, for example in an Excel spreadsheet including transcriptions of written answers to open-ended questions, it may not be necessary to keep the paper questionnaires, in which case they may be shredded. Otherwise it is essential to keep the paper questionnaires.

All paper data — consent forms and questionnaires — can be deposited in their paper form in the SHU Research Data Archive (SHURDA).

Digital processed data
The Excel spreadsheet in which the answers are recorded and analysed should be kept and documented. A PDF of the original questionnaire should also be retained.

Further documentation
Data kept for the long term also need sufficient documentation. There are two levels of documentation

  • data-level documentation, which covers descriptions and annotations at the file and within-file level,
  • study-level documentation, which describes the research project, the data creation process, and the general context.

Study-level documentation could be provided in a separate file outlining the research context and introducing the constituent parts of your dataset, but often you will have given sufficient study-level documentation in any research outputs that are based on these data, such as publications and final reports to funders. You can include these in your dataset, or refer to them if they are deposited in SHURA or otherwise publicly available.
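
To make data-level documentation more concrete, the sketch below builds a simple data dictionary for a table exported as CSV: one entry per column, with a description supplied by the researcher, a count of non-empty values and an example value. The function name, the column names and the "TODO" placeholder are hypothetical and chosen only for this illustration; it is one possible starting point, not a required format.

```python
import csv
import io


def data_dictionary(csv_text: str, descriptions: dict[str, str]) -> list[dict[str, str]]:
    """Build one documentation entry per column of a CSV table.

    `descriptions` maps column names to human-readable explanations;
    columns without an explanation are flagged so they are not forgotten.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for column in reader.fieldnames or []:
        # Ignore empty cells and cells missing from short rows.
        values = [row[column] for row in rows if row[column] not in ("", None)]
        entries.append({
            "variable": column,
            "description": descriptions.get(column, "TODO: describe this variable"),
            "non_empty_values": str(len(values)),
            "example": values[0] if values else "",
        })
    return entries


if __name__ == "__main__":
    # Made-up questionnaire data, for illustration only.
    sample = "id,age,q1\n1,34,yes\n2,,no\n"
    notes = {"id": "Participant number", "q1": "Answer to question 1 (yes/no)"}
    for entry in data_dictionary(sample, notes):
        print(entry)
```

The resulting entries could be saved alongside the spreadsheet, for example as a separate documentation file within the dataset.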

2. Interviews

A researcher interviews a number of participants who have given consent for their interviews to be audio recorded, transcribed, and their data to be shared once anonymised. The audio recordings are transcribed, and analysed using NVivo.

Paper consent forms
As above.

Audio recordings and transcriptions
The audio recordings may contain valuable information that cannot be fully captured in transcription but may be considered useful for future analysis. These files may be kept, but because they may be difficult to anonymise (just as video files would be difficult to anonymise) it may not be possible to share the audio recordings with others. It should be feasible, however, to anonymise the transcriptions, and share these as the primary data emanating from this research project.
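
As a minimal sketch of one step in preparing transcriptions for sharing, the Python function below replaces direct identifiers with placeholders. The replacement list is supplied by the researcher, and the names used here are invented. This only handles exact, whole-word matches of the listed terms; it does not detect identifiers by itself and does not make a transcript anonymous, so indirect identifiers (job titles, places, unusual events) still need manual review.

```python
import re


def pseudonymise(text: str, replacements: dict[str, str]) -> str:
    """Replace each listed identifier in `text` with its placeholder.

    Hypothetical helper: matching is case-sensitive and whole-word only.
    """
    if not replacements:
        return text
    # Try longer terms first so "Ann Smith" is matched before "Ann".
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], text)


if __name__ == "__main__":
    # Invented names, for illustration only.
    key = {
        "Ann Smith": "[Participant 1]",
        "Ann": "[Participant 1]",
        "Bob": "[Participant 2]",
    }
    print(pseudonymise("Ann Smith said Ann met Bob.", key))
```

The replacement key itself identifies participants, so it would need to be stored securely and separately from any shared transcriptions.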

Analysis in NVivo
The analysis in NVivo can be fully documented and saved. MANTRA, the University of Edinburgh’s online Research Data Management Training course, includes an excellent tutorial on data handling in NVivo.

Further documentation
As above.

3. Laboratory measurements

A researcher produces experimental data by taking measurements with laboratory equipment located in the basement of the Harmer building. These experiments are documented in lab notebooks and the measurements are taken with proprietary software and saved as CSV files. These raw data are then entered into Excel spreadsheets, where any noise in the data is removed. Analysis of the data, resulting in graphs where the measurements are plotted against time and compared to calibration data from previous experiments, usually takes place in Excel as well.

Lab notebooks
This research produces a number of datasets which may need to be kept. It may be that only part of the paper lab notebooks used in the experiments is relevant for this particular project. These pages can be digitised and added to the digital dataset as necessary study-level documentation. If whole lab notebooks are relevant to the research project, then these may be kept in their non-digital form and deposited in the SHU Research Data Archive. When these analogue notebooks are deposited, they should be referred to in the digital dataset that is deposited, for example by using a persistent URL or DOI.

CSV and Excel files
Depending on common practice in the discipline and the judgment of the researcher in question, the raw data, the processed data, or both may need to be kept. In this case study, the measurements captured in CSV files directly from the laboratory equipment constitute the raw data. The Excel files in which any noise is removed are the processed data. The “replication standard” would require the raw data to be made available, together with a clear description of how the data was processed to arrive at the results in the research paper, so that peers are able to replicate the results.
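
One way to provide such a description is to script the processing step so that it records what was done. The Python sketch below is an illustration under stated assumptions, not the method used in this case study: it assumes the raw CSV has columns named time_s and value, and it uses a rolling median as a stand-in for whatever noise removal is appropriate in the discipline. It returns the processed measurements together with a small processing log, including a checksum of the raw input.

```python
import csv
import hashlib
import io
import statistics


def rolling_median(values: list[float], window: int = 3) -> list[float]:
    """Replace each value by the median of its neighbourhood.

    Near the edges the neighbourhood is shortened rather than padded.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(statistics.median(values[lo:hi]))
    return smoothed


def process(raw_csv: str, window: int = 3):
    """Return (processed rows, processing log) for a raw CSV export.

    The column names "time_s" and "value" are assumptions of this sketch.
    """
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    times = [float(row["time_s"]) for row in rows]
    raw_values = [float(row["value"]) for row in rows]
    smoothed = rolling_median(raw_values, window)
    log = {
        "step": "rolling median",
        "window": window,
        "rows_in": len(raw_values),
        "rows_out": len(smoothed),
        # Ties the log to the exact raw file content it was produced from.
        "raw_sha256": hashlib.sha256(raw_csv.encode("utf-8")).hexdigest(),
    }
    return list(zip(times, smoothed)), log


if __name__ == "__main__":
    # Made-up measurements with one obvious spike, for illustration only.
    raw = "time_s,value\n0,1.0\n1,1.1\n2,9.0\n3,1.2\n4,1.3\n"
    processed, log = process(raw)
    print(processed)
    print(log)
```

Keeping the unmodified raw CSV, the script and the log together in the deposited dataset would let peers repeat the processing from the raw data.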