One of the first things to do when preparing your data for preservation and sharing is to select the data that you are going to keep.
Decisions on what data to keep are left to the discretion of researchers, taking into account
Decisions about what data to keep should ideally be considered at the planning stage, i.e. when you write your data management plan and obtain ethical approval, and in any case well before the end of your project.
Guidelines for selecting data
According to guidance from the Digital Curation Centre, at least three considerations should be made when determining what primary research data to keep
The minimum requirements for keeping datasets depend on this purpose.
Verification
The preserved dataset(s) should allow full scrutiny of the research output; this should usually consist of the dataset that was used to reach the conclusions in the research output, and any additional data that is required to replicate the reported study findings in their entirety. This is known as the ‘replication standard’.
Further analysis
The preserved dataset(s) would normally consist of the raw primary data that was collected or created, possibly after any noise has been removed, and always under the condition that these data are fully documented in such a way that they are usable by other researchers within relevant subject domain(s).
The General Data Protection Regulation (GDPR) and Freedom of Information Act (FOI) may require you to keep and/or destroy certain data.
General Data Protection Regulation
The GDPR states that non-anonymised personal and sensitive personal data ‘processed for any purpose or purposes shall not be kept for longer than is necessary for that purpose or those purposes’. This means that non-anonymised data can usually only be kept beyond the duration of the research project if the conditions for the research exemption in the GDPR are met, i.e.:
In addition it can be shown that the data is of long-term academic interest to the researcher or the academic community and that the data will be protected against unauthorised access. The University has published extensive data protection guidance.
Freedom of Information
The University may be required to disclose some research data to third parties via a Freedom of Information (FOI) or Environmental Information Regulations (EIR) request. Once the request has been received, it is a criminal offence to delete the data or datasets that have been requested (a so-called ‘shredding offence’). There may be reasons not to provide the requested data; these could be the same as the constraints on data sharing you may have mentioned in your data management plan. Furthermore, on 1 October 2014 a new exemption for research was added to the FOI Act, which states that research data may be exempt from FOI requests if
This and other exemptions need to be considered on a case-by-case basis.
The University has published extensive Freedom of Information guidance. The Information Commissioner has published guidance on FOI and research data. Jisc has also published an overview of how FOI relates to research data.
If your data can be considered of long-term value for any of the reasons above, as a general rule your data should be kept.
The following classification of research data, based on the 2008 report Stewardship of digital research data – principles and guidelines from the Research Information Network (archived), may help you determine your data’s long-term value.
data | long-term value | examples |
observational data | these data are captured in real time and usually cannot be reproduced; they are primary candidates for archiving | observations of ocean temperature on a specific date, medical scans and images, SEM images, interviews, and surveys |
experimental data | these data are captured from laboratory equipment and are usually reproducible, but reproduction may be costly or too complex because of the number of experimental variables | gene sequences, chromatograms, microassays |
computational or simulation data | these data are generated by computational or simulation models; when complete information about the computer model and its execution (e.g. hardware, software, input data) is preserved, the output can in theory be reproduced; the model and its associated metadata may be more important than the output from the model | climate, mathematical and economic models |
derived or compiled data | these data result from processing or combining ‘raw’ data and are often reproducible, but reproduction may be costly | text and data mining, compiled databases |
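For computational or simulation data, the table above notes that the output can in theory be reproduced when information about the model's execution (hardware, software, input data) is preserved. The following is a minimal sketch of how such information might be recorded alongside a model run. The function name `describe_run`, the choice of fields, and the example parameters are assumptions for illustration only, not a prescribed metadata standard.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def describe_run(input_bytes: bytes, parameters: dict) -> dict:
    """Collect basic provenance for one model run (illustrative fields only)."""
    return {
        # When the record was made, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        # Software and hardware context of the run.
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        # A checksum lets others confirm they hold the same input data.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        # Hypothetical model settings supplied by the researcher.
        "parameters": parameters,
    }


if __name__ == "__main__":
    record = describe_run(b"example input data", {"time_step": 0.1, "seed": 42})
    print(json.dumps(record, indent=2, sort_keys=True))
```

A record like this could be saved as a small JSON file next to the model code and output. It does not replace a description of the model itself, and a real project would probably also list the versions of any libraries used.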
1. Questionnaires
A researcher collects information via paper questionnaires with both open ended and closed questions. Informed consent is captured on paper forms. The information in the questionnaires is digitally recorded in an Excel spreadsheet and the quantitative data is analysed in SPSS.
Paper consent forms
The paper consent forms should be kept; this is the responsibility of the individual researcher, as stated in the University’s records retention schedule. These consent forms do not need to be shared.
Paper questionnaires
The paper questionnaires contain the raw primary data. If these data are digitally recorded, for example in an Excel spreadsheet including transcriptions of written answers to open-ended questions, it may not be necessary to keep the paper questionnaires, in which case they may be shredded. Otherwise it is essential to keep the paper questionnaires.
All paper data — consent forms and questionnaires — can be deposited in their paper form in the SHU Research Data Archive (SHURDA).
Digital processed data
The Excel spreadsheet in which the answers are recorded and analysed should be kept and documented. A PDF of the original questionnaire should also be retained.
Further documentation
Data kept for the long term also need sufficient documentation. There are two levels of documentation
Study-level documentation could be provided in a separate file outlining the research context and introducing the constituent parts of your dataset, but often you will have given sufficient study-level documentation in any research outputs that are based on these data, such as publications and final reports to funders. You can include these in your dataset, or refer to them if they are deposited in SHURA or otherwise publicly available.
2. Interviews
A researcher interviews a number of participants who have given consent for their interviews to be audio recorded, transcribed, and their data to be shared once anonymised. The audio recordings are transcribed, and analysed using NVivo.
Paper consent forms
As above.
Audio recordings and transcriptions
The audio recordings may contain valuable information that cannot be fully captured in transcription but may be considered useful for future analysis. These files may be kept, but because they may be difficult to anonymise (just as video files would be difficult to anonymise) it may not be possible to share the audio recordings with others. It should be feasible, however, to anonymise the transcriptions, and share these as the primary data emanating from this research project.
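As a rough illustration of one step in anonymising a transcription, the sketch below replaces a list of known identifiers (names, places) with placeholders. The function name `pseudonymise` and the example names are hypothetical. Replacing known names is only a starting point: indirect identifiers such as job titles or unusual events still need to be reviewed by hand before a transcript can be considered anonymised.

```python
import re


def pseudonymise(text: str, replacements: dict[str, str]) -> str:
    """Replace each known identifier with its placeholder, matching whole words only."""
    # Handle the longest identifiers first so "Anna Smith" is replaced before "Anna".
    for original in sorted(replacements, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(original) + r"\b", flags=re.IGNORECASE)
        placeholder = replacements[original]
        # A function is used as the replacement so the placeholder is inserted literally.
        text = pattern.sub(lambda _match: placeholder, text)
    return text


if __name__ == "__main__":
    # Hypothetical identifiers; in practice the list is built while reading the transcript.
    mapping = {"Anna Smith": "[Participant 1]", "Anna": "[Participant 1]", "Sheffield": "[City]"}
    print(pseudonymise("Anna Smith said she moved to Sheffield. Anna liked it.", mapping))
```

The mapping from real names to placeholders is itself personal data and would need to be stored securely, separately from the anonymised transcripts that are shared.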
Analysis in NVivo
The analysis in NVivo can be fully documented and saved. The University of Edinburgh’s online learning course MANTRA (Research Data Management Training) has an excellent data handling tutorial for NVivo.
Further documentation
As above.
3. Laboratory measurements
A researcher produces experimental data by taking measurements with laboratory equipment located in the basement of the Harmer building. These experiments are documented in lab notebooks and the measurements are taken with proprietary software and saved as CSV files. These raw data are then entered into Excel spreadsheets, where any noise in the data is removed. Analysis of the data, resulting in graphs where the measurements are plotted against time and compared to calibration data from previous experiments, usually takes place in Excel as well.
Lab notebooks
This research produces a number of datasets which may need to be kept. It may be that only part of the paper lab notebooks used in the experiments is relevant for this particular project. These pages can be digitised and added to the digital dataset as necessary study-level documentation. If whole lab notebooks are relevant to the research project, then these may be kept in their non-digital form and deposited in the SHU Research Data Archive. When these analogue notebooks are deposited, they should be referred to in the digital dataset that is deposited, for example by using a persistent URL or DOI.
CSV and Excel files
Depending on common practice in the discipline and the judgment of the researcher in question, the raw data, the processed data, or both may need to be kept. In this case study, the measurements captured in CSV files directly from the laboratory equipment constitute the raw data. The Excel files in which any noise is removed are the processed data. The “replication standard” would require the raw data to be made available, together with a description of how these data were processed to arrive at the results in the research paper that is clear enough for peers to replicate those results.
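One way to provide such a description is to carry out the processing in a short script that can be deposited with the raw CSV files. The sketch below is illustrative only. The column names `time_s` and `value` are hypothetical, and the moving-median filter is an assumed stand-in for whatever noise removal is actually applied in Excel, which the case study does not specify.

```python
import csv
import io
import statistics


def moving_median(values: list[float], window: int = 3) -> list[float]:
    """Smooth a series with a centred moving median; the window shrinks at the edges."""
    half = window // 2
    return [
        statistics.median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def process(raw_csv: str, window: int = 3) -> tuple[str, str]:
    """Return the processed CSV text and a plain-text description of the processing."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    smoothed = moving_median([float(row["value"]) for row in rows], window)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # The raw value is kept next to the smoothed one so nothing is overwritten.
    writer.writerow(["time_s", "value", "value_smoothed"])
    for row, smooth in zip(rows, smoothed):
        writer.writerow([row["time_s"], row["value"], smooth])

    log = (
        f"Input rows: {len(rows)}\n"
        f"Noise removal: centred moving median, window={window}\n"
        "Raw values retained in column 'value'; smoothed values in 'value_smoothed'.\n"
    )
    return out.getvalue(), log


if __name__ == "__main__":
    raw = "time_s,value\n0,1.0\n1,100.0\n2,3.0\n3,4.0\n"
    processed, log = process(raw)
    print(processed)
    print(log)
```

Depositing the raw files, the script, and the generated log together lets a peer rerun the same steps rather than rely on a prose description of manual spreadsheet edits.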