Documenting Data

It is good practice to start documenting your data at the very beginning of your research project, and to keep it up as the project progresses.

Good documentation makes your data discoverable, understandable and usable by yourself and others. It includes all contextual information that a future user may need to interpret your data, for example information about when, why, and by whom the data was created, what methods were used to collect the data, and any explanations of acronyms, coding, or jargon.

There are three ways you can add documentation to your data

  • data-level (embedded) documentation, which covers descriptions and annotations that are embedded in a data file
  • study-level (supporting) documentation, which describes the research project, the data creation process, and the general context, and is usually not embedded in a data file
  • catalogue metadata for discovery and machine-readable description

Data-level documentation
Data-level documentation describes the data that is contained within a file. This information can be integrated into the data file, for example as a header in a spreadsheet. It can also be recorded in a separate document, often a .txt file.

According to the UK Data Archive, documentation at the data level includes

  • names, labels and descriptions for variables, records and their values
  • explanation of codes and classification schemes used
  • codes of, and reasons for, missing values
  • derived data created after collection, with code, algorithm or command file used to create them
  • weighting and grossing variables created and how they should be used
  • data list describing cases, individuals or items studied, for example for logging qualitative interviews
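
As an illustration of the items listed above, the following Python sketch renders a small plain-text codebook that could accompany a data file. The variable names, labels, codes and missing-value reasons are all invented for this example; they do not come from the UK Data Archive guidance or from any real dataset.

```python
# Hypothetical variables for an illustrative survey dataset.
CODEBOOK = [
    {
        "name": "age_group",
        "label": "Age group of respondent",
        "values": {1: "18-34", 2: "35-54", 3: "55 and over"},
        # Codes of, and reasons for, missing values.
        "missing": {-9: "Refused", -8: "Not asked"},
    },
    {
        "name": "weight",
        "label": "Survey weight; apply when estimating population totals",
        "values": {},
        "missing": {},
    },
]


def render_codebook(variables):
    """Return a plain-text codebook with one block per variable."""
    lines = []
    for var in variables:
        lines.append(f"{var['name']}: {var['label']}")
        for code, meaning in var["values"].items():
            lines.append(f"  {code} = {meaning}")
        for code, reason in var["missing"].items():
            lines.append(f"  {code} = {reason} (missing)")
        lines.append("")  # blank line between variables
    return "\n".join(lines)


print(render_codebook(CODEBOOK))
```

The resulting text could be saved as a separate .txt file next to the data, or pasted into a documentation sheet of a spreadsheet. Keeping the codebook as structured data first makes it easier to update when variables change.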

Study-level documentation
Study-level documentation is usually contained in a separate file that accompanies your data. It provides context to your research project.

Put another way, study-level documentation is all the information that someone else would need in order to reuse your data.

It can also be seen as the information needed to support reproducibility. In experimental research, this is the information needed to re-run the experiment so that the results can be confirmed. In observational research, it is the information required to derive the final results from your raw data, or to collect new data that can legitimately be compared with the original.

According to the UK Data Archive, good study-level data documentation includes information on

  • the context of data collection: project history, aims, objectives and hypotheses
  • data collection methods: data collection protocols, sampling design, instruments used, hardware and software used, data scale and resolution, temporal coverage and geographic coverage, and digitisation or transcription methods
  • structure of data files, number of cases, records, variables and relationships between files
  • data sources used and provenance of materials, for example for transcribed or derived data
  • data validation, checking, proofing, cleaning and other quality assurance procedures carried out, such as checking for equipment and transcription errors, calibration procedures, data capture resolution and repetitions, or editing, proofing or quality control of materials
  • modifications made to data over time since their original creation and identification of different versions of datasets
  • for time series or longitudinal surveys, changes made to methodology, variable content, question text, variable labelling, measurements or sampling
  • information on data confidentiality, access and use conditions, where applicable

Often you will have already given sufficient study-level documentation in the form of laboratory notebooks, questionnaires, interview guides and protocols, working papers, final project reports, and publications. You can include these in your dataset, or refer to them if they are deposited in SHURA or otherwise made publicly available.

Catalogue metadata
Metadata simply means ‘data about data’. It can refer to all the contextual information that describes your data, as described above, but the term is often used more specifically for highly structured, machine-readable information that conforms to certain international standards, typically in the form of a fixed list of fields.

The UK Data Archive has an excellent overview of catalogue metadata.

Catalogue metadata are usually assigned to your datasets by a repository or data archive at the moment when you deposit your materials with them. Examples of catalogue metadata are

  • creator
  • title
  • description
  • keywords

These catalogue metadata include all the information that researchers who re-use your data need in order to cite your dataset appropriately.
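
To show what ‘machine readable’ can mean in practice, here is a minimal Python sketch of a catalogue record that uses the four example fields above and serialises them as JSON. The record values, the `REQUIRED_FIELDS` tuple and the `missing_fields` helper are assumptions made for this illustration; a real repository defines its own schema and usually fills in these fields through its deposit form.

```python
import json

# Field names taken from the example list above; treating them all
# as required is an assumption made for this sketch.
REQUIRED_FIELDS = ("creator", "title", "description", "keywords")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# All values below are invented for illustration.
record = {
    "creator": "Example, A.",
    "title": "Interview transcripts on commuting habits",
    "description": "Anonymised transcripts of 20 semi-structured interviews.",
    "keywords": ["commuting", "transport", "interviews"],
}

# Check completeness, then serialise the record in a machine-readable form.
print(missing_fields(record))
print(json.dumps(record, indent=2))
```

Because the record is a fixed set of named fields, software can check it for completeness and exchange it between systems, which is what distinguishes catalogue metadata from free-text documentation.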

Repositories may have different metadata requirements. For example, the SHU Research Data Archive (SHURDA) adds several metadata fields to the catalogue metadata, which include

  • keywords
  • collection period
  • geographic coverage
  • data collection method
  • data processing and preparation activities (which covers how the data was processed after it was collected)
  • resource language
  • additional information

More information
Guidance and further reading

  • The UK Data Archive has an excellent overview of documenting your data.
  • The University of Cambridge’s research data management Web pages offer information on Documentation and Metadata.

Online training