Research Data and Open Research Data

Research Data 

Research data is any information that has been collected, observed, generated or created to validate original research findings. Sharing your research data allows others to verify the published results and makes it possible to reuse the data in future research. Although usually digital, research data also includes non-digital formats such as laboratory notebooks, diaries, ice-core samples, and sketchbooks, which are often unique. You should assess the long-term value of any non-digital data and plan how you will describe and retain them.

Open Research Data 

Open Research Data is data that can be freely used, reused and redistributed by anyone. Open access to research data is part of the broader Open Access idea, whose main goals are to lower barriers to accessing the results of scientific research, to speed up the research process and to improve the quality of the scholarly record. To make your research data open, deposit them in an open repository (national or international) and publish them under an open licence, e.g. Creative Commons.

Sources of research data

  • raw data is collected data that has not been processed for use. 
  • observational data is captured in real-time, and is usually irreplaceable, for example sensor data, survey data, sample data, and neuro-images. 
  • experimental data is captured from lab equipment. It is often reproducible, but this can be expensive. Examples of experimental data are gene sequences, chromatograms, and toroid magnetic field data. 
  • simulation data is generated from test models where the model and metadata are more important than the output data, for example climate models and economic models.
  • derived or compiled data has been transformed from pre-existing data points. It is reproducible if lost, but this would be expensive. Examples are data mining, compiled databases, and 3D models. 
  • reference or canonical data is a static or organic conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals. 

Types of research data

  • diaries
  • laboratory notebook, field notebooks, notes from experiments 
  • laboratory reports, methodology descriptions 
  • text files, spreadsheets
  • surveys, interview questionnaires
  • test responses 
  • photographs, slides 
  • presentations 
  • audiotapes and videotapes 
  • artefacts, specimens, samples 
  • data files 
  • standard operating procedures, protocols
  • mathematical models, algorithms 
  • software 
  • computer simulation results 
  • transcripts, codebooks 

File formats

Your file format influences your ability to open a file at a later date. Proprietary file formats require the proper version of the proprietary software. Non-proprietary, or open, formats are more interoperable and thus more durable. Saving your data in open, unencrypted and uncompressed formats will make your data more usable for years to come. If you can't save your data in an open format, include the software name, version, and vendor in the accompanying readme.txt file for future users.

Recommended formats

Text
  • Plain text (.txt)
  • Portable Document Format (.pdf)
  • LaTeX documents (.tex)
  • Hypertext Markup Language (.html)
  • Open Document Format (.odt)
  • Extensible Markup Language (.xml)

Tables, spreadsheets, and databases
  • Tab-separated tables (.txt, sometimes .tsv or .tab)
  • Comma-separated tables (.csv or .txt)
  • Other standard delimiter (e.g. colon, pipe)
  • Fixed-width
  • OpenDocument Spreadsheet (.ods)
  • OpenDocument Database (.odb)

Image files
  • TIFF (.tiff or .tif)
  • JPEG (.jpg) or JPEG 2000 (.jp2)
  • Portable Network Graphics (.png)
  • Scalable Vector Graphics (.svg)
  • Portable Document Format (.pdf)
  • Graphics Interchange Format (.gif)
  • Microsoft Windows Bitmap Format (.bmp)

Sound files
  • WAVE (.wav)
  • FLAC (.flac)
  • MP3 (.mp3; usually suitable for human voice and moderate-quality audio, but may not be suitable for high-fidelity audio)
  • Audio Interchange File Format (.aiff)

Video files
  • MPEG-4 (.mp4)
  • Material Exchange Format (.mxf)

Databases
  • Extensible Markup Language (.xml)
  • Comma-separated tables (.csv)

Geospatial data
  • Geo-Referenced TIFF (.tiff)
  • ESRI Shapefile (.shp, .shx, .dbf)
  • Keyhole Markup Language (.kml)
  • Network Common Data Format (.nc)

Web data
  • JavaScript Object Notation (.json)
  • Extensible Markup Language (.xml)
  • Hypertext Markup Language (.html)

Web archive
  • WebARChive (.warc)

Multidimensional arrays
  • Common Data Format (.cdf)
  • Network Common Data Format (.nc)
  • Hierarchical Data Format (usually .hdf or .h5)

E-books
  • Electronic Publication (.epub)

Source: File Formats – Research Data Management – Best Practices – Research Guides at Ohio State University
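
As a rough illustration of how the recommended formats could be used in practice, the sketch below sorts file names into those with a recommended extension and those that would need a software note in the readme.txt file. It looks at the extension only, not the file contents, and the function name and the example file names are hypothetical.

```python
from pathlib import Path

# Extensions taken from the recommended formats listed above; extend as needed.
OPEN_EXTENSIONS = {
    ".txt", ".pdf", ".tex", ".html", ".odt", ".xml",
    ".tsv", ".tab", ".csv", ".ods", ".odb",
    ".tiff", ".tif", ".jpg", ".jp2", ".png", ".svg", ".gif", ".bmp",
    ".wav", ".flac", ".mp3", ".aiff",
    ".mp4", ".mxf",
    ".shp", ".shx", ".dbf", ".kml", ".nc",
    ".json", ".warc", ".cdf", ".hdf", ".h5", ".epub",
}


def split_by_format(filenames):
    """Return (recommended, needs_note) lists based on file extension only."""
    recommended, needs_note = [], []
    for name in filenames:
        ext = Path(name).suffix.lower()  # compare case-insensitively
        (recommended if ext in OPEN_EXTENSIONS else needs_note).append(name)
    return recommended, needs_note


# Hypothetical file names, for illustration only.
ok, flagged = split_by_format(["results.csv", "figure1.PNG", "analysis.sav"])
print("Recommended formats:", ok)
print("Document the software in the README for:", flagged)
```

Files in the second list are not necessarily unusable; they are simply candidates for conversion or for a software note in the README.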

FAIR Principles 

Findable

Research data should be:

  • described with rich metadata;
  • assigned a persistent identifier, e.g. a DOI;
  • available in a repository.

Accessible

Research data should be published and made available under open access wherever possible.

Interoperable

Research data should be easy to read, process, and combine with other data, which in practice means using standard formats and vocabularies.

Reusable

Research data should be properly documented and published under an appropriate licence, e.g. a Creative Commons licence.

The FAIR principles are constantly evolving as more and more organisations and institutions become interested in implementing good research data management practices.

More information about the FAIR Principles can be found here:

Release licenses for research data

CREATIVE COMMONS LICENSES FOR RESEARCH DATA

  • CC0 – No Rights Reserved (NCN preferred): transfers the dataset into the public domain, so that users can use it without restrictions and without any obligations.
  • CC BY 4.0 – Attribution (accepted by NCN until 31.12.2025): allows users to copy, modify, distribute and create new works or collections based on the licensed dataset, including for commercial purposes, provided that the authorship of the dataset is indicated.
  • CC BY-NC 4.0 – Attribution-NonCommercial: allows users to copy, modify and distribute the licensed dataset for non-commercial purposes only, provided that the authorship of the dataset is indicated.
  • CC BY-SA 4.0 – Attribution-ShareAlike: allows users to copy, modify and distribute the dataset, provided that they credit the author and share the original and modified data under the same licence.
  • CC BY-NC-SA 4.0 – Attribution-NonCommercial-ShareAlike: allows users to copy, modify and distribute the dataset, provided that both the original and modified data are made available under the same licence and for non-commercial purposes only.
  • CC BY-ND 4.0 – Attribution-NoDerivatives: allows users to re-use the dataset, provided that authorship is indicated, but does not allow modification of the dataset. It is not advisable for licensing research data, as it makes it virtually impossible to continue working on the data.
  • CC BY-NC-ND 4.0 – Attribution-NonCommercial-NoDerivatives: allows users to download and share the dataset, provided that authorship is indicated. The dataset cannot be modified or used commercially. This is the most restrictive of the licences. It is not advisable for licensing research data, as it makes it virtually impossible to continue working on the data.

DATABASE LICENSES

  • PDDL (Public Domain Dedication and License): the public domain for databases; implies the unrestricted possibility to download, share and modify the database.
  • ODC-BY (Open Data Commons Attribution License): allows copying and modification of the database, provided that the authorship of the database is indicated.
  • ODbL (Open Database License, ODC-ODbL): allows copying, modification and distribution of the database, provided that the authorship of the database is indicated and that the modified database is distributed under the same conditions as the original database.

SOFTWARE LICENSES

  • GNU GPL – General Public License: allows the program to be run, analysed, distributed and improved for any purpose. Derivative works, including modified source code, must be made available under this licence.
  • GNU LGPL – Lesser General Public License: allows the program to be run, analysed, distributed and improved for any purpose. Its copyleft restrictions are weaker than those of the GPL: they apply to the licensed code itself and to modified versions of it, while separate works that merely use or link to it may be released under other terms.

Organisation of folders

Proper organisation of research data is extremely important to avoid confusion in project files. The organisation of files must be understandable to the author, the entire research team and any potential individual with access to the data.

When working in a group or preparing to share a dataset, it is essential to use the clearest possible folder structure.

Furthermore:

  • the structure should be agreed and accepted by all researchers;
  • folder names should always be short and unambiguous so that it is immediately clear what data is in the folder;
  • if the folder structure is complex due to the scope of the project, each major collection should be accompanied by a README file characterising the collection;
  • the hierarchy of folders should be coherent and logical (starting from a general folder, moving on to more detailed ones). It should be neither too deep nor too shallow; depending on the size of the project, it can have 3-4 levels;
  • as part of a storage strategy, it may be useful to additionally define ‘temporary folders’ from which data can be safely deleted after use.

Things to avoid:

  • naming a general folder “current stuff”;
  • naming folders after the researcher (folder names should refer to the content, not the authors);
  • creating folders with the same names in different locations;
  • creating copies of files in different folders; if necessary, use shortcuts that point to the reference file.
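
The recommendations above can be sketched as a small script that creates an agreed folder skeleton, refuses hierarchies deeper than four levels, and places a README stub in each top-level folder. The folder names, the `create_structure` function and the four-level limit as a hard rule are assumptions made for illustration; a real project should use the structure agreed by its team.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: general folders first, more specific ones below.
LAYOUT = [
    "data/raw",
    "data/processed",
    "docs",
    "results/figures",
    "tmp",  # 'temporary folder': contents may be deleted after use
]


def create_structure(root, layout, max_depth=4):
    """Create the folders under root and a README stub in each top-level one."""
    root = Path(root)
    for relative in layout:
        depth = len(Path(relative).parts)
        if depth > max_depth:
            raise ValueError(f"{relative} is deeper than {max_depth} levels")
        (root / relative).mkdir(parents=True, exist_ok=True)
    for top in sorted({Path(r).parts[0] for r in layout}):
        readme = root / top / "README.txt"
        if not readme.exists():
            readme.write_text(f"Describe the contents of '{top}' here.\n", encoding="utf-8")
    # Return the created folders as sorted, forward-slash relative paths.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


# A temporary directory is used so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_structure(Path(tmp) / "PROJ", LAYOUT)
    print(folders)
```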

Naming files

File names can provide a lot of information about the content of the files. They should be coherent, logical, descriptive, short and clear. During teamwork, it is necessary to establish a naming convention to avoid mistakes.

What a file name can contain:

  • acronym of the current project or experiment (2-5 letters), so that it is clear what the file refers to;
  • short description of the content (1-3 words);
  • information about the location or coordinates (if needed);
  • date;
  • initials of the person (researcher or entity), or the whole surname and first name, always starting with the surname, e.g. KowalskiJ or Kowalski-Jakub

Useful tips:

  • the elements of description should be organised from the general to the specific;
  • spaces should be avoided; the following alternatives may be used, and mixed, to ensure readability:
    • camelCase: consecutive words are written together, with each word after the first beginning with a capital letter, for example foreColor, setConnection, isPaymentPosted;
    • hyphens (-);
    • underscores (_);
  • when numbering files, always use multiple digits (e.g. 001 instead of 1) to avoid sorting problems;
  • when using dates, always use the ISO 8601 order (year first, then month and day): YYYYMMDD or YYYY-MM-DD, e.g. 20240528 or 2024-05-28. This can be shortened to a year or a year and month, depending on your needs and context;
  • when including the time, use the HHMMSS format (hours, minutes, seconds);
  • never use special characters or diacritics such as ę, !, ?, *, & or #.
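
One possible way to apply these tips consistently is to build file names with a small helper rather than by hand. The sketch below joins the elements from general to specific, formats the date as YYYYMMDD, zero-pads the sequence number and rejects spaces, diacritics and special characters. The `build_filename` function, the element order and the example values (SOIL, pH-readings) are hypothetical; a team would substitute its own convention.

```python
import re
from datetime import date

# Allowed in each element: ASCII letters, digits and hyphens (no spaces or diacritics).
SAFE_PART = re.compile(r"^[A-Za-z0-9-]+$")


def build_filename(project, description, when, author, number, extension):
    """Join name elements from general to specific, separated by underscores."""
    parts = [
        project,                  # project acronym, 2-5 letters
        description,              # short description of the content
        when.strftime("%Y%m%d"),  # date as YYYYMMDD
        author,                   # surname first, e.g. KowalskiJ
        f"{number:03d}",          # zero-padded so that files sort correctly
    ]
    for part in parts:
        if not SAFE_PART.match(part):
            raise ValueError(f"unsafe characters in name element: {part!r}")
    return "_".join(parts) + "." + extension


name = build_filename("SOIL", "pH-readings", date(2024, 5, 28), "KowalskiJ", 1, "csv")
print(name)  # SOIL_pH-readings_20240528_KowalskiJ_001.csv
```

Because the date and the number have a fixed width, an ordinary alphabetical sort of such names is also a chronological and numerical sort.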

Data version management

When working with data or documents, it is necessary to store different versions of them. This minimises the risk of data loss and makes it possible to return to an earlier version if an error occurs. For that to work, the researcher needs to know which version is which.

The simplest way is to append a version number, a date and/or the researcher's initials to the file name, e.g.:

  • file_name_v02.pdf – this is the second main version of the file
  • file_name_v02-01.pdf – this is the first revision of version 2
  • file_name_20230915.pdf – this is the version dated 15 September 2023
  • file_name_AN.pdf – this is the version prepared/revised by Anna Nowak

File names must be adapted to the nature of the conducted research. It is important to label the versions so that they are clear to the author as well as the entire research team.
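
As an illustration of the version-number scheme shown above (file_name_v02.pdf, file_name_v02-01.pdf), the following sketch parses the suffix, picks the newest file from a list and proposes the name of the next main version. The function names and the assumption that version numbers always have exactly two digits are hypothetical and would need adapting to the convention actually in use.

```python
import re

# Matches names such as file_name_v02.pdf or file_name_v02-01.pdf
VERSION = re.compile(
    r"^(?P<stem>.+)_v(?P<major>\d{2})(?:-(?P<minor>\d{2}))?\.(?P<ext>\w+)$"
)


def parse_version(filename):
    """Return (stem, major, minor, ext); minor is 0 when absent."""
    m = VERSION.match(filename)
    if m is None:
        raise ValueError(f"no version suffix found in {filename!r}")
    return m["stem"], int(m["major"]), int(m["minor"] or 0), m["ext"]


def latest(filenames):
    """Pick the file with the highest (major, minor) version."""
    return max(filenames, key=lambda f: parse_version(f)[1:3])


def next_major(filename):
    """Name for the next main version of the given file."""
    stem, major, _minor, ext = parse_version(filename)
    return f"{stem}_v{major + 1:02d}.{ext}"


files = ["file_name_v01.pdf", "file_name_v02.pdf", "file_name_v02-01.pdf"]
newest = latest(files)
print(newest)              # file_name_v02-01.pdf
print(next_major(newest))  # file_name_v03.pdf
```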

Metadata

Metadata is structured information that describes, explains, locates and otherwise facilitates the finding, use or management of an information resource. Metadata is often described as ‘data about data’ or ‘information about information’.
National Information Standards Organization

Research data metadata is the basic information used to describe the entire dataset made public, e.g. in a repository. This description should be prepared according to certain established principles.

There are many metadata standards: general, domain-specific and institutional. The general standards include Dublin Core, DataCite and the Data Documentation Initiative (DDI). They are universal and widely used.

RODBUK – Cracow Open Research Data Repository – uses the Dublin Core standard to describe the deposited research data.

Dublin Core standard fields:

  • date;
  • format;
  • identifier;
  • language;
  • description;
  • relation;
  • rights;
  • type;
  • subject;
  • creator;
  • title;
  • contributor;
  • publisher;
  • coverage;
  • source.
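
To make the idea of a Dublin Core description more concrete, here is a minimal sketch that stores a few of the fields listed above in a dictionary and serialises them as XML using the Dublin Core element namespace. The field values, the `metadata` wrapper element and the `to_dublin_core_xml` function are assumptions for illustration only; this is not the export format of RODBUK or of any particular repository.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical values, for illustration only.
record = {
    "title": "Soil pH measurements 2024",
    "creator": "Kowalski, Jakub",
    "date": "2024-05-28",
    "format": "text/csv",
    "language": "en",
    "rights": "CC0 1.0",
    "type": "Dataset",
}


def to_dublin_core_xml(fields):
    """Serialise a dict of Dublin Core fields as a small XML document."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_dublin_core_xml(record)
print(xml_text)
```

In practice the repository's deposit form usually collects these fields directly, so a script like this is mainly useful for preparing or checking descriptions in bulk.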

README file

README is a file that is prepared for each dataset and included when the data is deposited in the repository. It should be prepared in an open format, e.g. .txt, and preferably in English, even if the project is in Polish; alternatively, it may be provided in two language versions. The README file is intended to provide all the necessary information for the proper understanding, interpretation and reuse of the data. Creating a README file at the beginning of the research process and consistently updating it during the research will facilitate the preparation of the final README file when the data are ready to be deposited.

What are the reasons for preparing such a file and attaching it to a dataset?

  • reliable description of the project and the data will allow a better understanding of the contents of the dataset;
  • information on which software to use for closed formats, if any, will speed up the use of the data;
  • description of how the data are structured in the dataset will allow the data to be fully verified and used;
  • comprehensive and complete description can increase interest in the research, which may result in new cooperation opportunities.

The README file should contain:

  • title of dataset, description and purpose of study;
  • name (ORCID)/institution/contact details;
  • information on the method and procedures for data collection;
  • time scope of the study;
  • research tools;
  • data organising structure:
    • folder structure;
    • file naming system (with examples);
    • relationships and dependencies between files;
    • other documentation files in the dataset (notes, accompanying files);
    • for each major file, a brief description of its contents and date of creation;
    • description of the file versioning system, if applicable;
  • software used for data collection and processing, including version numbers;
  • file formats used in the data collection and recommended software;
  • quality control procedures applied;
  • logbook of changes to the dataset;
  • licence under which the data are made accessible.

In the case of extensive documentation, it’s a good idea to prepare a table of contents at the beginning of the document, linking to the relevant headings.

It is important that the readme file is properly positioned – it should be displayed first in the structure of all files added to the dataset. Dataverse organizes files alphabetically, so it is sufficient to add zeros before its name, e.g., 00_readme, to ‘force’ the positioning of the readme file.
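
A README skeleton covering the elements listed above can be generated at the start of a project and filled in as the research progresses. The sketch below is one possible approach: it produces a plain-text file body with a table of contents followed by empty numbered sections. The section wording, the placeholder text and the `readme_skeleton` function are hypothetical, and a repository's own README template takes precedence where one exists.

```python
# Section headings based on the checklist above.
SECTIONS = [
    "Title of dataset, description and purpose of study",
    "Name (ORCID) / institution / contact details",
    "Method and procedures for data collection",
    "Time scope of the study",
    "Research tools",
    "Data organising structure",
    "Software used (with version numbers)",
    "File formats and recommended software",
    "Quality control procedures",
    "Logbook of changes",
    "Licence",
]


def readme_skeleton(sections):
    """Return README text with a table of contents followed by empty sections."""
    lines = ["TABLE OF CONTENTS"]
    lines += [f"{i}. {title}" for i, title in enumerate(sections, start=1)]
    for i, title in enumerate(sections, start=1):
        lines += ["", f"{i}. {title.upper()}", "[to be completed]"]
    return "\n".join(lines) + "\n"


filename = "00_readme.txt"  # leading zeros keep the file first in an alphabetical listing
text = readme_skeleton(SECTIONS)
print(filename)
print(text)
```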

The example form of the README file.