Research Data and Open Research Data
Research Data
Open Research Data
Sources of research data
- raw data is collected data that has not been processed for use.
- observational data is captured in real-time, and is usually irreplaceable, for example sensor data, survey data, sample data, and neuro-images.
- experimental data is captured from lab equipment. It is often reproducible, but this can be expensive. Examples of experimental data are gene sequences, chromatograms, and toroid magnetic field data.
- simulation data is generated from test models where model and metadata are more important than output data. For example, climate models and economic models.
- derived or compiled data has been transformed from pre-existing data points. It is reproducible if lost, but this would be expensive. Examples are data mining, compiled databases, and 3D models.
- reference or canonical data is a static or organic conglomeration or collection of smaller (peer-reviewed) datasets, most probably published and curated. For example, gene sequence databanks, chemical structures, or spatial data portals.
Types of research data
- diary
- laboratory notebook, field notebooks, notes from experiments
- laboratory reports, methodology descriptions
- text files, spreadsheet
- survey, interview questionnaire
- test responses
- photographs, slides
- presentations
- audiotapes and videotapes
- artefacts, specimens, samples
- data files
- standard procedure, operations protocol
- mathematical models, algorithms
- software
- computer simulation results
- transcripts, codebooks
File formats
Your file format influences your ability to open a file at a later date. Proprietary file formats require the proper version of the proprietary software. Non-proprietary, or open, formats are more inter-operable and thus more durable. Saving your data in open, unencrypted and uncompressed formats will make your data more usable for years to come. If you can’t save your data in an open format, consider including the software name, version, and parent company in the accompanying readme.txt file for future users.
Source: File Formats – Research Data Management – Best Practices – Research Guides at Ohio State University
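As a minimal sketch of this advice, the snippet below writes tabular data to CSV (an open, plain-text format) and stores a note about the original software in readme.txt. The function name, the file name `measurements.csv` and the wording of the note are assumptions made for illustration only.

```python
import csv
from pathlib import Path


def save_open_format(rows, fieldnames, target_dir, software_note):
    """Write rows as UTF-8 CSV and record software details in readme.txt."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name; CSV is unencrypted, uncompressed plain text.
    data_path = target / "measurements.csv"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Note the software name, version and vendor for future users.
    readme_path = target / "readme.txt"
    readme_path.write_text(software_note + "\n", encoding="utf-8")
    return data_path, readme_path
```

The same pattern could be applied to other open formats; CSV is used here only because it needs nothing beyond the standard library.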
FAIR Principles
The FAIR Principles are constantly evolving as more and more organizations and institutions become interested in implementing good research data management practices.
More information about the FAIR Principles can be found here:
Release licenses for research data
| CREATIVE COMMONS LICENSES FOR RESEARCH DATA | |
|---|---|
| CC0 – No Rights Reserved (NCN preferred) | Dedicates the dataset to the public domain, so that users can use it without restrictions and without any obligations |
| CC BY 4.0 – Attribution (accepted by NCN until 31.12.2025) | Allows users to copy, modify, distribute and create new works or collections based on the licensed dataset, including for commercial purposes, provided that the authorship of the dataset is indicated |
| CC BY-NC 4.0 – Attribution-NonCommercial | Allows users to copy, modify and distribute the licensed dataset for non-commercial purposes only, provided that the authorship of the dataset is indicated |
| CC BY-SA 4.0 – Attribution-ShareAlike | Allows users to copy, modify and distribute the dataset, provided that they credit the author and share the original and modified data under the same licence |
| CC BY-NC-SA 4.0 – Attribution-NonCommercial-ShareAlike | Allows users to copy, modify and distribute the dataset for non-commercial purposes only, provided that authorship is indicated and that both the original and modified data are made available under the same licence |
| CC BY-ND 4.0 – Attribution-NoDerivatives | Allows users to re-use the dataset, provided that authorship is indicated. However, the licence does not allow modification of the dataset. It is not advisable for licensing research data, as it makes it virtually impossible to continue working on the data |
| CC BY-NC-ND 4.0 – Attribution-NonCommercial-NoDerivatives | Allows users to download and share the dataset, provided that authorship is indicated. The dataset cannot be modified or used commercially. This is the most restrictive of the licences. It is not advisable for licensing research data, as it makes it virtually impossible to continue working on the data |
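
The licence codes in the table are built from a few recurring elements (BY, NC, SA, ND), so the permissions can be read off the code itself. The sketch below illustrates this; the function names and the returned dictionary keys are assumptions chosen for this example, and the result is no substitute for reading the licence text.

```python
def cc_permissions(code):
    """Derive basic permissions from a label such as "CC0" or "CC BY-NC-SA 4.0"."""
    if code.strip().upper().startswith("CC0"):
        # Public domain dedication: no conditions attached.
        return {"attribution": False, "commercial_use": True,
                "modification": True, "share_alike": False}
    # Take the element list, e.g. "BY-NC-SA" from "CC BY-NC-SA 4.0".
    elements = code.upper().split()[1].split("-")
    return {
        "attribution": "BY" in elements,        # authorship must be indicated
        "commercial_use": "NC" not in elements,  # NC forbids commercial use
        "modification": "ND" not in elements,    # ND forbids modification
        "share_alike": "SA" in elements,         # SA requires the same licence
    }


def advisable_for_research_data(code):
    """Following the table, licences with the ND element are not advisable."""
    return cc_permissions(code)["modification"]
```
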
| DATABASE LICENSES | |
|---|---|
| PDDL (Public Domain Dedication and License) | The public domain for databases; implies the unrestricted possibility to download, share and modify the database |
| ODC-BY (Open Data Commons Attribution License) | Allows copying, modification and distribution of the database, provided that the authorship of the database is indicated |
| ODbL (Open Data Commons Open Database License) | Allows copying, modification and distribution of the database, provided that the authorship of the database is indicated and that the modified database is distributed under the same conditions as the original database |
| SOFTWARE LICENSES | |
|---|---|
| GNU GPL – General Public License | Allows the programme to be run, analysed, distributed and improved for any purpose. Derivative works must be made available under this licence, including modified source code |
| GNU LGPL – Lesser General Public License | Allows the programme to be run, analysed, distributed and improved for any purpose. Copyleft applies only to the licensed code itself: modified versions of that code must be released under the same licence together with their source, while other software that merely links to it may be distributed under different terms |
Organisation of folders
Proper organisation of research data is extremely important to avoid confusion in project files. The organisation of files must be understandable to the author, the entire research team and any potential individual with access to the data.
When working in a group or preparing to share a dataset, it is essential to use the clearest possible folder structure.
Furthermore:
- the structure should be agreed and accepted by all researchers;
- folder names should always be short and unambiguous so that it is immediately clear what data is in the folder;
- if the folder structure is complex due to the scope of the project, each major collection should be accompanied by a README file characterising the collection;
- the hierarchy of folders should be coherent and logical (starting from a general folder and moving on to more detailed ones). The structure should be neither too deep nor too shallow; depending on the size of the project, 3-4 levels are usually enough;
- as part of a storage strategy, it may be useful to additionally define ‘temporary folders’ from which data can be safely deleted after use.
Things to avoid:
- naming a general folder “current stuff”;
- naming folders after researchers (folder names should refer to the content, not the authors);
- creating folders with the same names in different locations;
- creating copies of files in different folders; if necessary, use shortcuts that point to the reference file.
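The recommendations above can be sketched as a small script that creates an agreed folder layout, refuses paths deeper than four levels, adds a README stub to each major collection and reports folder names that occur in more than one location. The project acronym `SOIL`, the folder names and the function names are hypothetical.

```python
from pathlib import Path

# Hypothetical layout for a project with the acronym "SOIL".
LAYOUT = [
    "SOIL/01_raw/sensors",
    "SOIL/01_raw/surveys",
    "SOIL/02_processed",
    "SOIL/03_analysis/scripts",
    "SOIL/04_documentation",
    "SOIL/tmp",  # temporary folder; contents may be deleted after use
]
MAX_DEPTH = 4


def create_layout(base_dir, layout=LAYOUT, max_depth=MAX_DEPTH):
    """Create the folders and a README stub in each major collection."""
    base = Path(base_dir)
    created = []
    for relative in layout:
        parts = Path(relative).parts
        if len(parts) > max_depth:
            raise ValueError(f"{relative} is deeper than {max_depth} levels")
        folder = base / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)

        # A "major collection" is taken here to be a second-level folder.
        collection = base.joinpath(*parts[:2])
        readme = collection / "README.txt"
        if not readme.exists():
            readme.write_text(
                f"Describe the contents of {collection.name} here.\n",
                encoding="utf-8",
            )
    return created


def duplicate_folder_names(base_dir):
    """Return folder names that occur in more than one location."""
    seen = {}
    for path in Path(base_dir).rglob("*"):
        if path.is_dir():
            seen.setdefault(path.name, []).append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

Keeping the layout in one list makes it easy for the whole team to review and agree on the structure before any data is stored.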
Naming files
File names can provide a lot of information about the content of the files. They should be coherent, logical, descriptive, short and clear. During teamwork, it is necessary to establish a naming convention to avoid mistakes.
What a file name can contain:
- acronym of the current project or experiment (2-5 letters), so that it is clear what the file refers to;
- short description of the content (1-3 words);
- information about the location or coordinates (if needed);
- date;
- initials of the person (researcher or entity), or the whole surname and first name, always starting with the surname, e.g. KowalskiJ or Kowalski-Jakub
Useful tips:
- the elements of description should be organised from the general to the specific;
- spaces should be avoided; the following alternatives may be used and combined to ensure readability:
  - CamelCase, a notation in which consecutive words are written together and each successive word (except the first) begins with a capital letter, for example: foreColor, setConnection, isPaymentPosted;
  - hyphens (-);
  - underscores (_);
- when numbering files, always use multiple digits (e.g. 001 instead of 1) to avoid sorting problems;
- when using dates, always use the ISO standard (year first, then month and day): YYYYMMDD or YYYY-MM-DD, e.g. 20240528 or 2024-05-28. This can be shortened to a year or a year and month, depending on your needs and context;
- when including the time, use the HHMMSS format (hours, minutes, seconds);
- never use special characters such as: ę ! ? * & #.
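A naming convention of this kind can be captured in a short helper so that every team member produces names in the same way. The sketch below is one possible convention, not a prescribed one: the order of the elements, the underscore separator and the function name `build_file_name` are assumptions.

```python
import re
from datetime import date

# Only ASCII letters, digits, hyphens and underscores, plus one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def build_file_name(acronym, description, day, initials, sequence, extension):
    """Join the elements from general to specific, separated by underscores."""
    parts = [
        acronym,                 # project or experiment acronym
        description,             # short description, e.g. in CamelCase
        day.strftime("%Y%m%d"),  # ISO-style date, year first
        initials,                # e.g. surname followed by first initial
        f"{sequence:03d}",       # zero-padded number to keep sorting correct
    ]
    name = "_".join(parts) + "." + extension
    if not SAFE_NAME.fullmatch(name):
        raise ValueError(f"spaces or special characters in file name: {name!r}")
    return name


print(build_file_name("SOIL", "SensorLog", date(2024, 5, 28), "KowalskiJ", 1, "csv"))
# prints SOIL_SensorLog_20240528_KowalskiJ_001.csv
```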
Data version management
When working with data or documents, it is necessary to store different versions of them. This minimizes the risk of data loss and makes it possible to return to an earlier version if an error occurs. For this to work, the researcher needs to know which version is which.
The simplest way is to add a version number, a date and/or the researcher’s initials to the file name, e.g.:
- file_name_v02.pdf – this is the second main version of the file
- file_name_v02-01.pdf – this is the first revision of version 2
- file_name_20230915.pdf – this is the version dated 15 September 2023
- file_name_AN.pdf – this is the version prepared/revised by Anna Nowak
File names must be adapted to the nature of the conducted research. It is important to label the versions so that they are clear to the author as well as the entire research team.
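For the numbered scheme shown in the examples (`_v02` for a main version, `_v02-01` for a revision of it), the next file name can be derived automatically. The sketch below assumes two-digit numbers and exactly that suffix pattern; the function name `next_version` is hypothetical.

```python
import re

# Matches names such as "file_name_v02" or "file_name_v02-01" (no extension).
VERSION_SUFFIX = re.compile(
    r"^(?P<stem>.+)_v(?P<major>\d{2})(?:-(?P<minor>\d{2}))?$"
)


def next_version(file_name, major_change=False):
    """Return the file name of the next revision or the next main version."""
    name, _, extension = file_name.rpartition(".")
    match = VERSION_SUFFIX.match(name)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    major = int(match["major"])
    minor = int(match["minor"] or 0)  # no "-NN" part means revision 0
    if major_change:
        return f"{match['stem']}_v{major + 1:02d}.{extension}"
    return f"{match['stem']}_v{major:02d}-{minor + 1:02d}.{extension}"


print(next_version("file_name_v02.pdf"))                        # file_name_v02-01.pdf
print(next_version("file_name_v02-01.pdf", major_change=True))  # file_name_v03.pdf
```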
Metadata
Research data metadata is the basic information used to describe the entire dataset made public, e.g. in a repository. This description should be prepared according to certain established principles.
There are many metadata standards: general, domain-specific and institutional. The general standards include Dublin Core, DataCite and the Data Documentation Initiative (DDI). They are universal and widely used.
RODBUK – Cracow Open Research Data Repository – uses the Dublin Core standard to describe the deposited research data.
Dublin Core standard fields:
- date;
- format;
- identifier;
- language;
- description;
- relation;
- rights;
- type;
- subject;
- creator;
- title;
- contributor;
- publisher;
- coverage;
- source.
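A metadata description using these fields can be sketched as a simple key-value record. The element names below follow the Dublin Core Metadata Element Set (which uses `relation` and `contributor`); the sample values, the function name `make_record` and the JSON output are assumptions for illustration and do not reflect how RODBUK stores metadata internally.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a record, rejecting keys that are not Dublin Core elements."""
    unknown = set(fields) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise KeyError(f"not Dublin Core elements: {sorted(unknown)}")
    # Keep a stable element order regardless of the order of the arguments.
    return {name: fields[name] for name in DUBLIN_CORE_ELEMENTS if name in fields}


# Hypothetical values for an imaginary dataset.
record = make_record(
    title="Soil moisture measurements (hypothetical dataset)",
    creator="Kowalski, Jakub",
    date="2024-05-28",
    format="text/csv",
    language="en",
    rights="CC0",
    type="Dataset",
)
print(json.dumps(record, indent=2))
```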
README file
What are the reasons for preparing such a file and attaching it to a dataset?
- reliable description of the project and the data will allow a better understanding of the contents of the dataset;
- information on which software to use for closed formats, if any, will speed up the use of the data;
- description of how the data are structured in the dataset will allow the data to be fully verified and used;
- comprehensive and complete description can increase interest in the research, which may result in new cooperation opportunities.
The README file should contain:
- title of dataset, description and purpose of study;
- author's name (with ORCID), institution and contact details;
- information on the method and procedures for data collection;
- time scope of the study;
- research tools;
- data organising structure:
  - folder structure;
  - file naming system (with examples);
  - relationships and dependencies between files;
  - other documentation files in the dataset (notes, accompanying files);
  - for each major file, a brief description of its contents and date of creation;
  - description of the file versioning system, if applicable;
- software used for data collection and processing, including version numbers;
- file formats used in the data collection and recommended software;
- quality control procedures applied;
- logbook of changes to the dataset;
- licence under which the data are made accessible.
In the case of extensive documentation, it’s a good idea to prepare a table of contents at the beginning of the document, linking to the relevant headings.
It is important that the readme file is properly positioned – it should be displayed first in the structure of all files added to the dataset. Dataverse organizes files alphabetically, so it is sufficient to add zeros before its name, e.g., 00_readme, to ‘force’ the positioning of the readme file.
An example form of the README file.
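In plain text, a README skeleton covering the items listed above could be generated as sketched below. The section titles are condensed from that list; the layout of the file, the placeholder text and the function names are assumptions, not a required template.

```python
from pathlib import Path

# Section titles condensed from the list of README contents above.
README_SECTIONS = [
    "Title of dataset, description and purpose of study",
    "Author (ORCID), institution, contact details",
    "Data collection methods and procedures",
    "Time scope of the study",
    "Research tools",
    "Data organising structure",
    "Software used (with version numbers)",
    "File formats and recommended software",
    "Quality control procedures",
    "Logbook of changes",
    "Licence",
]


def render_readme(sections=README_SECTIONS):
    """Return the skeleton text: a table of contents, then empty sections."""
    lines = ["TABLE OF CONTENTS"]
    lines += [f"{i}. {title}" for i, title in enumerate(sections, start=1)]
    for i, title in enumerate(sections, start=1):
        lines += ["", f"{i}. {title.upper()}", "[to be completed]"]
    return "\n".join(lines) + "\n"


def write_readme(dataset_dir, file_name="00_readme.txt"):
    """Write the skeleton; the "00_" prefix keeps it first in alphabetical order."""
    path = Path(dataset_dir) / file_name
    path.write_text(render_readme(), encoding="utf-8")
    return path
```

Calling `write_readme("my_dataset")` on an existing folder would create `my_dataset/00_readme.txt`, ready to be filled in by the research team.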







