Contents
- Glossary of Key Terms
- Key Takeaways
- What Is Research Data Management?
- The Research Data Lifecycle
- The FAIR Principles: The Foundation of Good RDM
- Step 1: Start with Policies, Ethics, and Legal Compliance
- Step 2: Build a Sound Data Collection Strategy
- Step 3: Organize Your Files and Folders
- Step 4: Document Everything: Metadata and README Files
- Step 5: Store Data Securely Using the 3-2-1 Rule
- Step 6: Develop a Data Management Plan (DMP)
- Step 7: Choose Open and Standard File Formats
- Step 8: Share Your Data: The Open Science Imperative
- Practical Implementation: Getting Started
- Continuous Self-Monitoring
- Frequently Asked Questions
Glossary of Key Terms
| Term | Definition |
| Research Data Management (RDM) | The process of collecting, organizing, storing, securing, sharing, and preserving research data across all stages of the research lifecycle |
| Data Lifecycle | The sequence of stages data passes through: creation/collection → processing → analysis → storage → sharing → preservation/reuse |
| FAIR Principles | A framework stating that data should be Findable, Accessible, Interoperable, and Reusable |
| Data Management Plan (DMP) | A formal document outlining how data will be collected, stored, shared, and preserved throughout and after a research project |
| Metadata | Structured data that describes other data: for example, who created a file, when, using what method, and in what format |
| Data Repository | A digital archive where datasets can be deposited and made accessible to other researchers |
| Open File Format | A non-proprietary file format that can be opened without specific paid software and is more likely to remain accessible over time |
| Persistent Identifier (PID) | A long-lasting reference to a digital resource, such as a DOI (Digital Object Identifier), that makes data reliably citable and findable |
| De-identification | The process of removing or masking personally identifiable information from a dataset to protect participant privacy |
| Version Control | A system for tracking changes to files over time, allowing earlier versions to be retrieved |
| Electronic Lab Notebook (ELN) | A digital tool that replaces or supplements traditional paper lab notebooks, enabling structured, searchable documentation of research activities |
| Open Science | A movement that promotes transparent, reproducible, and openly accessible research, including open data, open methods, and open publications |
| Data Governance | The set of policies, roles, and processes that determine how data is managed, accessed, and protected within an organization |
| 3-2-1 Backup Rule | A data backup strategy: maintain 3 copies of data, on 2 different storage media, with 1 copy stored offsite |
Key Takeaways
- Research Data Management is the process of providing appropriate labeling, storage, and access for data at all stages of a research project.
- RDM encompasses all data-related activities across the entire data lifecycle, from collection and processing through to storage, sharing, and long-term preservation.
- Well-organized research data can help researchers improve efficiency, meet funding and regulatory requirements, and optimize the potential of their research, leading to publication, funding, and collaboration opportunities.
- The FAIR principles (Findable, Accessible, Interoperable, and Reusable) serve as the foundational framework for modern RDM and should guide every decision about how data is documented, stored, and shared.
- A Data Management Plan (DMP) is not optional busywork: it is a living document that helps researchers anticipate needs, stay compliant, and maximize the long-term value of their data.
- Open and standard file formats ensure that research data remain accessible and usable over time by avoiding dependencies on proprietary software that may not be supported in the future.
- Data publication promotes transparency, credibility, long-term accessibility, reproducibility, and collaboration. Researchers stand to benefit personally through greater recognition of their work.
- Continuous self-monitoring (setting regular intervals to review data practices, assess progress, and refine workflows) is just as important as the initial setup of any RDM system.
What Is Research Data Management?
Research data is changing in scale and complexity at a pace that was hard to imagine two decades ago. In the 20th century, it was common for a study or experiment to yield a single file, perhaps one data table. Today, research projects often generate many files, frequently created by multiple collaborators and valuable for secondary use. Experiments in genomics can generate multiple raw files per biological sample, plus layers of processed data.
Research Data Management is the process of providing appropriate labeling, storage, and access for data at all stages of a research project. It is not a single activity but a continuous discipline that spans the entire life of a research project and often extends well beyond it.
Many funding agencies now require a data management or data sharing plan to be submitted with grant applications. Many academic journals also require the submission of relevant data alongside manuscripts to promote open access and reproducibility. Early and attentive management at each step of the data lifecycle helps ensure the discoverability and longevity of your research.
Why Should Researchers Care?
Beyond compliance, good RDM makes practical sense:
- Organized, well-documented data is simply easier to analyze
- You can find your own files when you need them, sometimes years later
- You avoid drowning in irrelevant or duplicate data
- You protect against data loss from accidents, equipment failure, or staff turnover
- You get credit for your data and avoid accusations of misconduct
- You enable other researchers to build on your work, amplifying your impact
The Research Data Lifecycle
Understanding RDM starts with understanding the data lifecycle: the sequence of stages your data passes through from creation to long-term use. The lifecycle is typically represented as a cycle rather than a straight line, because data created in one project frequently becomes the raw material for future research.
| Lifecycle Stage | What Happens | Key RDM Activities |
| Plan | Research design, DMP creation | Defining data needs, legal/ethical review, storage planning |
| Collect | Data generation or acquisition | Naming conventions, quality control, format selection |
| Process & Analyze | Cleaning, transforming, analyzing | Version control, documentation, code management |
| Store | Securing data during the project | Backup implementation, access controls, encryption |
| Share & Publish | Making data available | Repository selection, licensing, persistent identifiers |
| Preserve & Reuse | Long-term archiving | Format migration, metadata maintenance, enabling secondary use |
The FAIR Principles: The Foundation of Good RDM
FAIR refers to the findability, accessibility, interoperability, and reusability of digital assets. Every step in research data management is closely connected to FAIR.
What Each Principle Means in Practice
- Findable: Metadata and data should be easy to locate for both humans and computers. This means rich metadata, clear naming, and the use of persistent identifiers like DOIs.
- Accessible: There should be clarity on how data can be retrieved, including what authentication or authorization is required. “Accessible” does not necessarily mean “open to everyone”; it means the access conditions are clearly defined.
- Interoperable: Data should be structured in a way that enables it to work with other datasets, tools, and workflows. This is primarily achieved through the use of standard formats and shared vocabularies.
- Reusable: Metadata and data should be sufficiently well-described so that others can reproduce, replicate, or build upon the work in different settings.
FAIR applies not only to data but also to metadata and the relevant infrastructure.
Step 1: Start with Policies, Ethics, and Legal Compliance
Before collecting a single data point, researchers need to understand the landscape of rules that govern their work.
Applicable Policies
Countries, umbrella organizations, science foundations, professional societies, institutions, funding bodies, and project boards may all issue policies and guidelines. These typically reflect best practices, outline legal issues, or offer suggestions for efficiency and resource management. Researchers should check for applicable policies and compliance requirements in their subject area by consulting funder websites, institutional research support offices, or DMP tools.
Ethical Regulations and Legislation
Relevant legislation typically pertains to data collection and sharing and is designed to safeguard the personal data of individuals. For example, under the European Union's General Data Protection Regulation (GDPR), individuals have the right to know which information about them is collected, processed, and transmitted. Ethics committees review proposals for ethical compliance and aim to ensure the rights, safety, and well-being of participants.
Key considerations before starting data collection:
- Does this project involve human participants? If so, IRB/ethics approval is likely required.
- Does the data contain personally identifiable information (PII)? If so, GDPR, HIPAA, or local equivalents may apply.
- Are there intellectual property considerations, for example when collaborating with industry partners?
- What data retention periods are required by your funder or institution?
Step 2: Build a Sound Data Collection Strategy
To reach well-founded conclusions, researchers require quality data. Understanding how the research question translates into specific data needs (what data are required, in what form, and what insights are expected) is one of the first steps in conducting research.
Reusing Existing Data
Researchers should first check published data to examine whether it can be integrated into the work. Reusing data can be a tremendous benefit and save significant resources, especially labor and material costs, as well as data retention costs in projects with high data volumes.
Collecting New Data
If new original data are required, the guiding principle is to collect as much as required, but no more than necessary. Running a sample size calculation before collecting data helps determine the minimum amount of data required. Data collected beyond requirements need additional resources for administration, processing, and storage.
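As an illustration of a sample size calculation, the sketch below uses the standard normal approximation for comparing the means of two equally sized groups. The function name, the default significance level of 0.05, and the default power of 0.80 are assumptions chosen for the example; the right method depends on the study design, so treat this as a starting point rather than a recommendation.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided comparison of two means.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2,
    where d is the standardized effect size (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)  # round up: a fraction of a participant cannot be recruited


print(sample_size_per_group(0.5))  # medium effect: 63 per group
print(sample_size_per_group(0.2))  # small effect: 393 per group
```

The gap between the two results shows why the expected effect size should be justified before data collection starts: halving it roughly quadruples the amount of data needed.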
Ensuring Data Quality
Data quality is the degree to which the data at hand are free of errors and fit for their intended purpose. The goal of data quality efforts is to ensure that the data represent the real-world entities they measure as accurately and completely as possible. Best practices for data quality vary by discipline. In the social sciences, validation through triangulation is common, while in physics, calibration of instruments ensures accuracy.
Step 3: Organize Your Files and Folders
A systematic organization of files and working directories is key to efficient filing, navigation, and prompt file retrieval. Using the same filing scheme across projects and teams can simplify and accelerate interactions.
Folder Structure
Clean working directories have a logical and uniform structure: a standardized folder structure and depth, as well as default folder names. The project folder ideally contains all project files in its logically subdivided subfolders. Using a similar folder structure across projects can facilitate data retrieval and promote standardization while allowing for variations as necessary.
A typical folder hierarchy might look like this:
ProjectName/
├── raw_data/
├── processed_data/
├── analysis/
├── manuscripts/
├── protocols/
├── code/
└── admin/
    ├── DMP/
    └── ethics/
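A hierarchy like this can be created the same way for every new project with a few lines of Python. The sketch below is one possible approach; the function name is hypothetical, and it assumes that DMP/ and ethics/ sit inside admin/.

```python
from pathlib import Path

# Subfolders from the example hierarchy; DMP/ and ethics/ are assumed to live under admin/.
SUBFOLDERS = [
    "raw_data",
    "processed_data",
    "analysis",
    "manuscripts",
    "protocols",
    "code",
    "admin/DMP",
    "admin/ethics",
]


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the standard subfolders under `root` and return their paths."""
    created = []
    for sub in SUBFOLDERS:
        folder = root / sub
        # exist_ok=True makes the function safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_skeleton(Path("ProjectName"))` at the start of each project keeps folder names and depth uniform across a team.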
File Naming Conventions
Make file names descriptive and easy to sort and filter. Incorporate dates, version numbers, and project codes within the file name. A consistent file naming structure might include: [ProjectCode]_[DocumentType]_[Version]_[Date], with further elements (such as a sequence number or author initials) added where useful. For example: TenR_Man_01_MJH_v01_2025-01-01.docx
Key rules for file naming:
- Use ISO date format (YYYY-MM-DD) for easy chronological sorting
- Avoid spaces: use underscores or hyphens instead
- Avoid special characters that may cause issues across operating systems
- Be consistent across your entire team
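These rules are easier to follow when file names are generated and checked by a small helper rather than typed by hand. The sketch below covers only the basic four-part pattern [ProjectCode]_[DocumentType]_[Version]_[Date]; the function names and the regular expression are assumptions for illustration and would need extending for names with extra elements.

```python
import re
from datetime import date

# Shape check: Project_DocType_vNN_YYYY-MM-DD.ext, with no spaces or special characters.
NAME_PATTERN = re.compile(
    r"^[A-Za-z0-9-]+_[A-Za-z0-9-]+_v\d{2,}_\d{4}-\d{2}-\d{2}\.[A-Za-z0-9]+$"
)


def is_valid_filename(name: str) -> bool:
    """Return True if `name` has the expected shape (it does not check the date is real)."""
    return NAME_PATTERN.match(name) is not None


def build_filename(project: str, doc_type: str, version: int, day: date, ext: str) -> str:
    """Assemble a file name with a zero-padded version and an ISO date."""
    name = f"{project}_{doc_type}_v{version:02d}_{day.isoformat()}.{ext}"
    if not is_valid_filename(name):
        raise ValueError(f"name breaks the convention: {name!r}")
    return name


print(build_filename("TenR", "Man", 1, date(2025, 1, 1), "docx"))
# TenR_Man_v01_2025-01-01.docx
print(is_valid_filename("final draft (2).docx"))
# False
```

Zero-padding the version number and using ISO dates means an alphabetical sort is also a chronological one.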
Version Control
Have a system in place to track file versions and activities. This will help you identify any changes made to the original file. If you are ever unsure about a change, you can go back and find out who made it and why.
Step 4: Document Everything: Metadata and README Files
Orderly and standardized documentation of both data and its collection method is key to understanding and using any kind of data. Data are rarely self-explanatory; with sufficient documentation, the work remains transparent and reproducible and is less likely to be misinterpreted.
What Metadata Should Include
Metadata comprises information on when data were created, by whom, and with which method. It may also include file sizes, formats, and languages. Common forms include README files, data dictionaries, or computer-readable XML/JSON files.
A README file should minimally contain:
| README Element | Description |
| Explanation of data included | What the dataset contains and what it represents |
| Original purpose / project affiliation | The research question and project the data was collected for |
| Author(s) / Creator(s) | Who collected or generated the data |
| Date/period of data creation | When the data was collected or generated |
| Software or hardware requirements | What is needed to open or process the files |
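One way to make sure no element is forgotten is to generate the README from a fixed list of fields. The following is a minimal sketch; the dictionary keys, the function name, and the plain-text layout are assumptions, not a prescribed README standard.

```python
# README elements from the table above, paired with hypothetical dictionary keys.
README_FIELDS = [
    ("Explanation of data included", "description"),
    ("Original purpose / project affiliation", "purpose"),
    ("Author(s) / Creator(s)", "creators"),
    ("Date/period of data creation", "created"),
    ("Software or hardware requirements", "requirements"),
]


def render_readme(title: str, info: dict[str, str]) -> str:
    """Render a plain-text README, refusing to proceed if any element is empty."""
    missing = [key for _, key in README_FIELDS if not info.get(key)]
    if missing:
        raise ValueError(f"missing README fields: {', '.join(missing)}")
    lines = [title, "=" * len(title), ""]
    for label, key in README_FIELDS:
        lines.append(f"{label}: {info[key]}")
    return "\n".join(lines) + "\n"
```

Writing the result to `README.txt` in the dataset folder keeps the documentation next to the data it describes.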
Data Dictionaries
For tabular data especially, a data dictionary is invaluable. It explains what each column or variable represents, the data type, the unit of measurement, and the range of valid values. This is critical for any collaborator (including your future self) who needs to work with the data later.
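A data dictionary becomes even more useful when it is machine-readable, because the data can then be checked against it automatically. The sketch below assumes a small hypothetical dictionary with three made-up variables and validates CSV text against it; the variable names, units, and ranges are illustrative only.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> type, unit, and valid range.
DATA_DICTIONARY = {
    "participant_id": {"type": str, "unit": None, "min": None, "max": None},
    "age": {"type": int, "unit": "years", "min": 18, "max": 99},
    "weight": {"type": float, "unit": "kg", "min": 30.0, "max": 250.0},
}


def validate_rows(csv_text: str, dictionary: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    unknown = set(reader.fieldnames or []) - set(dictionary)
    if unknown:
        problems.append(f"undocumented columns: {sorted(unknown)}")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, spec in dictionary.items():
            raw = row.get(name)
            if raw is None or raw == "":
                problems.append(f"line {line_no}: {name} is missing")
                continue
            try:
                value = spec["type"](raw)
            except ValueError:
                problems.append(f"line {line_no}: {name}={raw!r} is not {spec['type'].__name__}")
                continue
            too_low = spec["min"] is not None and value < spec["min"]
            too_high = spec["max"] is not None and value > spec["max"]
            if too_low or too_high:
                problems.append(
                    f"line {line_no}: {name}={raw} is outside {spec['min']}..{spec['max']}"
                )
    return problems
```

Running such a check whenever new data arrives catches entry errors while the person who collected the data can still correct them.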
Step 5: Store Data Securely Using the 3-2-1 Rule
Storing files on a modern computer is easy. However, securing them over time and in a sustainable way, avoiding data corruption and data loss, requires following some simple principles.
What is the 3-2-1 Backup Rule?
Valuable data should be stored in accordance with the 3-2-1 backup rule: keep three separate instances of the data (the original and two backups) on at least two distinct storage devices, such as a local copy on a laptop plus network storage, with one backup kept offsite at a different location.
| Backup Layer | Example |
| Copy 1 (Primary) | Working files on your personal computer or workstation |
| Copy 2 (Local backup) | Institutional network drive, automatically backed up |
| Copy 3 (Offsite backup) | Cloud storage (institutional or commercial) in a different geographic location |
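Having three copies only helps if the copies are actually identical to the original. A simple way to check is to compare checksums. The sketch below compares a primary folder with one backup folder using SHA-256; the function names and report layout are assumptions, and it presumes both copies are reachable as local or mounted paths.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(primary: Path, backup: Path) -> dict[str, list[str]]:
    """Check every file under `primary` against the same relative path under `backup`."""
    report = {"ok": [], "missing": [], "mismatch": []}
    for source in sorted(p for p in primary.rglob("*") if p.is_file()):
        relative = source.relative_to(primary)
        target = backup / relative
        if not target.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(source) != sha256_of(target):
            report["mismatch"].append(relative.as_posix())
        else:
            report["ok"].append(relative.as_posix())
    return report
```

Any entry under `missing` or `mismatch` indicates a backup that is incomplete or corrupted and should be refreshed before it is needed.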
Access Control and Security
Data security is not just about preventing loss: it is also about controlling who can access sensitive data. Key security practices include:
- Setting appropriate access permissions for team members
- Encrypting sensitive data, especially on portable drives
- Using strong authentication (multi-factor authentication where possible)
- Reviewing and revoking access when team members leave a project
Data Retention
It is best practice to retain research data over time. Data retention requirements vary by country, funder, data type, and subject domain; a minimum of 10 years is common, and some require 25 years or more. Check your funder’s and institution’s specific requirements and document them in your DMP.
Step 6: Develop a Data Management Plan (DMP)
A DMP essentially outlines data collection, storage, and sharing strategies while describing how privacy, security, and policy compliance are ensured. Usually drafted alongside the project outline, a DMP accompanies the project throughout its lifecycle, maximizing the data’s value and impact.
Core Questions a DMP Should Answer
| DMP Category | Key Questions |
| Collection | What data will be re-used, collected, or created, and how? |
| Description | What formats and types will be collected? What hardware/software is required? |
| Standards | How will data be described to enable effective interpretation? Which metadata standards apply? |
| Policies, Legal & Ethics | Which policies and funder requirements apply? How will legal and ethical compliance be met? |
| Storage & Preservation | How will data be stored, secured, and preserved during and after the project? |
| Access | Who needs access during the project, and what authorization rules apply? |
| Sharing & Reuse | How will data be shared, and under what conditions or licenses? |
| Roles & Responsibilities | Who is responsible for each data-related step? |
| Budget | What financial implications arise from data storage, software, or publication? |
| Quality Control | How will data quality be ensured and monitored? |
Tools to Help Create a DMP
Several free web-based tools can guide researchers through the DMP process and incorporate funder-specific templates:
- DMPTool: widely used in the United States, with templates for NIH, NSF, and other funders
- DMPonline: commonly used in the UK and Europe
- RDMO: supports multiple European funder requirements and provides forms tailored to those requirements
A common pitfall is that researchers may create a DMP initially and then fail to regularly review and update it. This oversight can lead to consequences ranging from increased workload to data mismanagement.
Step 7: Choose Open and Standard File Formats
Open and standard file formats facilitate data handling within research groups and beyond, while ensuring that data remain readable and accessible over time. The practical recommendation: retain your original proprietary file if needed for active work, but always produce an open-format copy for archiving.
Recommended Formats by Data Type
| Data Type | Recommended Open Formats |
| Tabular data / Statistics | CSV, plain text (UTF-8) |
| Text documents | PDF/A (archival), TXT, ODT |
| Images / Photographs | TIFF, PNG, JPEG 2000 |
| Audio | FLAC, BWF (Broadcast Wave) |
| Video | MP4, MKV |
| Containers / Archives | ZIP, TAR |
| Scientific sequences (bioinformatics) | FASTA, FASTQ |
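Before archiving, it can help to list which files still lack an open-format copy. The sketch below does this by file extension only. The extension set is an assumption derived from the table above, and an extension check cannot tell PDF/A from ordinary PDF, so treat the result as a first pass rather than a guarantee.

```python
from pathlib import Path

# Extensions assumed to correspond to the recommended formats in the table above.
OPEN_EXTENSIONS = {
    ".csv", ".txt", ".pdf", ".odt",
    ".tif", ".tiff", ".png", ".jp2",
    ".flac", ".bwf", ".mp4", ".mkv",
    ".zip", ".tar", ".fasta", ".fastq",
}


def files_needing_conversion(names: list[str]) -> list[str]:
    """Return the file names whose extension is not in the open-format set."""
    return [n for n in names if Path(n).suffix.lower() not in OPEN_EXTENSIONS]


print(files_needing_conversion(["results.csv", "figure1.PNG", "notes.docx", "scan.xlsx"]))
# ['notes.docx', 'scan.xlsx']
```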
Step 8: Share Your Data: The Open Science Imperative
Data publication refers to making research datasets openly accessible for review and reuse. In the context of the reproducibility crisis, data sharing plays a key role in enabling research reproducibility. Research data publication promotes transparency, credibility, long-term accessibility, reproducibility, and collaboration.
What Data Should be Shared?
| Publish | Do Not Publish |
| Unique datasets difficult to recreate | Test or pilot data |
| Data with high relevance to the scientific community | Discarded or erroneous data |
| Data that is anonymized or safely de-identified | Data with no medium- or long-term relevance |
| Complex or expensive-to-generate data | Data containing unresolvable personal identifiers |
De-identifying Sensitive Data
When human subject data is involved, sharing requires careful de-identification. This includes removing direct identifiers such as names and addresses, aggregating variables such as grouping ages into ranges, suppressing rare values, or adding noise to geographic data. These practices help balance openness with privacy protections under regulations like HIPAA in the US or GDPR in the EU.
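Three of these techniques (removing direct identifiers, grouping ages into ranges, and suppressing rare values) can be sketched in a few lines. The field names, the ten-year band width, and the minimum count of 2 are assumptions for illustration. Running code like this does not by itself make a dataset compliant with HIPAA or GDPR, and it does not cover adding noise to geographic data.

```python
from collections import Counter

# Direct identifiers named in the text; extend this set for real datasets.
DIRECT_IDENTIFIERS = {"name", "address"}


def age_band(age: int, width: int = 10) -> str:
    """Map an exact age to a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(records: list[dict], rare_field: str, min_count: int = 2) -> list[dict]:
    """Drop direct identifiers, band ages, and suppress values seen fewer than `min_count` times."""
    counts = Counter(record[rare_field] for record in records)
    cleaned = []
    for record in records:
        # Build a new dict so the original records stay untouched.
        row = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        row["age_band"] = age_band(row.pop("age"))
        if counts[record[rare_field]] < min_count:
            row[rare_field] = "suppressed"
        cleaned.append(row)
    return cleaned
```

For example, a record with age 34 and a value that appears only once in `rare_field` would come back with `age_band` set to `30-39` and that value replaced by `suppressed`.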
Choosing a Data Repository
Researchers can choose a general-purpose data repository or select a topic-specific repository. Common repositories include:
- Zenodo: free, general-purpose, supported by CERN
- Dryad: popular in the life and environmental sciences
- Figshare: supports a wide variety of file types
- Harvard Dataverse: widely used in the social sciences and humanities
- Domain-specific repositories: such as NCBI (genomics), ICPSR (social science), or UK Data Archive
Licenses and Persistent Identifiers
Researchers should choose an appropriate license, such as a Creative Commons license for data or the MIT license for accompanying code, to specify how their outputs can be reused. As a rule of thumb, select a license that is as open as possible and imposes minimal restrictions. Data publications should be assigned a persistent identifier such as a DOI to enhance long-term findability and prevent isolation, in line with the FAIR principles.
Practical Implementation: Getting Started
Shifting to digital research data management can seem like a daunting task, but it doesn’t need to be. Many organizations and institutions provide research data management support, mostly through research services offices or research librarians.
A Realistic Starting Sequence
Before the project:
- Review funder and institutional data policies
- Seek ethics approval if required
- Draft your DMP
- Set up your folder structure and file naming convention
During the project:
- Apply naming conventions consistently
- Maintain metadata and documentation as you go
- Back up data regularly using the 3-2-1 rule
- Use version control for code and evolving data files
At the end of the project:
- Prepare data and metadata for publication
- Convert files to open formats
- Choose a repository and deposit your data
- Assign a DOI or other persistent identifier
- Update your DMP to reflect what was actually done
Choosing a Digital Tool: Common Approaches
| Approach | Pros | Cons |
| Paper notebooks | Zero cost, no setup | Extremely difficult to find, share, or back up data |
| Shared server folders | Familiar, low cost | Control is easily lost with multiple users; hard to search |
| Cloud storage (generic) | Accessible, low cost | Limited structure; depends on individuals to follow conventions |
| Electronic Lab Notebook (ELN) | Structured, searchable, version-controlled | Takes time to set up; may have licensing costs |
Communicating Change Within Your Lab
Communication is key when changing existing practices within the lab. Make sure everyone understands why these changes are happening. This could mean involving lab members in decision making, listening to feedback from those who handle data day-to-day, and providing context and background on why switching to digital data management is important. If only some lab members follow the new practices, the benefits will be drastically reduced.
Continuous Self-Monitoring
Continuous self-monitoring involves setting regular intervals at which to evaluate progress within and beyond projects in order to spot potential problems and ascertain the overall effectiveness of the work. By performing such self-monitoring, researchers can potentially mitigate the risk of getting lost in detail and better visualize goals.
Regular review allows mapping of progress against strategies, schedules, budgets, and other metrics to keep work on track and allow for corrective actions. This can help tackle issues that might be minor today but can compound over time and require excessive resources in the long run.
Suggested checkpoints:
- At project initiation: confirm DMP is complete and the team is aligned
- At each major milestone: review whether data practices are being followed
- Annually: review storage needs, access permissions, and policy changes
- At project close: complete repository deposit and final documentation update
Frequently Asked Questions
Do I need a Data Management Plan even if my funder doesn’t require one?
Yes. A DMP is useful regardless of funder requirements because it forces you to think through data needs proactively, prevents costly corrections later, and ensures your team is aligned. Many researchers who draft DMPs voluntarily report that the process itself surfaces problems they hadn’t anticipated.
What’s the difference between a data repository and cloud storage like Google Drive or Dropbox?
Commercial cloud storage services are designed for active file access and sharing, not for long-term preservation or discoverability. A data repository assigns your dataset a persistent identifier (such as a DOI), indexes it for search, and ensures it remains accessible according to defined standards, sometimes for decades. Cloud storage accounts can be closed, reorganized, or have their terms changed without notice.
How do I handle data that belongs to multiple collaborators or institutions?
Establish a data governance agreement at the start of the project. This should clearly state who owns the data, who has access rights, how it can be shared or published, and what happens to the data if the collaboration ends or a team member moves to another institution. This is particularly important when collaborating across countries with different legal frameworks.
My lab generates very large datasets (terabytes of imaging or sequencing data). Does RDM still apply?
RDM applies especially to large datasets, which are harder to manage retroactively. For high-volume data, it is worth planning storage infrastructure and costs explicitly in your DMP. Some funders allow research data storage costs to be included in grant budgets. Domain-specific repositories often have infrastructure designed for large scientific datasets.
Is anonymized data always safe to share?
Not necessarily. Re-identification risks are real, especially with genomic data, rare disease data, or small geographic populations. Even data that has been processed to remove obvious identifiers can sometimes be cross-referenced with publicly available information to re-identify individuals. For sensitive data, consult your institution’s data protection officer or legal team before sharing, and consider whether controlled-access sharing is more appropriate.
How long should I keep my research data after a project ends?
This varies by discipline, country, funder, and data type. A common minimum is 10 years, but some funders, institutions, or regulatory frameworks require longer retention, in some cases 25 years or more. Clinical trial data often carries extended retention requirements. Check your specific funder’s policy and document the required retention period in your DMP.
What happens to research data when a PhD student or postdoc leaves the lab?
This is a common and underappreciated risk. Before a lab member departs, ensure that all data is properly documented, stored in a shared institutional location (not on a personal device), and that at least one remaining team member can access and understand it. Creating a “knowledge transfer file” or offboarding checklist is a best practice that many institutions now recommend.
Can I use AI tools to help with data documentation or organization?
AI tools can assist with tasks like generating README templates, drafting data dictionaries, or suggesting metadata fields, but the researcher remains responsible for the accuracy and completeness of all documentation. AI-generated metadata should always be reviewed against the actual data. Be cautious about uploading sensitive or confidential datasets to third-party AI services, as this may violate data governance agreements or privacy regulations.
