Research Data Management: The Unsexy Problem Nobody's Solving
Research data management is the unglamorous infrastructure problem that Australian institutions acknowledge without adequately addressing. Researchers generate petabytes of data annually from experiments, observations, and simulations. Much of that data is poorly organized, inadequately documented, and effectively lost to future research despite the ongoing cost of storing it.
The scale is substantial. A single genomics lab might generate 50 terabytes annually from sequencing machines. Astronomical observatories produce terabytes nightly. Climate models generate petabytes from simulation runs. Environmental monitoring networks stream continuous data from thousands of sensors. The accumulation is relentless and accelerating.
Storage infrastructure struggles to keep pace. Universities provide researchers with limited server space, typically measured in terabytes per research group. When researchers exceed allocations, they resort to external hard drives scattered across desks, personal cloud storage accounts, or simply deleting old data to make room for new. These solutions work poorly for long-term data preservation or collaborative access.
Cloud storage offers scalability but costs accumulate. AWS, Google Cloud, and Azure charge for storage and data transfer at rates that become substantial for large research datasets. A project keeping 100 terabytes in standard cloud object storage can spend tens of thousands of dollars annually; archival tiers are far cheaper to hold but add retrieval fees and delays. Most research grants don’t budget adequately for long-term data costs, creating sustainability problems when project funding ends.
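The arithmetic behind those figures is simple enough to sketch. The per-gigabyte rates below are illustrative assumptions, roughly in line with published list prices for hot, infrequent-access, and deep-archive object storage; check current provider pricing before budgeting.

```python
# Back-of-envelope annual storage cost for a research dataset.
# Per-GB-per-month rates are ASSUMED figures approximating typical
# published list prices; real pricing varies by provider and region.
RATES_PER_GB_MONTH = {
    "standard": 0.023,       # hot object storage
    "infrequent": 0.0125,    # infrequent-access tier
    "deep_archive": 0.00099, # cold archival tier (retrieval fees extra)
}

def annual_cost_usd(terabytes: float, tier: str) -> float:
    """Annual storage cost, ignoring egress, request, and retrieval charges."""
    gigabytes = terabytes * 1000  # decimal TB, as providers bill
    return gigabytes * RATES_PER_GB_MONTH[tier] * 12

for tier in RATES_PER_GB_MONTH:
    print(f"100 TB, {tier}: ${annual_cost_usd(100, tier):,.0f}/year")
```

At these assumed rates, 100 TB runs to roughly $27,600/year hot and about $1,200/year in deep archive, which is why the tier choice, not the raw volume, often dominates the budget conversation.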
Metadata—data about data—is consistently poor. Researchers know what their data represents while actively working with it, but documentation degrades over time. A hard drive labeled “experiment 2022” recovered three years later contains cryptic filenames and undocumented file formats that even the original researcher struggles to interpret. Systematic metadata standards exist but aren’t followed consistently.
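One lightweight countermeasure to degrading documentation is writing a machine-readable metadata sidecar at the moment of data collection, while the researcher still knows what everything means. A minimal sketch follows; the field names are illustrative, not a formal metadata standard.

```python
import datetime
import hashlib
import json
import pathlib

def write_sidecar(data_file: str, description: str, units: str, contact: str) -> pathlib.Path:
    """Write a <file>.meta.json sidecar so the dataset stays interpretable later.

    Field names here are illustrative; a real project would follow a
    discipline metadata schema where one exists.
    """
    path = pathlib.Path(data_file)
    meta = {
        "file": path.name,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "created": datetime.date.today().isoformat(),
        "description": description,  # what the data actually represents
        "units": units,              # measurement units / column meanings
        "contact": contact,          # who to ask when filenames stop making sense
    }
    sidecar = path.parent / (path.name + ".meta.json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar
```

The checksum doubles as an integrity check: if the file on the hard drive labelled "experiment 2022" no longer matches its sidecar hash, the data has been modified or corrupted since documentation.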
Data sharing is rhetorically supported but practically difficult. Funding agencies and journals increasingly require data availability, but infrastructure to actually share data remains limited. Some fields have established repositories with standardized formats and robust infrastructure. Others have ad-hoc solutions where researchers email files or post them to personal websites that disappear when people change jobs.
The FAIR principles—Findable, Accessible, Interoperable, Reusable—provide ideals for research data management. Australian institutions acknowledge these principles in policy documents while implementing them minimally. Making data truly FAIR requires investment in infrastructure, metadata standards, training, and ongoing curation that competes with other priorities.
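In practice, FAIR compliance is often assessed as a checklist over a dataset's metadata record. The toy self-check below makes the gap between policy and implementation concrete; the fields required per principle are an illustrative simplification, not an official rubric.

```python
# Toy FAIR self-check over a dataset metadata record. The required
# fields per principle are an ASSUMED simplification for illustration.
FAIR_CHECKS = {
    "findable":      ["identifier", "title", "keywords"],  # e.g. a DOI plus rich metadata
    "accessible":    ["access_url", "access_conditions"],  # retrievable via a standard protocol
    "interoperable": ["format", "metadata_schema"],        # open formats, shared vocabularies
    "reusable":      ["license", "provenance"],            # clear licence and documented origin
}

def fair_report(record: dict) -> dict:
    """Return, per principle, which expected metadata fields are missing."""
    return {principle: [f for f in fields if not record.get(f)]
            for principle, fields in FAIR_CHECKS.items()}

# A hypothetical record: findable, but not yet accessible or reusable.
record = {"identifier": "doi:10.xxxx/example", "title": "Reef survey 2022",
          "keywords": ["coral"], "format": "NetCDF", "license": "CC-BY-4.0"}
print(fair_report(record))
```

Even this crude version shows why FAIR is costly: most of the missing fields (access conditions, provenance, schema) require human curation effort, not just storage.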
Several universities employ data management specialists to help researchers navigate these challenges. These services are valuable but chronically under-resourced: one data librarian supporting hundreds of researchers across multiple disciplines can provide guidance but not hands-on help for everyone who needs it.
Discipline-specific repositories work well where they exist. Genomics has GenBank and related databases. Structural biology has the Protein Data Bank. Astronomy has established archives for telescope data. These repositories serve critical functions for their communities, but most research fields lack equivalent infrastructure.
The reproducibility crisis connects directly to data management failures. When published research can’t be reproduced, missing or poorly documented data is often the problem. Other researchers can’t access the original data or can’t understand how to use it properly. Better data management would improve reproducibility substantially.
Sensitive data creates additional complexity. Medical research, Indigenous knowledge, commercially relevant data, and national security-related information all require access controls beyond open sharing. Managing permissions, ensuring security, and maintaining privacy while enabling research use requires sophisticated systems that many institutions lack.
Legacy data poses particular challenges. Research projects from 5-10 years ago often used file formats or software that are now obsolete. Migrating old data to current formats requires effort and expertise that few institutions commit to systematically. The result is accumulating digital artifacts that are theoretically preserved but practically unusable.
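A practical first step with legacy holdings is an inventory: what file formats exist, and which are at risk of becoming unreadable. The sketch below counts files by extension; the set of at-risk extensions is an illustrative example, not an authoritative format registry.

```python
import pathlib
from collections import Counter

# Extensions treated as obsolescence risks here are ILLUSTRATIVE
# examples only; a real triage would consult a format registry.
AT_RISK = {".xls", ".mat", ".sav", ".wpd", ".dbf"}

def format_inventory(root: str):
    """Count files under `root` by extension and flag those in the at-risk set.

    Returns (counts, flagged): all extension counts, and the subset
    that should be prioritised for migration to open formats.
    """
    counts = Counter(p.suffix.lower()
                     for p in pathlib.Path(root).rglob("*") if p.is_file())
    flagged = {ext: n for ext, n in counts.items() if ext in AT_RISK}
    return counts, flagged
```

An inventory like this turns "migrate the old data" from an open-ended chore into a ranked list, which is usually the only way the work gets resourced.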
The Research Data Storage Infrastructure project aimed to address some of these issues nationally. The initiative provided storage infrastructure and coordination across institutions. But funding limitations mean the infrastructure serves only a fraction of Australian research data management needs. Many researchers still rely on improvised solutions.
Personal incentives don’t favor good data management. Researchers are rewarded for publications, not for well-organized datasets. Time spent documenting and organizing data is time not spent writing papers or grants. The rational career calculation often deprioritizes data management, even when researchers recognize its importance.
Some research areas handle data management better than others. Large collaborations like particle physics experiments or astronomical surveys develop sophisticated data management from necessity—without it, the collaborations couldn’t function. Smaller research groups working independently often lack resources and expertise for robust data management.
Industry partnerships sometimes improve data management because companies require it. Researchers working with commercial partners must document data properly to be useful beyond academic publication. Those requirements create overhead but also instill discipline that benefits the research.
Machine learning and AI research face particular data management challenges. Training datasets can be enormous, and model reproducibility requires preserving not just final models but training data, intermediate checkpoints, and version control. The infrastructure needs are substantial and growing rapidly.
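Pinning a training run usually means recording a fingerprint of the training data alongside the configuration, so a checkpoint can later be matched to exactly what produced it. A minimal content-hashing sketch, with all names being illustrative choices rather than any standard tool:

```python
import hashlib
import json
import pathlib

def dataset_fingerprint(data_dir: str) -> str:
    """Hash every file's relative path and contents into one stable digest."""
    h = hashlib.sha256()
    for p in sorted(pathlib.Path(data_dir).rglob("*")):
        if p.is_file():
            h.update(str(p.relative_to(data_dir)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()

def record_run(data_dir: str, config: dict, out: str = "run_manifest.json") -> dict:
    """Pin the training data and config for a run so it can be verified later.

    A fuller manifest would also record code version and checkpoint
    hashes; this sketch covers only data and configuration.
    """
    manifest = {"data_sha256": dataset_fingerprint(data_dir), "config": config}
    pathlib.Path(out).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Hashing whole datasets this way is only feasible at modest scale; for the enormous corpora mentioned above, per-shard digests or dedicated data-versioning tools play the same role.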
International data sharing creates jurisdictional complications. Data privacy laws differ across countries, affecting what can be shared internationally. Australian researchers collaborating globally must navigate Australian privacy law, overseas regulations, and institutional policies that don’t always align coherently.
The cost-benefit calculation for data preservation is unclear. Storing all research data indefinitely would be extremely expensive. But deciding what data deserves preservation and what can be discarded is difficult without knowing what future researchers might find valuable. The tendency is to save everything, which is unsustainable, or to delete aggressively, which loses potentially valuable information.
Several Australian institutions are developing research data management policies with teeth—requiring data management plans for all research grants and refusing ethics approval without proper data plans. Whether these policies improve practice or just generate more paperwork remains to be seen.
Cloud-native research is emerging as a potential solution. Conducting research entirely in cloud environments with integrated version control, documentation, and archiving might solve several data management problems. But not all research fits cloud workflows, and cloud dependency creates its own concerns around vendor lock-in and long-term access.
For 2026, data management challenges will intensify as data generation continues accelerating faster than infrastructure development. Incremental improvements are happening—better tools, more awareness, some additional funding. But the gap between needs and capabilities is probably widening rather than closing.
Research data represents enormous investment of public funding and researcher effort. Losing or degrading that data through poor management is waste that serves nobody. Solving this requires treating data infrastructure as seriously as physical research infrastructure, with corresponding investment and professional management. Current approaches fall short of that standard.