Bridging the gap between Environmental Data and AI
Machine Learning is transforming environmental science, but data fragmentation remains a major barrier. We are redesigning how we deliver data to empower the next generation of AI-driven insights.
We're Listening
Today, users face a complex landscape: too many different formats, disparate discovery portals, and legacy delivery systems. Crucially, none of the current formats were specifically designed with Machine Learning (ML) workflows in mind.
For an ML engineer, simply finding and loading a dataset into a training pipeline can take 80% of their project time. We need to move beyond "data for humans" to "data for algorithms."
Some barriers include:
- Too many formats
- Hard to discover
- No ML-native support
We have been reading your feedback, conducting surveys and holding targeted interviews.
You’ve told us about your frustrations and we have been exploring ways to address them.
One potential user noted that "column names" (labels) are inconsistent: even simple fields like "First Name" vs. "Surname" vary across thousands of datasets, making automated integration impossible without human intervention. They had hoped that "AI was going to solve the problem, but it doesn't pull back the data with a sufficient level of confidence."
Another explained that their proprietary model formats lack the metadata needed for AI comparison. They are actively pivoting their team toward Zarr and GeoParquet because, in their experience, standard NetCDF files replicate metadata in every chunk, making them "inefficient" for training large ML models.
These comments, along with many others, have motivated us to prioritise ‘Standardisation & Interoperability’ as our foundation for AI.
What's available now and coming soon
| Live and Ready | In Testing | Longer term |
Why Croissant?
We are actively reviewing the EDS roadmap. The general aim is to make more high-value data available in accessible, AI-ready formats, so that it can be used by innovative emerging technologies to help solve societal challenges within and beyond the environmental sector.
Standardising Discovery with Croissant
Our immediate priority is providing a single, machine-readable way to understand what EDS data is available for ML. We have adopted Croissant, a high-level metadata format that bridges the gap between scientific archives and ML frameworks.
It allows tools like TensorFlow, PyTorch, and Hugging Face to load our data seamlessly by providing standardised metadata about dataset structures, variables, and licensing.
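To make the idea concrete, here is a minimal sketch of what a tool sees when it reads a Croissant description. The JSON-LD below is a heavily trimmed, hypothetical record (the dataset name, URL, and field names are invented for illustration and are not a real EDS entry), and the `summarise` helper is our own. In practice a dedicated library such as mlcroissant would usually handle this step; the sketch uses only the Python standard library so that it runs anywhere.

```python
import json

# Hypothetical, trimmed Croissant (JSON-LD) description of a dataset.
# Every name and URL here is illustrative, not a real EDS record.
CROISSANT_JSONLD = """
{
  "@context": {"@vocab": "https://schema.org/",
               "sc": "https://schema.org/",
               "cr": "http://mlcommons.org/croissant/"},
  "@type": "sc:Dataset",
  "name": "example_river_temperature",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileObject", "@id": "observations.csv",
     "name": "observations.csv",
     "contentUrl": "https://example.org/data/observations.csv",
     "encodingFormat": "text/csv"}
  ],
  "recordSet": [
    {"@type": "cr:RecordSet", "@id": "observations", "name": "observations",
     "field": [
       {"@type": "cr:Field", "name": "time", "dataType": "sc:DateTime"},
       {"@type": "cr:Field", "name": "water_temp_c", "dataType": "sc:Float"}
     ]}
  ]
}
"""


def summarise(metadata: dict) -> dict:
    """Collect the parts of a Croissant record an ML pipeline needs first:
    the licence, where the files live, and which typed fields are on offer."""
    files = [f.get("contentUrl") for f in metadata.get("distribution", [])]
    record_sets = {
        rs["name"]: [(fld["name"], fld.get("dataType"))
                     for fld in rs.get("field", [])]
        for rs in metadata.get("recordSet", [])
    }
    return {
        "name": metadata.get("name"),
        "license": metadata.get("license"),
        "files": files,
        "record_sets": record_sets,
    }


summary = summarise(json.loads(CROISSANT_JSONLD))
print(summary["name"], summary["license"])
for record_set, fields in summary["record_sets"].items():
    print(record_set, fields)
```

Because the licence, file locations, and typed fields all sit in one predictable place, a training pipeline can check them without a person opening the dataset's landing page.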
Variations in workflow
While we move toward a unified discovery layer, our data remains hosted across specialised centres, using established protocols optimised for their respective domains. Understanding these variations is key to accessing the underlying raw data.
Python Implementation Examples
Jumpstart your development with our curated collection of Jupyter Notebooks, demonstrating real-world access patterns for each service.
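As a flavour of one such access pattern, the sketch below builds a request URL for an ERDDAP tabular (tabledap) service, one of the protocols used across our centres. The server address, dataset ID, and variable names are placeholders chosen for illustration, and `build_tabledap_url` is a hypothetical helper rather than part of any EDS library; substitute the values published for the dataset you need.

```python
from urllib.parse import quote


def build_tabledap_url(server, dataset_id, variables, constraints=(), fmt="csv"):
    """Build an ERDDAP tabledap request URL.

    server      -- base ERDDAP address (placeholder in the example below)
    dataset_id  -- ERDDAP dataset identifier (placeholder)
    variables   -- column names to return
    constraints -- filters such as "time>=2020-01-01T00:00:00Z"
    fmt         -- response file type, e.g. "csv"
    """
    query = ",".join(variables)
    for constraint in constraints:
        # Percent-encode characters such as ">" and ":" but keep "=" readable.
        query += "&" + quote(constraint, safe="=")
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{fmt}?{query}"


# Placeholder server, dataset, and variable names, for illustration only.
url = build_tabledap_url(
    "https://example.org/erddap/",
    "example_dataset",
    ["time", "latitude", "sea_water_temperature"],
    ["time>=2020-01-01T00:00:00Z"],
)
print(url)
```

The resulting URL can be handed to any CSV reader. With pandas, for example, `pandas.read_csv(url, skiprows=[1])` is a common choice, since ERDDAP's `.csv` responses carry a units row directly beneath the header.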
A Unified Environmental Data Service
Our vision is a truly uniform way to discover and understand UKRI environmental data. While Croissant is our bridge today, the future involves deeper structural changes to simplify the user journey.
- Federated Single Instances: Consolidating existing centre-specific services into unified EDS instances (e.g., a single ERDDAP endpoint) to reduce complexity.
- High-Value Standardisation: Migrating our highest value datasets to a smaller set of modern, high-performance underlying formats.