Bridging the gap between Environmental Data and AI

Machine Learning is transforming environmental science, but data fragmentation remains a major barrier. We are redesigning how we deliver data to empower the next generation of AI-driven insights.

We're Listening

Today, users face a complex landscape: too many different formats, disparate discovery portals, and legacy delivery systems. Crucially, none of the current formats were specifically designed with Machine Learning (ML) workflows in mind.

For an ML engineer, simply finding and loading a dataset into a training pipeline can take 80% of their project time. We need to move beyond "data for humans" to "data for algorithms."

Some barriers include:

  • Too many formats
  • Hard to discover
  • No ML-native support

We have been reading your feedback, conducting surveys and holding targeted interviews.
You’ve told us about your frustrations and we have been exploring ways to address them.

One potential user noted that column names (labels) are inconsistent: even simple fields such as "First Name" vs. "Surname" vary across thousands of datasets, making automated integration impossible without human intervention. They added that "AI was going to solve the problem, but it doesn't pull back the data with a sufficient level of confidence."

Another explained that their proprietary model formats lack the metadata needed for AI comparison. Their team is therefore pivoting toward Zarr and GeoParquet because, in their view, standard NetCDF files replicate metadata in every chunk, making them "inefficient" for training large ML models.

These comments, along with many others, have motivated us to prioritise ‘Standardisation & Interoperability’ as our foundation for AI.

What's available now and coming soon

Live and Ready

  • Python notebooks for common data tasks
  • Croissant metadata files (limited implementation)
  • Plain-English data descriptions (we are always looking to improve this)

In Testing

  • Croissant metadata automation
  • AI chatbot for data discovery
  • Multi-domain integration workflow

Longer term

  • Enhanced NERC vocabularies
  • Fully integrated NERC data catalogue with Croissant metadata files
  • AI workflows embedded in external-facing agentic chatbot

Why Croissant?

We are reviewing the EDS roadmap. The general aim is to make more high-value data available in accessible, AI-ready formats, so that innovative emerging technologies can use it to help solve societal challenges within and beyond the environmental sector.

Official Croissant Guidance

Standardising Discovery with Croissant

Our immediate priority is providing a single, machine-readable way to understand what EDS data is available for ML. We have adopted Croissant, a high-level format that bridges the gap between scientific archives and ML frameworks.

It allows tools like TensorFlow, PyTorch, and Hugging Face to load our data seamlessly by providing standardised metadata about dataset structures, variables, and licensing.
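
To give a feel for what that standardised metadata contains, here is a minimal sketch that reads a Croissant JSON-LD document using only the Python standard library. The embedded document is a hypothetical, heavily trimmed example (real Croissant files also carry an `@context` and further properties), and `summarise_croissant` is an illustrative helper, not part of any EDS or Croissant tooling.

```python
import json

# Hypothetical, heavily trimmed Croissant document, for illustration only.
CROISSANT_JSONLD = """
{
  "@type": "sc:Dataset",
  "name": "example_sea_surface_temperature",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "@id": "sst.csv",
      "contentUrl": "https://example.org/data/sst.csv",
      "encodingFormat": "text/csv"
    }
  ],
  "recordSet": [
    {
      "@type": "cr:RecordSet",
      "name": "observations",
      "field": [
        {"@type": "cr:Field", "name": "time", "dataType": "sc:Date"},
        {"@type": "cr:Field", "name": "sst", "dataType": "sc:Float"}
      ]
    }
  ]
}
"""


def summarise_croissant(doc):
    """Return the name, licence, file URLs and per-record-set fields."""
    return {
        "name": doc.get("name"),
        "license": doc.get("license"),
        # Each distribution entry describes one downloadable resource.
        "files": [d.get("contentUrl") for d in doc.get("distribution", [])],
        # Each record set lists its fields (variables) and their data types.
        "record_sets": {
            rs["name"]: {f["name"]: f.get("dataType") for f in rs.get("field", [])}
            for rs in doc.get("recordSet", [])
        },
    }


summary = summarise_croissant(json.loads(CROISSANT_JSONLD))
print(summary["record_sets"])
```

In a real pipeline, a Croissant-aware library would usually take the place of this hand-rolled parsing and would also stream the records themselves. The sketch only shows that structure, variables and licensing can all be read from one file.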

Variations in workflow

While we move toward a unified discovery layer, our data remains hosted across specialised centres, using established protocols optimised for their respective domains. Understanding these variations is key to accessing the underlying raw data.

Tabular & Gridded Datasets (ERDDAP)
A scientific data server that provides a simple, consistent way to download subsets of datasets in common file formats. ERDDAP supports precise subsetting and conversion between NetCDF, CSV, and JSON.
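
As a rough sketch of what a subsetting request looks like, the snippet below builds an ERDDAP `tabledap` URL: the file extension selects the output format, the first query element lists the variables to return, and each later element is a row constraint. The server address, dataset ID and variable names are hypothetical placeholders, and `erddap_tabledap_url` is an illustrative helper rather than an official client.

```python
from urllib.parse import quote


def erddap_tabledap_url(server, dataset_id, variables, constraints, file_type="csv"):
    """Build a tabledap request URL: variables to return plus row filters."""
    # Percent-encode each constraint, keeping '=' and ':' readable,
    # so that 'time>=...' becomes 'time%3E=...'.
    encoded = [quote(c, safe="=:") for c in constraints]
    query = "&".join([",".join(variables)] + encoded)
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{file_type}?{query}"


url = erddap_tabledap_url(
    server="https://example.org/erddap",   # hypothetical endpoint
    dataset_id="example_mooring",          # hypothetical dataset ID
    variables=["time", "latitude", "longitude", "sea_water_temperature"],
    constraints=["time>=2024-01-01T00:00:00Z", "time<=2024-01-31T23:59:59Z"],
)
print(url)
```

Changing `file_type` to `json` or `nc` requests the same subset in a different format. Gridded datasets are served through `griddap`, which uses a different, index-based query syntax not covered here.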
Satellite & Model Gridded Data (STAC)
Used for spatio-temporal datasets, such as satellite data from the Sentinel missions or gridded climate model data. STAC presents metadata in a standardised way that can be programmatically queried and searched.
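
To illustrate what "programmatically queried" can mean here, this minimal sketch assembles the JSON body for a STAC API item search, which is normally sent as a POST request to the API's `/search` endpoint. The collection ID and bounding box are hypothetical, and `stac_search_body` is an illustrative helper. No request is actually sent.

```python
import json


def stac_search_body(collections, bbox, start, end, limit=10):
    """Build the JSON body for a STAC API item search (POST <api root>/search)."""
    return {
        "collections": list(collections),
        "bbox": list(bbox),             # [west, south, east, north] in WGS 84
        "datetime": f"{start}/{end}",   # closed RFC 3339 interval
        "limit": limit,                 # page size requested from the server
    }


body = stac_search_body(
    collections=["example-sentinel-2"],  # hypothetical collection ID
    bbox=[-6.0, 49.9, 1.8, 55.8],
    start="2024-06-01T00:00:00Z",
    end="2024-06-30T23:59:59Z",
)
print(json.dumps(body, indent=2))
```

The response to such a search is a GeoJSON FeatureCollection of items, each linking to its underlying assets. A dedicated STAC client library would normally handle paging through the results.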
Feature & Vector Services (OGC APIs)
Modern geospatial APIs for querying specific vector features (WFS, OGC API Features) or requesting map tiles for visualisation.
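
As a sketch of a typical feature query, the snippet below builds an OGC API Features "items" URL using the standard `bbox`, `datetime` and `limit` query parameters. The base URL and collection ID are hypothetical, and `ogc_items_url` is an illustrative helper, not an EDS client.

```python
from urllib.parse import urlencode


def ogc_items_url(base_url, collection_id, bbox=None, datetime_range=None, limit=10):
    """Build a GET URL for the items of one OGC API Features collection."""
    params = {"limit": limit}
    if bbox is not None:
        # bbox is minx,miny,maxx,maxy (longitude/latitude by default).
        params["bbox"] = ",".join(str(v) for v in bbox)
    if datetime_range is not None:
        params["datetime"] = datetime_range
    # Keep commas, colons and slashes readable in the query string.
    query = urlencode(params, safe=",:/")
    return f"{base_url.rstrip('/')}/collections/{collection_id}/items?{query}"


url = ogc_items_url(
    base_url="https://example.org/api",   # hypothetical service root
    collection_id="example_earthquakes",  # hypothetical collection ID
    bbox=[-3.5, 55.8, -3.0, 56.1],
    limit=5,
)
print(url)
```

Servers usually return the matching features as GeoJSON, which loads directly into common geospatial Python tooling.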

Python Implementation Examples

Jumpstart your development with our curated collection of Jupyter Notebooks, demonstrating real-world access patterns for each service.

Notebook 1: ERDDAP Access
Learn how to programmatically subset and download ocean sensor data.
Notebook 2: STAC Discovery
Search and visualise cloud-native Zarr files from the ESA CCI collection using both STAC and Croissant.
Notebook 3: OGC API Patterns
Query vector features and integrate map tiles into a web application.
Notebook 4: UK Historical Earthquake Catalogue
Access the UK historical earthquake catalogue from an OGC API with Croissant abstractions.
Notebook 5: Holistic Windfarm Siting
A complex use case combining data from ERDDAP, STAC, and OGC APIs. Integrates with third-party APIs provided by the Crown Estate.
Notebook 6: Blue-Green Spaces
Using data from the NERC data centres, compare two locations and find the most suitable area for the potential development of blue/green spaces.

A Unified Environmental Data Service

Our vision is a truly uniform way to discover and understand UKRI environmental data. While Croissant is our bridge today, the future involves deeper structural changes to simplify the user journey.

Federated Single Instances

Consolidating existing centre-specific services into unified EDS instances (e.g., a single ERDDAP endpoint) to reduce complexity.

High-Value Standardisation

Migrating our highest value datasets to a smaller set of modern, high-performance underlying formats.