Project summary

Funding - £200k project via a UKRI call entitled Digital infrastructure: new approaches to skills or software.

Timeline - October 2024 to March 2025

Aim - to achieve a greater level of standardisation of data access APIs across the EDS, with a particular focus on their use in Artificial Intelligence (AI) and machine learning (ML) applications.

Core project team - Patrick Bell (project lead)

About the project

API4AI was funded through a UKRI call entitled Digital infrastructure: new approaches to skills or software. This £200k project was funded via STFC at 80% FEC and ran from October 2024 to March 2025. NOC, EIDC and NGDC were beneficiaries of the funding, with CEDA and PDC providing in-kind contributions to ensure their domains were considered during the project.

The project aimed to achieve a greater level of standardisation of data access APIs across the EDS, with a particular focus on their use in Artificial Intelligence (AI) and machine learning (ML) applications. Standardising data access APIs will reduce the effort needed by the EDS as data publishers and by environmental researchers as data consumers, saving development time and easing data integration. It also supports systematic AI analysis of multiple environmental data types, underpinning the development of predictive environmental modelling and digital twins.

As part of the project, a stakeholder engagement workshop was run to identify use cases where standardised APIs could facilitate AI/ML workflows. Twenty-five use cases were identified, including: identifying minerals from core photographs; using historical data to plan future cruise sampling strategies; developing automated methods for detecting landslides; understanding periodicity in environmental sensor networks; updating autonomous vehicles with local data; and uncovering cloud-covered satellite imagery.

Through co-design and Agile development processes, the project identified and recommended the MLCommons Croissant specification as a common standard to help ML consumers interface with data APIs, and bulk downloads, of any design. Each contributing data centre developed APIs and their accompanying Croissant descriptors. Demonstrator ML workflow notebooks were also created using the Croissant descriptors and data APIs, and these were run on different data science platforms to demonstrate portability.
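To give a flavour of the approach, below is a minimal sketch of what a Croissant descriptor for a data-access API might look like. The dataset name, endpoint URL and fields are illustrative inventions, not one of the project's actual descriptors; the vocabulary follows the MLCommons Croissant JSON-LD format.

```python
import json

# Illustrative Croissant (JSON-LD) descriptor for a hypothetical
# data-access API endpoint. The structure follows the MLCommons
# Croissant vocabulary; all names and URLs here are invented.
descriptor = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "example-sensor-readings",
    "description": "Hourly readings from an environmental sensor network.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "readings-csv",
            "name": "readings-csv",
            # The descriptor can point at an API endpoint just as it
            # would at a static file, which is what lets one ML loader
            # handle APIs and bulk downloads alike.
            "contentUrl": "https://api.example.org/v1/readings?format=csv",
            "encodingFormat": "text/csv",
        }
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "readings",
            "field": [
                {"@type": "cr:Field", "@id": "readings/timestamp",
                 "dataType": "DateTime"},
                {"@type": "cr:Field", "@id": "readings/temperature_c",
                 "dataType": "Float"},
            ],
        }
    ],
}

# Serialise to the JSON-LD document an ML consumer would fetch.
croissant_json = json.dumps(descriptor, indent=2)
```

An ML consumer reading this descriptor learns where the data lives, how it is encoded, and what fields and types to expect, without needing to know anything about the API's own design.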

Project reports 

Published outputs from API4AI 

Workshop report - https://nora.nerc.ac.uk/id/eprint/539107/ 

Final project report - https://nora.nerc.ac.uk/id/eprint/539708/ 

Project GitHub repository - https://github.com/nerc-eds/API4AI 

Next steps for API4AI 

With appropriate funding, we will: 

  • Extend the Croissant specification for API support, including authenticated access
  • Extend support for large spatial and temporal datasets through a proposed Geo-Croissant extension to the standard
  • Investigate automated creation of Croissant metadata from OpenAPI specification descriptions of data access APIs
  • Investigate and agree a shared approach within the EDS for API registration and authentication
  • Enable semantic interoperability by defining data attributes using linked data definitions, e.g. from the NERC Vocabulary Server
  • Create API endpoints or data downloads of statistics describing the distribution of labels and attributes in a dataset, indicating clustering and degree of balance; this is helpful when using data attributes in classification tasks
  • Investigate the use of Croissant in workflow platforms to record, at a granular level, how a dataset was created, processed and enriched throughout its lifecycle, enabling Responsible AI metrics to be shared
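One of the proposed next steps, deriving Croissant metadata automatically from OpenAPI descriptions, can be sketched as follows. This is a hypothetical illustration under stated assumptions, not project code: the endpoint, response schema and type mapping are invented, and a real converter would need to handle nested schemas, pagination and more OpenAPI types.

```python
# Hypothetical sketch of an OpenAPI -> Croissant mapping: read the
# 200-response schema of a GET endpoint from an OpenAPI document and
# emit matching Croissant RecordSet fields. All names are illustrative.

openapi_fragment = {
    "paths": {
        "/boreholes": {
            "get": {
                "responses": {
                    "200": {
                        "content": {
                            "application/json": {
                                "schema": {
                                    "type": "object",
                                    "properties": {
                                        "id": {"type": "string"},
                                        "depth_m": {"type": "number"},
                                        "drilled": {"type": "string",
                                                    "format": "date-time"},
                                    },
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

# Assumed mapping from OpenAPI scalar types to Croissant dataTypes.
TYPE_MAP = {"string": "Text", "number": "Float",
            "integer": "Integer", "boolean": "Boolean"}


def openapi_to_croissant_fields(spec: dict, path: str) -> list[dict]:
    """Turn one endpoint's 200-response JSON schema into Croissant fields."""
    schema = (spec["paths"][path]["get"]["responses"]["200"]
              ["content"]["application/json"]["schema"])
    record_id = path.strip("/")
    fields = []
    for name, prop in schema.get("properties", {}).items():
        # Honour string formats before falling back to the scalar map.
        if prop.get("format") == "date-time":
            data_type = "DateTime"
        else:
            data_type = TYPE_MAP.get(prop.get("type"), "Text")
        fields.append({"@type": "cr:Field",
                       "@id": f"{record_id}/{name}",
                       "dataType": data_type})
    return fields


fields = openapi_to_croissant_fields(openapi_fragment, "/boreholes")
```

Because many data centres already publish OpenAPI descriptions of their services, even a simple mapping like this could bootstrap Croissant descriptors that curators then refine by hand.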