Creating shared tools and user interfaces allows data providers from across all environmental areas to use a single tool to deposit large datasets. Some of these datasets were previously impossible to archive because the storage capacity at the individual data centres simply wasn't big enough.
Sharing infrastructure for large datasets delivers economies of scale in cost and energy usage.
Satellites and climate models are renowned for their ability to produce petabytes of data and quickly fill up storage systems. The Centre for Environmental Data Analysis (CEDA), the atmospheric and earth observation component of the NERC Environmental Data Service (EDS), hosts over 20 petabytes of data and handles the largest datasets of any EDS data centre. However, big data is not just a problem for the CEDA Archive; other data centres within the EDS are increasingly being asked to archive large datasets that simply do not fit into their existing data storage systems.
To help address this problem, the NERC EDS commissioned an integration activity to develop a process whereby all data centres could use the storage facilities used by the CEDA Archive to store their big data. The CEDA Archive storage facilities are hosted on JASMIN - a globally unique data analysis facility for environmental science. It has tens of petabytes of storage that supports a range of scientific workflows - including providing the infrastructure for the long-term data archive function of CEDA.
The new process would mean that CEDA’s storage resources would be shared across the EDS, allowing larger datasets to be archived. This wasn’t previously possible within the other data centres' existing infrastructures. Sharing JASMIN’s infrastructure for large EDS datasets delivers economies of scale in cost and energy usage. This approach also means that future additional storage will only need to be added to the JASMIN infrastructure, rather than to the storage systems of all five individual data centres.
In 2020, CEDA adapted its data arrivals service to accommodate data from any of the other centres. This tool uses a point-and-click web interface through which any researcher can upload large datasets for submission to a different EDS data centre. Previously, researchers who wished to deposit large datasets at other data centres were unable to do so because the storage capacity simply wasn’t big enough. With this new process, large datasets can be archived at CEDA but catalogued at the relevant data centre.
The tool is currently in beta testing, with a small number of datasets stored at CEDA but catalogued in the National Geoscience Data Centre. Regular meetings are held with EDS data centre representatives to ensure that the new service is suitable for all use cases, and to allow improvements to be suggested and actioned. Future work will involve further testing of the process to ingest and store data from all the other data centres within the NERC EDS.