The British Oceanographic Data Centre (BODC) has established a new, robust workflow for archiving and publishing high-volume data at the Centre for Environmental Data Analysis (CEDA). CEDA is one of the NERC Environmental Data Centres and is the national data centre for atmospheric and earth observation research. Because these data types are generally large in volume, CEDA’s archive is hosted on JASMIN’s large data store. JASMIN provides data analysis facilities to support data-intensive environmental science, so that scientists can work collaboratively with large datasets.

Under the NERC Big Data Store initiative, all of the NERC data centres have been given space on CEDA’s archive so that they can archive high-volume datasets they wouldn’t typically be able to handle. High volume, in this case, means over 500 GB for a single dataset.

The new workflow enables metadata to be captured and curated in our data submission app, which then generates an accession number, the unique identifier used in our archive system. The submitter can then upload their files to CEDA’s arrivals area, where a BODC data manager can access and check them and ultimately archive them under BODC’s accession number. Once the files are archived, BODC publishes a digital object identifier (DOI) for the dataset using the metadata entered in the data submission app. The BODC data manager can then run a script within CEDA’s systems that references the DOI; the script pulls the metadata published with the DOI and automatically creates a landing page in CEDA’s catalogue system.
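For readers who think in code, the order of the steps above can be sketched as a small state model. This is only an illustration of the sequence described in this article, not BODC’s or CEDA’s actual software: the `Stage` and `Submission` names, the accession number format and the example DOI are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """Workflow steps in the order described in the article (names are hypothetical)."""
    METADATA_CAPTURED = 1     # metadata curated in the data submission app
    FILES_UPLOADED = 2        # submitter uploads files to CEDA's arrivals area
    FILES_CHECKED = 3         # BODC data manager checks the files
    ARCHIVED = 4              # files archived under the accession number
    DOI_PUBLISHED = 5         # DOI published from the captured metadata
    LANDING_PAGE_CREATED = 6  # catalogue landing page built from the DOI metadata


@dataclass
class Submission:
    accession_number: str     # hypothetical format; the real one is not given here
    title: str
    stage: Stage = Stage.METADATA_CAPTURED
    doi: Optional[str] = None

    def advance(self, doi: Optional[str] = None) -> Stage:
        """Move to the next step; publishing requires a DOI to be supplied."""
        if self.stage is Stage.LANDING_PAGE_CREATED:
            raise ValueError("workflow already complete")
        next_stage = Stage(self.stage + 1)
        if next_stage is Stage.DOI_PUBLISHED:
            if not doi:
                raise ValueError("a DOI is required before publication")
            self.doi = doi
        self.stage = next_stage
        return self.stage
```

The point of modelling it this way is that no step can be skipped: a landing page can only be created after a DOI exists, and a DOI can only be published after the files have been archived.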

NERC is pushing to archive and publish key model outputs generated under NERC-funded projects, due to their increasing importance in the science community. We are very pleased to be able to publish these datasets, which are typically terabytes in size, with DOIs. Using DOIs ensures that the data creator is properly credited and makes the data findable for users and citable in research papers.

Over the last year, BODC has archived 37 datasets at CEDA, adding up to over 196 TB of data. The data range from model output to seismic data and LiDAR scans. Among the datasets are the ORCHESTRA NEMO model runs of the Southern Ocean (https://catalogue.ceda.ac.uk/uuid/5297667993904b12b0f8cbd8400ab56b). There has been quite a bit of trial and error while getting the workflow up and running, but it has put BODC one step closer to automating data delivery from data submission to publication.