This deliverable is based on a report that contains an inventory of DNA barcoding projects across Europe, as well as of repositories containing DNA barcode reference libraries, and identifies their geographical, ecological, and taxonomic coverage. Such libraries (sensu Weigand et al. 2019), containing reliable and taxonomically curated DNA sequences, are essential for biodiversity monitoring, as they allow ecological and biogeographical information on species to be used in a comparable manner.
The report includes a brief description of the methodologies used to perform the inventory. To identify DNA barcoding projects and DNA reference libraries of European aquatic biodiversity, the main approach was a survey targeting peers who are, or have been, involved in relevant projects in Europe. Additionally, an experimental approach was investigated: developing a large language model (LLM) to automatically process the content of publications. The collected content was explored using a list of search criteria agreed upon by the consortium members, which allows gaps and overlaps in the (meta)data to be analysed.
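As a purely illustrative aid, the gap and overlap analysis mentioned above can be sketched in a few lines of Python. This is a minimal sketch and not the procedure used in the report: the function name `coverage_gaps_and_overlaps`, the library names, and the example taxa are all hypothetical, and it assumes each library's coverage can be reduced to a set of taxon names compared against a target checklist.

```python
from collections import defaultdict


def coverage_gaps_and_overlaps(checklist, libraries):
    """Compare library coverage against a target checklist (hypothetical helper).

    checklist: set of taxon names that should be covered.
    libraries: dict mapping a library name to the set of taxon names it covers.
    Returns (gaps, overlaps): taxa covered by no library, and a dict of taxa
    covered by more than one library together with the libraries holding them.
    """
    holders = defaultdict(set)
    for library, taxa in libraries.items():
        # Only taxa on the checklist are relevant to the comparison.
        for taxon in taxa & checklist:
            holders[taxon].add(library)

    gaps = checklist - set(holders)
    overlaps = {taxon: libs for taxon, libs in holders.items() if len(libs) > 1}
    return gaps, overlaps


# Illustrative input only; these are not data from the inventory.
checklist = {"Salmo trutta", "Gammarus pulex", "Baetis rhodani"}
libraries = {
    "library_A": {"Salmo trutta", "Gammarus pulex"},
    "library_B": {"Salmo trutta", "Homo sapiens"},
}
gaps, overlaps = coverage_gaps_and_overlaps(checklist, libraries)
```

Under these assumptions, `gaps` lists checklist taxa without any reference sequences, while `overlaps` shows where several libraries cover the same taxon. The same comparison could in principle be applied to geographical or ecological categories instead of taxon names.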
The survey was intended to gather information directly from key actors in the development of DNA reference libraries of European aquatic biodiversity, increasing the likelihood of acquiring data from current projects that have yet to publish or deposit any information. However, its outcome is biased, as it depended largely on which peers responded to the survey. The experimental LLM-based approach aimed to provide an automated method for reviewing available publications, allowing for a more time-efficient data collection process. This approach also has limitations, resulting from the content of the training dataset and the set of questions on which the LLM was based. Combining the two approaches facilitates a more comprehensive review of the existing libraries.
