DNA metabarcoding of forensic mycological samples

DNA metabarcoding and massive parallel sequencing are valuable molecular tools for the characterization of environmental samples. In forensic sciences, the analysis of the sample’s fungal population can be highly informative for the estimation of post-mortem interval, the ascertainment of deposition time, the identification of the cause of death, or the location of buried corpses. Unfortunately, metabarcoding data analysis often requires strong bioinformatic capabilities that are not widely available in forensic laboratories. The present paper describes the adoption of a user-friendly cloud-based application for the identification of fungi in typical forensic samples. The samples have also been analyzed through the QIIME pipeline, obtaining a relevant data concordance on top genus classification results (88%). The availability of a user-friendly application that can be run without command line activities will increase the popularity of metabarcoding fungal analysis in forensic samples.


Background
The analysis of biological components in a specific place can be crucial for the determination of the events that took place at the crime scene. In particular, the identification of non-human biological traces can better clarify transfer events, storage conditions, and the environmental origins of items (Iyengar and Hadi 2014). Environmental samples are characterized by the presence of different biological species that can be identified by DNA sequencing and analysis. The non-human DNA pool present in those samples reflects the biodiversity of a specific milieu. In particular, environmental fungi represent an important group of organisms strictly correlated with decomposition and that require specific living conditions. Specific fungal populations characterize welldefined ecological niches and can be informative about the conservation conditions of items that can be found at the crime scene (Dash and Das 2018). In addition, forensic analysis of the fungal population can be highly informative for the identification of human activities connected to specific jobs, the estimation of postmortem interval, the ascertainment of deposition time, the identification of the cause of death, and the location of buried corpses (Hawksworth and Wiltshire 2011;Di Piazza et al. 2018). The identification of mycological profiles has been presented also as a highly informative tool and probative evidence in crime reconstruction with real caseworks (Wiltshire and Hawksworth 2014).
In the last few years, the analysis of biological matrices through metabarcoding and high-throughput sequencing has changed classical approaches to biodiversity estimation. New protocols have been developed in basic science fields, and they are now becoming available to applied biomedical areas like public health and forensic science (Giampaoli et al. 2017;Littlefair and Clare 2016;Young et al. 2017;Bianchi et al. 2018). The efficacy of metabarcoding analysis is strictly correlated with the availability of DNA reference libraries and robust bioinformatic tools (Taberlet et al. 2012). For prokaryotic analysis, the sequencing of the 16S ribosomal RNA gene (rRNA) is considered a valuable approach unless the researchers want to study the intraspecific differences. Several 16S rRNA gene databases (like Greengenes, SILVA, RDP) are available and incorporated in software packages for bacteria metabarcoding (DeSantis et al. 2007;Glöckner et al. 2017;Cole et al. 2014).
For eukaryotes, other databases have been suggested, in general trying to focus on the most informative DNA sequence regions for each kingdom. For Fungi, a specific database, named UNITE (http://unite.ut.ee), is based on the internal transcribed spacer (ITS) region and is available and ready to be used with several software packages like QIIME or mothur (Kõljalg et al. 2013). Unfortunately, QIIME and mothur are software packages for bioinformatics data processing that require informatic expertise and experience in command line operations (Oulas et al. 2015). This constrain strongly reduces the expansion of fungal metabarcoding in applied medical fields where bioinformatic resources are not easily available.
Recently, certain cloud pipelines have been generated (e.g., PipeCraft) that try to meet higher flexibility and ease-of-use standards (Anslan et al. 2017). Nevertheless, additional simplification of workflows is required for next-generation sequencing data analysis, and userfriendly applications are strongly desirable in order to support standardization and incorporation of fungal metabarcoding into routine protocols.
The present paper describes a new user-friendly application for the analysis of fungal metabarcoding data. This application is the adaptation of a previous cloudbased prokaryotic tool and can represent a useful and rapid instrument for the analysis of fungal biodiversity in forensic samples. The aim of the paper is not to perform a validation of a bioinformatic algorithm/pipeline, but the description of the versatility of existing informatic tools with a specific application with the UNITE database. Hopefully, the informatic steps adopted for the generation of the new application could be implemented with other curated sequence reference DNA datasets, like those for plants, protozoa, or animals.

Samples and DNA extraction
A total of 16 samples were considered for this study: four food samples with mold, eleven environmental (indoor and outdoor) samples, and one commercially available fungal sample, brewer's yeast, as a positive control (Table 1). Samples 7, 12, 18, and 22 are from a kitchen in an apartment. Samples 11, 34, 35, 53, and 54 are from different rooms in a country house. Sample 29 is from an office building. Sample 51 is from a carpentry building. Samples 23, 26, 30, and 31 are from outdoor environments. All samples were collected using ForensiX swabs (Prionics AG, Zurich, Switzerland) with the exception of spoiled bread, brewer's yeast, and soil specimens that were directly processed starting from tiny amounts (less than 250 mg).
These samples have been chosen mainly as an example of items characterizing specific milieus and containing fungi. Vegetables with mold are largely present in a specific working area (food storages, agriculture, kitchens), while other swabs or items can characterize specific indoor environments of possible crime scenes. They are not strictly correlated to human death, but mainly to environmental characterization.
DNA was purified according to Giampaoli et al. (2012) with minor modifications: in particular, after physical disaggregation of the samples, an incubation (30 min at 30°C) with Lyticase (Sigma-Aldrich, St. Louis, MO, USA) at a final concentration of 2.5 U/μl was performed.

Library preparation and NGS sequencing
NGS samples were prepared according to the "16S Metagenomic Sequencing Library Preparation" guide (Part# 15044223 rev. A; Illumina, San Diego, CA, USA), with slight modifications. In particular, the following primers for fungal ITS region 1 (ITS1) and containing Illumina overhang adapters were used: ITS1-F (Gardes and Bruns 1993) and ITS2 (Jumpponen and Jones 2009) (expected amplicon size 320 bp) ( Table 2). A limited number of samples were amplified also with primers ITS3 and ITS4 (White et al. 1990) (expected amplicon size 520 bp) ( Table 2). The annealing condition for the amplicon PCR was identified through gradient PCR on genomic DNA of Candida albicans and agarose gel electrophoresis visualization: the annealing at 57°C was selected in order to guarantee an adequate yield while also starting from a very low amount of sample. Libraries were quantified through Qubit Fluorometric assay (Thermo Fisher Scientific, Waltham, MA, USA) and library size validated through agarose gel electrophoresis. After library normalization and pooling (according to the "16S Metagenomic Sequencing Library Preparation" guide), a 2 × 251 sequencing run was performed on a MiSeq FGx desktop sequencer (Illumina) in research use only (RUO) mode. The resulting sequencing data was demultiplexed using the bcl2fastq software (Illumina) before uploading of fastq data files to Illumina's cloud platform: BaseSpace Sequence Hub (BSSH; basespace.illumina.com).

Bioinformatics
The BSSH application (app) named "16S Metagenomics v1.0.1" performs taxonomic classification of 16s rRNA sequences using a naive Bayesian classifier as described by (Wang et al. 2007;Illumina 2014). Briefly, the naive Bayesian classifier searches different kmer lengths against the corresponding database and applies statistical tests to find significant hits. Here, the application was modified by the substitution of the default 16S rRNA database (Green-Genes) with the eukaryotic ITS database (UNITE) (Kõljalg et al. 2013;Abarenkov et al. 2010). To perform database substitution, custom scripts were used to combine the UNITE reference dynamic files (sh_taxonomy_qiime_ ver7_dynamic_31.01.2016.txt and sh_refs_qiime_ver7_dy-namic_31.01.2016.fasta; the default threshold value of the dynamic file is 98.5%) into a single tab-delimited file. The new application with the ITS database has been named "ITS Metagenomics v1.0.0." The previously uploaded fastq files were then processed by the "ITS Metagenomics v1.0.0" app.
The same dataset was also analyzed with QIIME pipeline version 1.9.1 for comparative purposes (Caporaso et al. 2010). The read data in fastq format were subjected to default quality processing and removal of singleton reads. The taxonomic classification of the qualityprocessed reads was based on the open reference clustering of sequences into operational taxonomical units (OTUs), using the UCLUST method (Edgar 2010). In the open-reference OTU picking process, reads were clustered against a reference sequence collection, and all the reads which did not hit the reference sequence collection were subsequently clustered de novo. The read clusters were BLAST assigned to taxonomies using the UNITE reference database (release 31.01.2016).

Results and discussion
In order to make available a user-friendly application for fungal metabarcoding to the forensic scientific community, the "ITS Metagenomics" tool was added to the bioinformatic apps available on the Illumina cloud platform BSSH. The fastq.gz files generated by the MiSeq FGx desktop sequencer were directly uploaded to BSSH and, after the identification of reads as paired-ends, analyzed by the "ITS Metagenomics" tool. The application generates single-sample reports and an aggregate summary. For each sample, the report file summarizes classification statistics at different taxonomic levels (kingdom, phylum, class, order, family, genus, species) and shows the relative abundance of the classification results through an interactive sunburst chart together with bar and pie charts. The report, with exception of the interactive sunburst chart, is available also in a pdf file. The application generates a csv file containing the classification results for each single sample.
The aggregate summary presents the beta-diversity analysis as pdf or html files, together with the csv files containing the classification results at all taxonomic levels: an interactive principal coordinate analysis chart (PCoA) and a hierarchical clustering dendrogram show the differences in the distribution of taxonomic classifications between samples.
The availability of report files containing beta-diversity analysis is particularly useful in forensic science. One of the main questions for forensic scientists is about the similarity of two or more samples. The PCoA chart visualizes similarities or dissimilarities of data and helps the identification of subsets of specimens, allowing the forensic scientist to speculate if two samples are strictly correlated (e.g., because they have been collected in similar environments) or not. The "ITS Metagenomics" tool can generate a PCoA chart at all taxonomic levels (kingdom, phylum, class, order, family, genus, species). The application version described in the present paper was built on the 2016 release of the UNITE database, but future updates will be scheduled. The database can support the analysis of both ITS1 and ITS2 regions. The selection of the sequencing primers for fungal metabarcoding should reflect technical limits and taxonomic needs. According to some papers, the ITS2 region should be less affected by amplification bias, but at the same time, the larger size and the variability in length can represent a limit for several sequencing platforms (Turenne et al. 1999;Op De Beeck 2014). Indeed, shorter fragments are much more suitable for metabarcoding assays. Considering that the aim of the present work was not the identification of the best barcoding sequence in fungi and that a genus-level classification can be acceptable for the definition of a microbiological signature in many environmental samples, the adoption of primers ITS1-F and ITS2 was considered. In addition, few samples were also sequenced with primers ITS3 and ITS4, obtaining comparable outputs (data not shown).
Considering that fungal contamination can be linked to item storage circumstances, microclimatic conditions, and environmental ecosystems, specific samples from typical indoor and outdoor locations were collected for this technical evaluation. In particular, samples from furniture, food, soil, indoor surfaces, and positive control were collected according to usual forensic procedures (Table 1): samples from soil and indoor or outdoor surfaces are usually collected for crime reconstruction, while food traces can be present as contaminants on clothes worn by victims or suspects. For all samples, it has been possible to generate libraries with the selected primers for the ITS1 region, and sequence files were analyzed on BSSH: the minimum number of reads passing quality filtering per sample was approximately 60,000 while the maximum was close to 2,000,000 reads (Table 3). The high variability in the number of reads was probably the consequence of library quantification through gel electrophoresis: nonetheless, the number of reads obtained appears sufficient for metagenomic surveys of the selected samples or for the estimation of predominant taxa (Ni et al. 2013). The number of species identified ranged from 208 to 900, while the number of genera ranges from 127 to 566.
The metabarcoding results showed the presence of fungal genera that typically colonize specific environments or surfaces (Fig. 1). For example, Wallemia sebi, largely found on sample 18 (spoiled bread with mold), is a xerophilic fungus commonly found on highly sugared or salted foods, such as jams, bread, or cakes, and for this reason can be considered a possible marker for places where food is stored (Zajc and Gunde-Cimerman 2018). Sample 22 (spoiled pear) was characterized by the predominance of the genus Hanseniaspora, an organism associated with fruits and blossoms of different fruit trees (Vadkertiová et al. 2012). Sample 51 (floor of a woodworking) was characterized by the predominance of the genus Aureobasidium: this fungus colonizes leaf surfaces of several trees and various wood surfaces that Table 3 Statistics of taxon identification obtained through BSSH and QIIME (in brackets). The Shannon Species Diversity Index is reported as a measure of species diversity in each sample. The QIIME analysis shows a reduced number of species and genera identified when compared to BSSH can be easily found in a woodworking area (Andrews et al. 2002;van Nieuwenhuijzen et al. 2016).
In order to compare the new tool with other available software tools, the sequence files of each sample were also analyzed with the QIIME pipeline, and the results were compared with those obtained with the BSSH "ITS Metagenomics" app. The comparison of two different bioinformatic software tools can raise concerns about computational irreproducibility and database usage (Di Tommaso et al. 2017). In this situation, the two software adopted the same database release, but differences in the classifier can lead to variations in unclassified reads and in correct assignment at the taxonomic level. While this problem cannot be easily resolved, it is reasonable to expect that two different bioinformatic tools can lead to similar results especially in a sample with a reduced number of unclassified reads.
The QIIME analysis showed a reduced number of species (ranging from 394 to 88) and genera identified (from 256 to 61; see Table 3). Figure 1 represents the classification results at the genus level for the 16 samples of the study, both through QIIME and through BSSH analysis. The genus-level classification is widely adopted for the comparison of environmental matrices; in addition, the analysis at a species level can be inaccurate in a single genetic locus approach (Thomas et al. 2012;Moussa et al. 2017;Rotimi et al. 2018). About the samples described in the present paper, when not considering the unclassified and unidentified reads, in 11 samples (69%), the most abundant genus identified was the same with both software tools. The concordance percentage was higher (88%) when considering the concordance of the first genus or at least of two genera on the three most abundant. The positive control was correctly analyzed with both bioinformatic tools: Saccharomyces was identified as the predominant genus both with BSSH (90% of assigned reads) and QIIME (99%). Only the analysis of sample 12 (green tomato with mold) and sample 35 (clothes brush with mold) showed a reduced concordance. It is important to notice that the QIIME analysis of sample 12 showed more than 80% of unclassified reads at the genus level, while with BSSH, this datum is 11.5%: this output has probably affected the correct investigation of the sample diversity. Similar considerations can be done for sample 35 where approximately 70% of reads were unclassified with both approaches. Nonetheless, a more in-depth examination of the identified genera in sample 35 underlined many concordances between the two analyses, showing the presence of Cryptococcus, Exophiala, Cyphellophora, and Cladosporium amongst the six most representative taxa in both situations (Table 4). While the presence of a large quantity of unclassified or unidentified reads does not automatically mean that the analysis is Fig. 1 Classification results at the genus levels for all 16 samples, performed through BSSH (BS) or QIIME (Q). The y-axis represents the percentage of total reads compromised, it looks clear that it enhances uncertainty, suggesting additional investigations: the exclusion from the present study of those samples with a high level of unclassified reads leads to an increased concordance between the two approaches. Sample 7 (a swab of a spoiled tomato) presented the largest percentage of unclassified reads at the genus level. The sample was collected from a vegetable largely decomposed. It is possible that the swab was highly contaminated from tomato cells: in this situation, the PCR amplification had reduced stringency due to the large presence of plant DNA: the NGS was able to clean up the datum, working only on the sequences with correspondence in the UNITE database.

Conclusions
At the moment, many different software solutions are available for NGS metabarcoding (or marker gene metagenomics) data analysis. This is a large and rapidly moving field where specific informatic programming capabilities are strongly required. Commonly used tools for metabarcoding data analysis and denoising include QIIME, Mothur, SILVAngs, MEGAN, and Amplicon-Noise (Oulas et al. 2015). QIIME was initially implemented for use of 454 pyrosequencing datasets, but it has been then modified to accept the Illumina fastq file format. Unfortunately, this open-source software package is mostly implemented with the PYTHON language, not easily accessible to current forensic scientists. For this reason, user-friendly applications are strongly desirable in order to support metabarcoding studies in forensics. The BSSH application named "ITS Metagenomics" can represent an interesting tool for rapid analysis of fungal biodiversity in several environmental samples, including forensic caseworks. When compared to the QIIME classification, the results look similar: differences can be ascribed to variances in the two informatic pipelines or caused by the large presence of unclassified reads in specific samples (Zajc and Gunde-Cimerman 2018). This latter point could be solved in the future, through the adoption of new releases of the UNITE reference database. Interestingly, the "ITS Metagenomics" app seems able to identify a larger number of taxa per sample. In our opinion, this first release of the application can represent an important starting point for data analysis in forensics, and informatic steps adopted for the implementation of UNITE database in the app can be suitable for other reference databases. In addition, its presence on the BSSH cloud is fully compatible with the implementation of automatic analysis of data generated from Illumina sequencers like the MiSeq system.
The importance of bioinformatic applications for NGS in forensic science has been recently underlined (Allwood et al. 2020a;Allwood et al. 2020b;Giampaoli et al. 2018). The analysis of biological material, even from trace amounts in dust, has clear advantages for specific forensic applications, in particular, when we need to discern sample origin or geolocation (Allwood et al. 2020a). Recently, several authors focused their studies on fungal succession during mammalian cadaver decomposition, underlining the value of performing mycological analysis with NGS technology in the burial context (Fu et al. 2019;Procopio et al. 2020). Interestingly, metabarcoding applied to diatom DNA can support forensic discrimination of drowning incidents in postmortem examinations (Liu et al. 2020). In addition, NGS metabarcoding applied also to other taxa can be useful in linking specific samples to ecological habitats (Fløjgaard et al. 2019).
As a final consideration, the correct interpretation of metabarcoding data in legal disputations should be standardized by the definition of international guidelines, developed after a large empirical interlaboratory evaluation. This point will be crucial for the diffusion of the metabarcoding NGS analysis in forensic science.