Genotyping by amplicon sequencing for standardized genetic monitoring of forestry resources using oak and Norway spruce as examples

Overview

Genotyping methods are widely used to characterize forest resources. Applications include provenance research, identification and control procedures, estimation or measurement of genetic variation, and genetic monitoring. For genetic monitoring in particular, a method that is easy to handle and to reproduce is desirable. However, existing methods mostly rely on an indirect determination of the allelic states of co-dominant markers, and the new possibilities offered by high-throughput sequencing (HTS) are not yet being applied.

This project establishes the SSR-GBAS method, which consists of protocols for marker development, amplification and indexing of samples, and bioinformatic determination of alleles ("allele calling") that incorporates the entire sequence information.

The data are automatically transformed into a matrix and, if necessary, integrated into matrices of existing datasets. By improving the throughput while simplifying the analysis methods and developing a user-friendly software interface, this method enables extensive application for routine genetic monitoring. The method is demonstrated using oak and spruce as examples.

Method - SSR-GBAS

The working group has established the SSR-GBAS method and introduced it for scientific investigation in various systems. The method consists of protocols for marker development, amplification and indexing of samples, and bioinformatic determination of alleles ("allele calling") that incorporates the entire sequence information and automatically transforms the result into a data matrix, with the option of extending existing matrices.

SSR-GBAS is based on standardized amplicon sequencing using Illumina technology, coupled with a bioinformatics pipeline that separates the different amplicon sequences of a locus and reconstructs the genomic sequences corresponding to the alleles. The innovation of our approach compared to similar work is a reduced error rate within the group of reads used to determine the original sequence, achieved by first separating the reads by length. This makes it easier to identify the individual allele sequences by their frequency, including in heterozygous individuals. Our pipeline also simplifies demultiplexing and includes options for visual error control.
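
The idea of separating reads by length before picking sequences by frequency can be illustrated with a minimal sketch. This is not the project's pipeline: the function name `call_alleles`, the `min_fraction` threshold and the diploid limit `max_alleles` are assumptions chosen for illustration, and the sketch returns only one sequence per length class, so two alleles of equal length that differ by a substitution would not be separated here.

```python
from collections import Counter


def call_alleles(reads, min_fraction=0.2, max_alleles=2):
    """Illustrative allele call for one locus in one sample.

    reads        -- list of read sequences (strings) already assigned to the locus
    min_fraction -- assumed minimum share of reads a length class must reach
    max_alleles  -- assumed ploidy limit (2 for a diploid individual)
    """
    if not reads:
        return []

    # Step 1: fractionate the reads by length.
    length_counts = Counter(len(read) for read in reads)
    total = len(reads)

    # Step 2: keep the most frequent length classes that pass the threshold.
    kept_lengths = [
        length
        for length, count in length_counts.most_common(max_alleles)
        if count / total >= min_fraction
    ]

    # Step 3: within each kept class, take the most frequent sequence.
    alleles = []
    for length in sorted(kept_lengths):
        sequences = Counter(read for read in reads if len(read) == length)
        alleles.append(sequences.most_common(1)[0][0])
    return alleles


# Made-up reads: two true alleles plus two erroneous reads.
example_reads = (
    ["ACACACAC"] * 6 + ["ACACACACAC"] * 5 + ["ACACATAC"] + ["ACACAC"]
)
print(call_alleles(example_reads))
```

In this toy input the read with a substitution error falls into the same length class as a true allele and is outvoted there, while the rare short read forms its own length class and is discarded by the threshold.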

The markers are identified from genomic resources and amplified in pooled PCR reactions (currently tested with up to 12 loci per pool) using primers extended with direction-specific adaptors. The PCR products are then pooled and, in a second PCR, given Illumina recognition sequences and an index sequence for sample identification. Up to 1000 samples with 50 markers are pooled and sequenced as paired-end reads in a single MiSeq run, then separated by index during demultiplexing. Python scripts merge the sequences of each sample and sort them by primer. After the sequences have been separated by length, the alleles of a locus within a sample can be determined. The scripts combine the resulting alleles into a matrix and create an allele list, which can be used as input for future analyses. This enables a level of automation comparable to traditional fragment analysis, but with improved throughput. Finally, the allele list and the sequences are loaded into a database. Creating the database application as a freely accessible online resource is also part of this project.
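
The sorting by primer and the construction of the matrix and allele list described above could look roughly like the following sketch. All names (`split_by_primer`, `build_matrix`, the locus and sample labels) and the conventions used (alleles numbered from 1 in order of first appearance, `0` for missing data, homozygotes written as a repeated number) are assumptions for illustration, not the project's scripts.

```python
from collections import defaultdict


def split_by_primer(reads, primers):
    """Group reads by the locus whose forward primer they start with.

    primers maps a (hypothetical) locus name to its forward primer sequence;
    reads that match no primer are dropped.
    """
    by_locus = defaultdict(list)
    for read in reads:
        for locus, primer in primers.items():
            if read.startswith(primer):
                by_locus[locus].append(read)
                break
    return dict(by_locus)


def build_matrix(calls):
    """Turn called allele sequences into an allele list and a genotype matrix.

    calls maps sample -> locus -> list of allele sequences (one or two).
    Returns (allele_list, matrix): allele_list maps locus -> sequences, where
    position + 1 is the allele number; matrix maps sample -> locus -> numbers.
    """
    allele_list = {}
    matrix = {}
    loci = sorted({locus for per_sample in calls.values() for locus in per_sample})
    for sample in sorted(calls):
        row = {}
        for locus in loci:
            known = allele_list.setdefault(locus, [])
            numbers = []
            for sequence in calls[sample].get(locus, []):
                if sequence not in known:
                    known.append(sequence)  # new allele gets the next number
                numbers.append(known.index(sequence) + 1)
            if not numbers:
                numbers = [0, 0]  # assumed code for missing data
            elif len(numbers) == 1:
                numbers = numbers * 2  # homozygote
            row[locus] = tuple(sorted(numbers))
        matrix[sample] = row
    return allele_list, matrix


example_calls = {
    "oak_01": {"L1": ["ACAC", "ACACAC"]},
    "oak_02": {"L1": ["ACACAC"], "L2": ["GTGT"]},
}
example_alleles, example_matrix = build_matrix(example_calls)
```

Because each allele number refers to a full sequence in the allele list rather than to a fragment length, the matrix can be read like a classical fragment-analysis table while the sequence information stays available.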

Database Application

The project combines the automation of the allele call from high-throughput applications with the creation of a database to demonstrate the possibilities for future applications. Marker sets and data matrices can be applied directly and are made available to the forestry community as an online resource. The software and deposited database are integrated into a webpage, and tutorials and general explanations of the application are provided.

A centralized database makes it possible to compare results between working groups. It also facilitates the verification of individual samples and their integration into existing studies. As a significant innovation, the database solution allows new alleles found for existing markers to be added, as well as entirely new markers. By automating allele calling and integrating the results into a data matrix, a system can be established that allows for continuous genotyping-based monitoring in the future.
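
How a central store can accept new alleles and new markers without renumbering existing ones can be sketched with Python's built-in `sqlite3` module. The table layout and the function names `open_db` and `allele_id` are hypothetical and say nothing about how the project's database is actually built.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a small allele store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alleles ("
        " locus TEXT NOT NULL,"
        " allele_id INTEGER NOT NULL,"
        " sequence TEXT NOT NULL,"
        " PRIMARY KEY (locus, allele_id),"
        " UNIQUE (locus, sequence))"
    )
    return conn


def allele_id(conn, locus, sequence):
    """Return the stable number of an allele, registering it if it is new."""
    row = conn.execute(
        "SELECT allele_id FROM alleles WHERE locus = ? AND sequence = ?",
        (locus, sequence),
    ).fetchone()
    if row is not None:
        return row[0]  # known allele: existing number is reused
    # New allele (or new marker): take the next free number for this locus.
    next_id = conn.execute(
        "SELECT COALESCE(MAX(allele_id), 0) + 1 FROM alleles WHERE locus = ?",
        (locus,),
    ).fetchone()[0]
    conn.execute(
        "INSERT INTO alleles (locus, allele_id, sequence) VALUES (?, ?, ?)",
        (locus, next_id, sequence),
    )
    conn.commit()
    return next_id


db = open_db()
first = allele_id(db, "L1", "ACAC")     # first allele of marker L1
second = allele_id(db, "L1", "ACACAC")  # a new allele of the same marker
again = allele_id(db, "L1", "ACAC")     # already known, number unchanged
other = allele_id(db, "L2", "GTGT")     # a new marker starts its own numbering
```

Keeping the numbering keyed on the sequence itself is what would let different working groups submit data independently and still obtain comparable matrices.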