Sentieon | Application Tutorial: Recommendations for Read Groups

Posted 2025-01-10 06:16:02

Introduction

This document describes the recommended usage of read group (RG) fields, including the RGID, when using Sentieon® Genomics software, so that you can avoid potential issues. It will help you determine best practices for setting the different fields of the RG tags in the BAM files you use.

Detailed description of RG fields and their usage

Detailed description of RG fields

The SAM format specification (http://samtools.github.io/hts-specs/SAMv1.pdf) defines a read group through an identifier that groups reads together. A read group in a BAM file can contain the following tags:

ID: Identifier. Unique identifier for the read group. You need to ensure that the RGID is unique within the BAM file and also unique across multiple BAM files used in the same command pipeline. This field is required.

CN: Center name. Name of the sequencing center that produced the reads. This tag is typically not used.

DS: Description. Free-form description of the read group. This tag is typically not used.

DT: Date. Date the run was produced, following the ISO8601 date or date/time format. This tag is typically not used.

FO: Flow order. Array of nucleotides corresponding to the order of nucleotides used for each flow of each read. This tag is typically not used.

KS: Key sequence. Array of nucleotide bases corresponding to the key sequence of each read. This tag is typically not used.

LB: Library. Library used for sequencing the reads.

PG: Programs. Programs used to process the read group. Typically, this information is recorded in the @PG header lines of the BAM file rather than set individually for each read group.

PI: Predicted median insert size. This tag is typically not used.

PL: Platform. Technology used for sequencing the reads. This tag is required if you plan to run Base Quality Score Recalibration (BQSR), as it is used to determine the correct error model to apply.

PM: Platform model. Free-form text providing more details about the platform/technology used. This tag is typically not used.

PU: Platform unit. Unique identifier of the sequencing unit on the instrument that produced the reads. This tag is recommended if you intend to run BQSR, because BQSR models all reads that share the same PU together; if PU is missing, BQSR instead groups reads that share the same RGID.

SM: Sample name. Name of the sample to which the reads belong. This field is required.
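As an illustration of the tags listed above, the following minimal sketch parses one tab-separated @RG header line into its tags and reports which required tags are missing. The function names `parse_rg_line` and `missing_required` and the example values are hypothetical and are not part of any Sentieon® tool; the required set (ID and SM, plus PL when BQSR is planned) simply follows the descriptions in this document.

```python
def parse_rg_line(line):
    """Parse one @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Split on the first ':' only, so values that contain ':'
        # (for example a DT date/time) are kept intact.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags


def missing_required(tags, run_bqsr=False):
    """List the tags this document marks as required that are absent."""
    required = ["ID", "SM"]
    if run_bqsr:
        required.append("PL")  # PL is needed only when BQSR will be run
    return [tag for tag in required if tag not in tags]


# Hypothetical example line, for illustration only.
example = (
    "@RG\tID:sampleA.FC1.1.ACGT\tSM:sampleA\tPL:ILLUMINA"
    "\tPU:FC1.1\tLB:sampleA.prep1"
)
rg = parse_rg_line(example)
print(rg["ID"], missing_required(rg, run_bqsr=True))
```

A read group with only an ID would be reported as missing SM, and as also missing PL when `run_bqsr=True`.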

RG field tags and Sentieon®

The following are general principles for using RG field tags in Sentieon® tools:

When using multiple input BAM files, the ID tag must be unique across all of them; two different input BAM files cannot have RGs with the same ID.
Tools use the SM tag to identify reads belonging to the same sample and process them accordingly.
Deduplication uses the LB tag to determine groups that may contain duplicates; duplicate reads should belong to the same library.
The BQSR model requires the PL tag to determine which error model to apply. If there is no PL tag, BQSR will not be performed.
If a PU tag is present, BQSR modeling will be based on read groups identified by the PU tag; if no PU tag is present, BQSR modeling will be based on read groups identified by the ID tag.
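Two of the principles above lend themselves to a quick check before running a pipeline. The sketch below is only an illustration of the rules as stated here, not Sentieon®'s implementation: `find_id_collisions` and `bqsr_group_key` are hypothetical helper names, and the read groups are assumed to be already available as dictionaries of tags per input file.

```python
from collections import defaultdict


def find_id_collisions(read_groups_by_file):
    """Return {ID: [files]} for every RG ID used in more than one input file."""
    files_by_id = defaultdict(set)
    for filename, read_groups in read_groups_by_file.items():
        for rg in read_groups:
            files_by_id[rg["ID"]].add(filename)
    return {
        rg_id: sorted(files)
        for rg_id, files in files_by_id.items()
        if len(files) > 1
    }


def bqsr_group_key(rg):
    """Key under which BQSR would group this read group, per the rules above.

    Returns None when PL is absent (BQSR is not performed),
    the PU value when present, and the ID otherwise.
    """
    if "PL" not in rg:
        return None
    return rg.get("PU", rg["ID"])


# Hypothetical inputs, for illustration only.
inputs = {
    "a.bam": [{"ID": "s1.FC1.1.ACGT", "SM": "s1", "PL": "ILLUMINA", "PU": "FC1.1"}],
    "b.bam": [
        {"ID": "s1.FC1.1.ACGT", "SM": "s1", "PL": "ILLUMINA", "PU": "FC1.1"},
        {"ID": "s2.FC1.2.TTGA", "SM": "s2", "PL": "ILLUMINA"},
    ],
}
print(find_id_collisions(inputs))
print([bqsr_group_key(rg) for rg in inputs["b.bam"]])
```

Here the first ID is shared by both files and would need to be renamed in one of them, while the second read group has no PU and therefore falls back to its ID as the grouping key.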

Filling in RG field tags

Sentieon® recommends the following conventions for RG field tags:

ID: sample_name.flowcell.lane.barcode

SM: sample_name

PL: technology platform, e.g., ILLUMINA

PU: flowcell.lane

LB: sample_name.library_prep

These recommendations ensure that:

Read group IDs will be unique across multiple BAM files, even when the same sample is sequenced in different lanes or using different libraries.
BQSR will build its recalibration on the actual sequencing units, which remains possible when multiple samples are sequenced on the same sequencing unit.
Tumor and normal sample names will be unique in somatic variant detection.
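The conventions and their consequences can be sketched as follows. This is a minimal illustration, assuming the five inputs (sample name, flowcell, lane, barcode, library prep) are known for each FASTQ set; `make_read_group` and `format_rg_line` are hypothetical helper names, and the sample, flowcell, and barcode values are made up.

```python
def make_read_group(sample_name, flowcell, lane, barcode, library_prep,
                    platform="ILLUMINA"):
    """Build RG tags following the naming conventions recommended above."""
    return {
        "ID": f"{sample_name}.{flowcell}.{lane}.{barcode}",
        "SM": sample_name,
        "PL": platform,
        "PU": f"{flowcell}.{lane}",
        "LB": f"{sample_name}.{library_prep}",
    }


def format_rg_line(rg):
    """Render the tags as a tab-separated @RG header line."""
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in rg.items())


# Hypothetical tumor/normal pair sharing one lane, plus the tumor on a second lane.
tumor = make_read_group("tumor1", "FC1", 1, "ACGT", "prepA")
normal = make_read_group("normal1", "FC1", 1, "TTGA", "prepA")
tumor_lane2 = make_read_group("tumor1", "FC1", 2, "ACGT", "prepA")

print(format_rg_line(tumor))
# IDs differ for every read group, PU is shared within a lane,
# and the tumor and normal sample names stay distinct.
print(len({tumor["ID"], normal["ID"], tumor_lane2["ID"]}))
print(tumor["PU"] == normal["PU"], tumor["SM"] != normal["SM"])
```

Because PU is derived only from the flowcell and lane, the tumor and normal read groups on the same lane share one PU, while the sample name in the ID keeps their identifiers distinct.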