Sentieon | Application Tutorial: TNscope® Using Machine Learning Models for Somatic Variant Discovery with Matched Normal Samples
Using Machine Learning Models in TNscope®
Objectives of Machine Learning Models in TNscope®
TNscope® can use machine learning models for variant filtering to improve the accuracy of results. The approach, described in https://www.biorxiv.org/content/early/2018/01/19/250647, first runs TNscope® with a series of sensitive settings to detect more candidate variants and then filters those candidates with the model. Sentieon® provides machine learning models trained on multiple GIAB truth sets (https://github.com/genome-in-a-bottle).
Using Machine Learning Models in TNscope®
Three separate commands are needed: the first calls variants with high-sensitivity settings, the second applies the machine learning model, and the third sets the model threshold using BCFtools. The input BAM files should already have gone through alignment, deduplication, and BQSR processing.
sentieon driver -t NUMBER_THREADS -r REFERENCE \
-i TUMOR_DEDUPED_BAM -q TUMOR_RECAL_DATA.TABLE \
-i NORMAL_DEDUPED_BAM -q NORMAL_RECAL_DATA.TABLE \
--algo TNscope --tumor_sample TUMOR --normal_sample NORMAL \
--clip_by_minbq 1 --max_error_per_read 3 --disable_detector sv \
--min_init_tumor_lod 2.0 --min_base_qual 10 --min_base_qual_asm 10 \
--min_tumor_allele_frac 0.00005 TMP_VARIANT_VCF

sentieon driver -t NUMBER_THREADS -r REFERENCE --algo TNModelApply \
--model ML_MODEL -v TMP_VARIANT_VCF VARIANT_VCF

bcftools filter -s "ML_FAIL" -i "INFO/ML_PROB > $ML_THRESHOLD" VARIANT_VCF \
-O z -m x -o FILTER_VARIANT_VCF
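
As a rough illustration of the final filtering step: with `-i`, records for which the expression is true pass, and `-s` soft-filters the remaining records with the `ML_FAIL` label instead of dropping them. The toy sketch below mimics that labeling with awk on made-up records. The function name `label_by_threshold`, the record IDs, and the probabilities are all hypothetical, and the sketch does not read or write VCF files.

```shell
#!/usr/bin/env bash
# Toy illustration only: the record IDs and ML_PROB values are made up.

# label_by_threshold THRESHOLD
# Reads "ID ML_PROB" pairs on stdin and prints the FILTER label each record
# would get: PASS when ML_PROB > THRESHOLD, ML_FAIL otherwise.
label_by_threshold() {
  awk -v thr="$1" '{ print $1, ($2 + 0 > thr + 0 ? "PASS" : "ML_FAIL") }'
}

# Three hypothetical candidates scored by the model.
printf '%s\n' "var1 0.95" "var2 0.50" "var3 0.10" | label_by_threshold 0.81
```

In this toy, only `var1` is labeled `PASS`. The other two records stay in the output with the `ML_FAIL` label, which is how a soft filter differs from removing records.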
The following are the input parameters required for the commands:
NUMBER_THREADS: The number of threads to be used in the computation. It is recommended not to exceed the number of available computational cores in the system.
REFERENCE: The reference genome FASTA file. Ensure that the reference genome file is the same as the one used in the alignment stage.
TUMOR_DEDUPED_BAM: The deduplicated BAM file of the tumor sample.
TUMOR_RECAL_DATA.TABLE: The BQSR result file for the tumor sample.
NORMAL_DEDUPED_BAM: The deduplicated BAM file of the normal sample.
NORMAL_RECAL_DATA.TABLE: The BQSR result file for the normal sample.
TUMOR: The SM tag name of the tumor sample in the BAM file.
NORMAL: The SM tag name of the normal sample in the BAM file.
TMP_VARIANT_VCF: The location and filename for the temporary output of TNscope® variant calling.
VARIANT_VCF: The location and filename for the variant calling output after the model is applied. A corresponding index file will be created. The software writes compressed output when the filename ends in .gz.
FILTER_VARIANT_VCF: The filename for the variant calling output after setting the final threshold. Due to the -O z option, the output file will be a bgzip-compressed vcf.gz file.
ML_MODEL: The machine learning model file.
$ML_THRESHOLD: The threshold for the probability that a variant is true according to the model. It is recommended to use 0.81.
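
Before launching the three commands, it can help to sanity-check a few of the parameters listed above. The following is a minimal sketch, not part of the Sentieon® software: the helper names (`cap_threads`, `valid_threshold`, `require_files`, `preflight`) are hypothetical, the parameter names are assumed to be set as shell variables, and `nproc` from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and checks are illustrative assumptions,
# not part of the Sentieon software.

# cap_threads REQUESTED
# Prints REQUESTED, reduced to the core count reported by nproc if larger.
cap_threads() {
  local requested="$1" cores
  cores="$(nproc)"
  if [ "$requested" -gt "$cores" ]; then echo "$cores"; else echo "$requested"; fi
}

# valid_threshold VALUE
# Succeeds only if VALUE is a plain decimal number between 0 and 1 inclusive.
valid_threshold() {
  awk -v t="$1" 'BEGIN { exit !(t ~ /^[0-9]*\.?[0-9]+$/ && t + 0 >= 0 && t + 0 <= 1) }'
}

# require_files FILE...
# Fails with a message on stderr if any of the given paths does not exist.
require_files() {
  local f
  for f in "$@"; do
    [ -e "$f" ] || { echo "missing input: $f" >&2; return 1; }
  done
}

# preflight
# Runs the checks against the variables named after the tutorial parameters.
preflight() {
  valid_threshold "$ML_THRESHOLD" || { echo "ML_THRESHOLD must be in [0, 1]" >&2; return 1; }
  require_files "$REFERENCE" "$TUMOR_DEDUPED_BAM" "$NORMAL_DEDUPED_BAM" "$ML_MODEL" || return 1
  NUMBER_THREADS="$(cap_threads "$NUMBER_THREADS")"
}
```

A script would set the variables, call `preflight`, and run the three commands only if it succeeds. Capping the thread count follows the recommendation not to exceed the number of available cores.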