Sentieon Software Quick Start Guide

Sentieon is a comprehensive, software-only solution for secondary analysis in genetic variant detection. Its pipelines follow the mathematical models of gold-standard tools such as BWA, GATK, MuTect2, STAR, Minimap2, Fgbio, and Picard. While matching the results of the open-source pipelines, it significantly improves analysis speed and detection accuracy for WGS, WES, panel, UMI, ctDNA, RNA, and other sequencing data. It is compatible with current second- and third-generation sequencing platforms and performs SNP/INDEL/SV analysis on both short-read NGS and long-read data. Joint variant calling is supported for up to 200,000 whole genomes.

Sentieon runs entirely on CPUs and is compatible with mainstream CPU architectures: x86 CPUs from Intel, AMD, Hygon, and others, as well as ARM CPUs such as Huawei Kunpeng and Alibaba Yitian. It can be deployed on laboratory workstations, HPC clusters, supercomputing centers, and cloud computing centers, and the same pipeline produces consistent results regardless of data scale.

The Sentieon software team possesses rich experience in software development and algorithm optimization engineering. They are dedicated to solving speed and accuracy bottlenecks in biological data analysis, providing efficient and precise software solutions for partners from various fields such as molecular diagnostics, drug development, clinical medicine, population cohorts, and animal and plant research, jointly promoting the development of genetic technology.

As of the end of 2023, Sentieon has served more than 1,300 users worldwide and has been cited nearly a thousand times, including in high-impact journals such as NEJM, Cell, and Nature. Sentieon has also won awards for several consecutive years in authoritative evaluations such as the precisionFDA and DREAM challenges, gaining wide recognition in the industry.


I. Operating Environment

To start using Sentieon® software, you need the following:

1.1 Hardware Requirements:

A Linux server with the following configuration:

  • Running one of the following Linux distributions or higher: RedHat/CentOS 6.5, Debian 7.7, OpenSUSE-13.2, or Ubuntu-14.04.
  • 16GB of memory for small panels or whole exome sequencing; 64GB of memory for whole genome sequencing.
  • (Recommended) High-speed SSD drives for optimal I/O performance and maximum CPU utilization.

1.2 Software Requirements:

  • Python 2.6.x, Python 2.7.x, or Python 3.x is required. You can check your Python version with the following command:
python --version

1.3 Software Installation Package:

  • Download the software installation package (using v202308.03 as an example, contact sentieon@insvast.com for updated versions):

X86 CPU version: https://insvast-download.oss-cn-shanghai.aliyuncs.com/Sentieon/release/sentieon-genomics-202308.03.tar.gz

ARM CPU version: https://insvast-download.oss-cn-shanghai.aliyuncs.com/Sentieon/release/arm-sentieon-genomics-202308.03.tar.gz

  • Extract the package using the following command, where VERSION is the version you're using, e.g., 202308.03:
tar xvzf sentieon-genomics-VERSION.tar.gz

1.4 License Requirements:

Please refer to "IV. Setting Up the License" for detailed information on setting up the license. IT support may be required.

1.5 System Environment Requirements:

  • If Python 2.6.x, Python 2.7.x, or Python 3.x is not the default Python version, set the following environment variable:
export SENTIEON_PYTHON=Python_location

  • If using a local host license file, set the following environment variable, where LICENSE_DIR is the directory containing the license file and LICENSE_FILE.lic is the license filename:
export SENTIEON_LICENSE=LICENSE_DIR/LICENSE_FILE.lic

  • If using a license server, set the following environment variable, where LICSRVR_HOST and LICSRVR_PORT are the license server's hostname and port, respectively. See "IV. Setting Up the License" for details.
export SENTIEON_LICENSE=LICSRVR_HOST:LICSRVR_PORT

  • For convenience, set the binary path as follows, where PATH_TO_SENTIEON_BINARY_DIRECTORY is the installation directory of Sentieon® binaries:
export SENTIEON_INSTALL_DIR=PATH_TO_SENTIEON_BINARY_DIRECTORY

  • When using NFS storage, to improve performance, set the SENTIEON_TMPDIR environment variable to point to local fast temporary storage:
export SENTIEON_TMPDIR=/tmp
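
The settings above can be checked before a run. The following is a minimal sketch and not part of the Sentieon package: the function name check_sentieon_env is hypothetical, and the HOST:PORT test is only a pattern match, not a connectivity check.

```shell
# Hypothetical pre-flight check for the environment variables described above.
# Returns 0 when the settings look plausible, 1 otherwise.
check_sentieon_env() {
    status=0

    # The sentieon launcher is expected under bin/ of the installation directory.
    if [ ! -x "${SENTIEON_INSTALL_DIR:-}/bin/sentieon" ]; then
        echo "SENTIEON_INSTALL_DIR is unset or has no executable bin/sentieon" >&2
        status=1
    fi

    # SENTIEON_LICENSE is either a readable license file or HOST:PORT.
    lic="${SENTIEON_LICENSE:-}"
    if [ -z "$lic" ]; then
        echo "SENTIEON_LICENSE is not set" >&2
        status=1
    elif [ -r "$lic" ]; then
        echo "license: local file $lic"
    elif printf '%s\n' "$lic" | grep -Eq '^[^:/]+:[0-9]+$'; then
        echo "license: server $lic"
    else
        echo "SENTIEON_LICENSE is neither a readable file nor HOST:PORT" >&2
        status=1
    fi

    # SENTIEON_TMPDIR is optional, but must be writable when set.
    if [ -n "${SENTIEON_TMPDIR:-}" ] && [ ! -w "$SENTIEON_TMPDIR" ]; then
        echo "SENTIEON_TMPDIR is not writable: $SENTIEON_TMPDIR" >&2
        status=1
    fi

    return "$status"
}
```

Calling check_sentieon_env at the top of a job script, and stopping when it returns non-zero, surfaces a missing variable before any compute time is spent.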

II. Running Your First Job

We provide a quick start demo project with sample scripts and data to help you quickly test the installation and diagnose potential issues.

Quickstart demo project link: https://insvast-download.oss-cn-shanghai.aliyuncs.com/Sentieon/release/sentieon_quickstart.tar.gz

The quickstart package contains data for a single chromosome, including sequence data and reference materials for the sample. The job script processes a set of paired-end Illumina fastq files using the Sentieon DNAscope pipeline:

  • BWA: Aligns reads to the reference genome.
  • Metrics and LocusCollector: Collect read statistics.
  • Dedup: Removes duplicate reads.
  • Variant calling: DNAscope variant detection.

Note: DNAscope is only recommended for samples from diploid organisms. For other samples, use DNAseq.

2.1 Running the Quickstart Package

To begin, copy the downloaded quickstart package to a new directory and extract it by running:

tar xzvf sentieon_quickstart.tar.gz

The package contains:

  • sentieon_quickstart.sh: Sample shell script driving the entire process.
  • reference: Directory containing human genome reference files and known SNP site database files.
  • models: Directory containing DNAscope model files.
  • FASTQ files: Sample sequence files.

Before running the script, ensure the above environment variables are correctly set, including license and directory paths. Then edit the user settings in sentieon_quickstart.sh using your preferred editor.

# Update the location of the Sentieon software package
SENTIEON_INSTALL_DIR=/home/release/sentieon-genomics-202308.03

# Update and uncomment the location of temporary fast storage
#SENTIEON_TMPDIR=/tmp

# It's important to assign meaningful names in real situations.
# Particularly important is to assign different names for different read groups.
sample="sample_name"
group="read_group_name"
platform="ILLUMINA"

# Other settings
nt=16 # number of threads to use in computation

# Whether the data uses PCR-free library preparation
PCRFREE=true

Note: In the user settings of the shell script sentieon_quickstart.sh:

  • It's important to assign meaningful names in real situations.
  • It's particularly important to assign different names for different read groups.

To get the number of CPU cores, users can run nproc as follows:

nproc
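
The nt setting can also be derived from nproc instead of being typed by hand. This is a small sketch; the optional cap argument is an assumption of this example (for instance, the number of threads your license allows), not a Sentieon option.

```shell
# Print the number of threads to use: all cores reported by nproc,
# optionally limited by a cap given as the first argument.
pick_threads() {
    cores=$(nproc)
    cap=${1:-$cores}
    if [ "$cap" -lt "$cores" ]; then
        echo "$cap"
    else
        echo "$cores"
    fi
}

nt=$(pick_threads)    # value to place in sentieon_quickstart.sh
echo "nt=$nt"
```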

To better understand the rest of the sentieon_quickstart.sh script, read the comments for each section and the corresponding chapters in the manual.

Now run sentieon_quickstart.sh and watch the output. On a typical Linux server, the entire run takes about 3-5 minutes; the actual time depends on the computing environment.

sh sentieon_quickstart.sh &

III. Sentieon Module Descriptions

The following table shows the different Sentieon software modules and their purposes. It also indicates which tools implement the same functionality as existing GATK pipeline tools.

3.1 List of Sentieon Modules

Table 1 Sentieon tools

IV. Setting Up the License

Sentieon® software is commercially licensed. Users need to set up the license correctly to run the software. We provide two types of licenses:

  • Single-machine evaluation license: This license is for evaluating Sentieon® software on a single machine. It allows new users to quickly start using the software without IT department assistance. To use this license, the computer running Sentieon® software needs external Internet access.
  • Cluster license: This license is for cluster environments. With this license, a lightweight floating license server process runs on one node in the cluster, providing licenses via TCP to all other nodes with network connectivity to the license server.

4.1 Setting Up a Single-Machine Evaluation License

To use a single-machine evaluation license, the compute node needs Internet access. This allows Sentieon® software to verify the license.

To use a single-machine evaluation license, follow these steps:

  1. Copy the license file to the compute node. For example, the license file LICENSE_FILE.lic is now located in LICENSE_DIR.
  2. Set the environment variable as follows:
export SENTIEON_LICENSE=LICENSE_DIR/LICENSE_FILE.lic

4.2 Setting Up the License Server

As shown in the Sentieon software licensing topology diagram below, the license server needs to meet the following conditions:

  • The license server should be able to access the Internet for license verification.
  • Compute nodes should be able to access the license server via the hostname LICSRVR_HOST.
  • The machine running the license server has an open port for the license service to listen on, and compute nodes can access this port. We assume the available port is LICSRVR_PORT.

 

You may need IT colleague support to obtain LICSRVR_HOST:LICSRVR_PORT and confirm the above requirements are met.

Note: If the license server is behind a firewall, separated from compute nodes by NAT, the hostname/IP of the license server visible to nodes may differ from its actual hostname/IP. If this is the case, you need to bind the license server to the actual IP address, while compute nodes request licenses from the IP address behind NAT. Please contact sentieon@insvast.com for more details.

Follow these steps to obtain the license file, set up and test the license server:

1. Send the following information to sentieon@insvast.com to receive the license file:

  • Specify the FQDN (Hostname) LICSRVR_HOST of the machine running the license service.
  • Specify the port LICSRVR_PORT.

2. Copy the received license file to the license server LICSRVR_HOST. We assume the license file is located at LICENSE_PATH/LICENSE_FILE. Run the following command on the license server to start the license server process:

<SENTIEON_INSTALL_DIR>/bin/sentieon licsrvr --start --log LOG_FILE LICENSE_PATH/LICENSE_FILE

3. Alternatively, you can configure and start the license server as a system daemon by following the instructions in "V. Setting Up the License as a System Service".

4. Go to the Sentieon® installation directory. Run the following command on the license server to confirm that the license server has started and is running:

<SENTIEON_INSTALL_DIR>/bin/sentieon licclnt ping -s LICSRVR_HOST:LICSRVR_PORT

If the command returns without error messages, the license server has started and is running.

5. Log in to one of the compute nodes, go to the Sentieon® installation directory, and run the above command again:

<SENTIEON_INSTALL_DIR>/bin/sentieon licclnt ping -s LICSRVR_HOST:LICSRVR_PORT

If the command returns without error messages, the compute node can now also access the license server.

6. Set the following environment variable, and you're ready to go:

export SENTIEON_LICENSE=LICSRVR_HOST:LICSRVR_PORT
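
A job script can wait for the license server before starting work. The sketch below wraps the licclnt ping command from the steps above in a retry loop. The function names are hypothetical, and it assumes that the ping command exits with a non-zero status when the server cannot be reached.

```shell
# Single ping, using the same command as in the steps above.
licclnt_ping() {
    "$SENTIEON_INSTALL_DIR/bin/sentieon" licclnt ping -s "$1"
}

# Poll the license server until it answers or the attempts run out.
# $1: HOST:PORT   $2: attempts (default 5)   $3: seconds between attempts (default 2)
wait_for_licsrvr() {
    server=$1
    tries=${2:-5}
    delay=${3:-2}
    i=1
    while [ "$i" -le "$tries" ]; do
        if licclnt_ping "$server" >/dev/null 2>&1; then
            return 0
        fi
        echo "license server $server not reachable (attempt $i of $tries)" >&2
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, wait_for_licsrvr "$SENTIEON_LICENSE" 10 5 tries ten times at five-second intervals and returns non-zero if the server never answers.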

V. Setting Up the License as a System Service

5.1 Running the License Server as a System Service Using sysvinit

If your system follows the traditional System V init startup scripts, you can set up the license server to start automatically at system boot by running the following commands as root:

1. Create and customize the configuration file. The configuration file is typically /etc/sysconfig/licsrvr; on Ubuntu, it is /etc/default/licsrvr. Here is an example of the configuration file, with recommended settings:

# /home/sentieon/release/latest is a symlink to the latest Sentieon® software package installation directory
# /home/sentieon/licsrvr is the folder for running the licsrvr service
# /home/sentieon/licsrvr/licsrvr.lic is the Sentieon® license file
licsrvr="/home/sentieon/release/latest/bin/sentieon licsrvr"
licfile="/home/sentieon/licsrvr/licsrvr.lic"
logfile="/home/sentieon/licsrvr/licsrvr.log"

2. Install the license server startup script to the /etc/init.d directory. The startup script is included in the release package.

install -m 0755 $SENTIEON_INSTALL_DIR/doc/licsrvr.sh /etc/init.d/licsrvr

3. Install and enable the service. Depending on your system, you will run different commands:

  • If your system has the Linux Standard Base Core Specifications installed, execute the system init script installation script:
/usr/lib/lsb/install_initd /etc/init.d/licsrvr

  • If your system doesn't have the lsb.conformance package installed, use the chkconfig command to enable the service:
chkconfig --add licsrvr
chkconfig licsrvr on

  • For Ubuntu and Debian systems, if you don't have the lsb/install_initd binary and choose not to install the lsb-core package, use the update-rc.d command to install and enable the service:
update-rc.d licsrvr defaults
update-rc.d licsrvr enable

4. You can use the service command to start/stop/restart/check the status of the service:

service licsrvr {start|stop|restart|status}

5.2 Running the License Server as a System Service Using systemd

You can use your operating system's systemd system and service manager to set up the license server to start automatically in your system. To do this, run the following commands as root:

1. If you use the licsrvr.service license server startup script from the doc folder, create the files and directories the script expects, along with a user account named sentieon:

  • /home/sentieon/release/latest is a symlink to the latest Sentieon® software package installation directory
  • /home/sentieon/licsrvr is the folder for running the licsrvr service
  • /home/sentieon/licsrvr/licsrvr.lic is the Sentieon® license file

Alternatively, you can edit the license server startup script to point to your specific username and/or location information.

2. Install the license server startup script to the /etc/systemd/system directory:

install -m 0644 $SENTIEON_INSTALL_DIR/doc/licsrvr.service /etc/systemd/system

3. Run the following command to enable the license server to start automatically when the machine boots:

systemctl enable licsrvr.service

4. You can use the systemctl command to manually start and stop the service:

systemctl start licsrvr.service
systemctl stop licsrvr.service

VI. Common Issues

6.1 Jemalloc error during Sentieon installation

During the installation and running of Sentieon, you may encounter the following error:

ERROR: ld.so: object '/usr/lib64/libjemalloc.so.2' from LD_PRELOAD cannot be preloaded: ignored. Failed to contact the license server at 10.10.10.1:8990

This error is related to jemalloc. Jemalloc is a memory allocator optimized for high memory allocation performance and less memory fragmentation in multi-threaded scenarios. Sentieon recommends using jemalloc to improve memory management and overall performance in Sentieon applications, especially Sentieon bwa-mem.

Solution:

1. Install jemalloc:

For different operating systems, the installation commands are as follows:

• RHEL/CentOS 8.x:

yum install epel-release
yum install jemalloc

Default installation in /usr/lib64/libjemalloc.so.2

• RHEL/CentOS 7.x:

yum install epel-release
yum install jemalloc

Default installation in /usr/lib64/libjemalloc.so.1

• Ubuntu 20.04:

apt update
apt install libjemalloc2

Default installation in /usr/lib/x86_64-linux-gnu/libjemalloc.so.2

• Ubuntu 18.04:

apt update
apt install libjemalloc1

Default installation in /usr/lib/x86_64-linux-gnu/libjemalloc.so.1

2. For other systems without pre-built packages, refer to the jemalloc GitHub page (https://github.com/jemalloc/jemalloc) for more information on how to build and install jemalloc.

3. Use environment variables to load the jemalloc library into Sentieon at runtime: For example, on a CentOS 8.x system, you can set the environment variable before running Sentieon tools with the following command:

export LD_PRELOAD=/usr/lib64/libjemalloc.so.2
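
The "cannot be preloaded" message appears when LD_PRELOAD points at a file that does not exist on the machine. One way to avoid it is to set LD_PRELOAD only after confirming the library is present. The sketch below tries the default install locations listed above; find_jemalloc is a hypothetical helper name.

```shell
# Print the first existing jemalloc library among the given paths.
# Without arguments, the default install locations listed above are tried.
find_jemalloc() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/lib64/libjemalloc.so.2 \
               /usr/lib64/libjemalloc.so.1 \
               /usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
               /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    fi
    for lib in "$@"; do
        if [ -f "$lib" ]; then
            echo "$lib"
            return 0
        fi
    done
    return 1
}

if jemalloc_lib=$(find_jemalloc); then
    export LD_PRELOAD="$jemalloc_lib"
else
    echo "jemalloc not found; leaving LD_PRELOAD unset" >&2
fi
```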

6.2 Preparing reference files for use

If your reference FASTA file has not been pre-processed to make the data usable as specified in the software, you need to process it according to the following steps:

1. Generate the BWA index using BWA.

This will create ".fasta.amb", ".fasta.ann", ".fasta.bwt", ".fasta.pac", and ".fasta.sa" files.

sentieon bwa index reference.fasta

2. Generate FASTA file index using samtools. This will create the ".fasta.fai" file.

samtools faidx reference.fasta

3. Generate sequence dictionary using Picard. This will create the ".dict" file.

java -jar picard.jar CreateSequenceDictionary REFERENCE=reference.fasta \
OUTPUT=reference.dict
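
To see which of the three steps still needs to be run for a given reference, the expected output files can be checked directly. This is a sketch with a hypothetical function name; it assumes the file naming shown above (index files next to the FASTA, and the dictionary named after the FASTA without its last extension).

```shell
# List the index files from the steps above that are missing or empty
# for a reference FASTA. Returns 1 if anything is missing.
missing_ref_files() {
    ref=$1
    missing=0
    for f in "$ref.amb" "$ref.ann" "$ref.bwt" "$ref.pac" "$ref.sa" \
             "$ref.fai" "${ref%.*}.dict"; do
        if [ ! -s "$f" ]; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}
```

Running missing_ref_files reference.fasta prints one path per missing file and prints nothing when the reference is fully prepared.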

6.3 Preparing RefSeq files for use

RefSeq files are used to aggregate the results of the CoverageMetrics algorithm to the gene level. To use RefSeq files downloaded from the UCSC Genome Browser, they need to be sorted by chromosome and position. To perform the sorting, you need to execute the following steps:

1. Remove the header from the file

grep -v "^#" FILE.refSeq > FILE.refSeq.headerless
grep -e "^#" FILE.refSeq > FILE.refSeq.header

2. First sort by position using unix sort.

sort -k 5 -n FILE.refSeq.headerless > FILE.refSeq.presorted

3. Sort by chromosome using GATK sortByRef.pl (available from) and the FASTA index fai.

perl sortByRef.pl --k 3 FILE.refSeq.presorted FASTA.fai --tmp ~/tmp \
> FILE_sorted_headerless.refSeq

4. Put the header back into the file.

cat FILE.refSeq.header > FILE_sorted.refSeq
cat FILE_sorted_headerless.refSeq >> FILE_sorted.refSeq

6.4 License message: No more license available for Sentieon…

This message is produced when you request to run Sentieon® software with more threads than your license currently allows. This situation occurs because the set of commands you are running simultaneously collectively request more threads than the number of cores supported by your license. You can use the following command to view the remaining number of threads authorized:

sentieon licclnt query -s LICSRVR_HOST:LICSRVR_PORT klib

Sentieon® commands will remain idle while waiting for an available license, but the command will not fail.

6.5 Driver fails with error: Readgroup XX is present in multiple BAM files with different attributes

This error is produced when you input two different BAM files that contain read groups with the same ID but different attributes. For example, in TNseq® and TNscope®, both tumor and normal sample BAM files have RG ID "1".

Before using BAM files, you need to edit them to make the RG IDs unique, for example by adding the SM name to the RG ID. You can view an example solution to this problem.

Alternatively, you can use the samtools addreplacerg function to modify the RG ID of the input BAM file and make it unique:

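# Add the new RG and modify all reads in the BAM file
# (ORIG_RGID is the current RG ID, NEW_RGID is the unique replacement)
RGtag=$(samtools view -H "$INPUT_BAM" | grep '^@RG' | sed "s|ID:$ORIG_RGID|ID:$NEW_RGID|g")
samtools addreplacerg -r "$RGtag" -o "$TMP_BAM" "$INPUT_BAM"

# Reset the BAM header to remove the original RG that is no longer used;
# the pattern matches the original ID only when it is the complete ID field
samtools view -H "$TMP_BAM" | grep -Ev "^@RG.*ID:$ORIG_RGID([[:space:]]|\$)" \
| samtools reheader - "$TMP_BAM" > "$OUTPUT_BAM"
rm "$TMP_BAM"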
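
Before running a tumor/normal pipeline, it can help to check whether two BAM files share any RG IDs. The sketch below works on header text saved with samtools view -H; the function names are hypothetical. It reports every shared ID, including ones whose attributes happen to be identical, so treat the output as a list of candidates to inspect.

```shell
# Print the ID of every @RG line in SAM header text read from stdin.
rg_ids() {
    awk -F'\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}

# Print the RG IDs that occur in both header files ($1 and $2).
shared_rg_ids() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || return 1
    rg_ids < "$1" | sort -u > "$tmp_a"
    rg_ids < "$2" | sort -u > "$tmp_b"
    comm -12 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Typical use is to save each header first, for example samtools view -H tumor.bam > tumor.hdr and samtools view -H normal.bam > normal.hdr, and then run shared_rg_ids tumor.hdr normal.hdr.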

6.6 Driver reports warning: none of the QualCal tables is applicable to the input BAM files

This warning means that none of the information in the recalibration table input file can be applied to the input BAM file, which may be due to using a recalibration table that does not correspond to the BAM file.

This warning may be produced when the input BAM file for QualCal does not have the correct fields in the RG field. For example, this can happen if the PL tag of the RG is set to a value other than ILLUMINA; in this case, you need to modify the BAM header to include/modify the missing/incorrect fields, for which you can use the samtools reheader function.

6.7 BWA uses abnormally large amounts of memory when using FASTQ files created from BAM files

When you use FASTQ files created by converting sorted BAM files, it may happen that all unmapped reads are grouped at the end of the FASTQ input. In this case, BWA may use abnormally large amounts of memory at the end of alignment because reads with poor mapping quality or that cannot be mapped require additional memory.

To reduce abnormal memory usage, you should first reorder the bam file to ensure that unmapped reads are not grouped together. You can do this using samtools:

samtools sort -n -@ 32 input.bam | samtools fastq -@ 32 \
-s >(gzip -c > single.fastq.gz) -0 >(gzip -c > unpaired.fastq.gz) \
-1 >(gzip -c > output_1.fastq.gz) -2 >(gzip -c > output_2.fastq.gz) -

6.8 KPNS - Known Problems No Solution

6.8.1 Gzip compressed vcf files not compressed with bgzip are not supported

Regular gzip files do not allow random or indexed access to the information they contain; only files compressed with bgzip can be indexed. Therefore, Sentieon® software does not support plain gzip-compressed VCF files as input. To use these files, decompress them with gunzip and then either use them uncompressed or recompress them with bgzip. Alternatively, you can use util vcfconvert to recompress and index the files.

sentieon util vcfconvert INPUT.vcf.gz OUTPUT.vcf.gz
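
A .vcf.gz file name does not reveal whether the file was written by gzip or bgzip. The sketch below inspects the first 14 bytes of the file: bgzip output (BGZF) is a gzip stream with the extra-field flag set and a "BC" subfield identifier at byte offsets 12-13. The function name is hypothetical, and the check assumes that the "BC" subfield comes first in the extra field, as bgzip writes it.

```shell
# Return 0 if the file starts with a BGZF (bgzip) block header, 1 otherwise.
is_bgzip() {
    hdr=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
    case "$hdr" in
        # 1f 8b: gzip magic, 08: deflate, 04: extra field present,
        # then 8 bytes (mtime, flags, OS, extra length), then "BC" (42 43)
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, is_bgzip INPUT.vcf.gz || echo "recompress with bgzip or util vcfconvert" flags files that need conversion.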

6.8.2 Gzip compressed fasta files are not supported

Currently, the software does not support gzip compressed FASTA files as input. You need to gunzip these files before use.

6.8.3 FASTQ files need to be in SANGER quality format

If your FASTQ files were produced by an Illumina™ pipeline earlier than version 1.8, the read quality scores will not be in SANGER (Phred+33) format, which may produce unexpected results. Sentieon® genomics software will not detect that you are using an unsupported format.
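
Because the software does not detect the encoding, a quick check before alignment can be useful. The following is a heuristic sketch, not a Sentieon tool. It assumes four-line FASTQ records and looks at the lowest quality character: characters below ASCII 59 occur only in Phred+33 data, while a minimum of 64 or higher suggests Phred+64. A small sample of uniformly high-quality Phred+33 reads can also produce that result, so inspect enough reads.

```shell
# Guess the quality encoding of uncompressed FASTQ text on stdin.
# Prints one of: phred33, likely-phred64, ambiguous, unknown.
guess_fastq_encoding() {
    awk '
        BEGIN {
            # Build a character-to-ASCII-code lookup table.
            for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
            min = 127
        }
        NR % 4 == 0 {
            # Every fourth line holds the quality string.
            n = length($0)
            for (j = 1; j <= n; j++) {
                ch = substr($0, j, 1)
                if ((ch in ord) && ord[ch] < min) min = ord[ch]
            }
        }
        END {
            if (min == 127)     print "unknown"
            else if (min < 59)  print "phred33"
            else if (min >= 64) print "likely-phred64"
            else                print "ambiguous"
        }'
}
```

For a compressed file, a sample of reads is usually enough, for example gzip -dc reads.fastq.gz | head -n 40000 | guess_fastq_encoding.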

6.8.4 Driver fails with error: ImportError: No module named argparse

This error is produced when running tnhapfilter in an environment where the Python version is 2.6.x and the argparse module is not installed. Install the argparse module for your Python installation, for example by running pip install argparse or by using your system's package manager.

6.9 Common usage issues

Below is a list of common issues and their solutions.

6.9.1 Driver or Util fails with error: can not open file (xxx) in mode(r), Too many open files

The root cause of this error is that the limit on simultaneously open files in the system is set too low. You can resolve this error by setting the system ulimit -n. On Linux-based systems:

Check the limit on the maximum number of open files in the system by running the following command:

ulimit -n

Set a higher limit by editing the file /etc/security/limits.conf as root and adding the following 2 lines:

* soft nofile 16384
* hard nofile 16384

If your system is running Ubuntu, you also need to add this line to your shell configuration file ~/.bashrc:

ulimit -n 16384

You need to log out of the system and log back in for the changes to take effect. After logging in, check if the changes were applied correctly by running the following command:

ulimit -n

The command should return 16384.
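
The same check can be placed at the start of a job script so that a low limit is reported before the pipeline fails partway through. This is a minimal sketch; nofile_ok is a hypothetical helper name, and the default threshold is the 16384 value used above.

```shell
# Return 0 if the soft limit on open files is at least $1 (default 16384).
nofile_ok() {
    want=${1:-16384}
    cur=$(ulimit -n)
    [ "$cur" = "unlimited" ] || [ "$cur" -ge "$want" ]
}

if nofile_ok; then
    echo "open-file limit is sufficient: $(ulimit -n)"
else
    echo "open-file limit is $(ulimit -n); raise it as described above" >&2
fi
```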

6.9.2 Driver fails with error: Contig XXX from vcf/bam is not present in the reference, or error Contig XXX has different size in vcf/bam than in the reference

The root cause of this error is that the input VCF or BAM file is incompatible with the reference fasta file. The contig in the file does not exist in the reference, or the size of the contig is different. This is likely due to using a VCF or BAM file processed with a different reference.
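
For BAM input, the mismatch can be located by comparing the @SQ lines of the BAM header with the reference index. The sketch below uses a hypothetical function name. It expects a non-empty .fai file (as created by samtools faidx) and header text saved with samtools view -H, and it does not cover VCF contig lines.

```shell
# Compare contigs in SAM header text against a reference .fai index.
# $1: reference .fai file   $2: header text file
# Prints "missing <name>" or "size <name> <header length> <reference length>",
# tab-separated, for every contig that does not match.
compare_contigs() {
    awk -F'\t' '
        NR == FNR { len[$1] = $2; next }    # first file: .fai (name, length, ...)
        /^@SQ/ {
            name = ""; size = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) size = substr($i, 4)
            }
            if (!(name in len))
                print "missing\t" name
            else if (len[name] + 0 != size + 0)
                print "size\t" name "\t" size "\t" len[name]
        }' "$1" "$2"
}
```

Empty output means every contig in the header exists in the reference with the same length.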

6.9.3 Driver reports warning: Contigs in the vcf file XXX do not match any contigs in the reference

The root cause of this warning is that the input VCF file is incompatible with the reference FASTA file: the contigs in the file do not exist in the reference. This is likely due to using a VCF file from a different reference.

6.9.4 BWA fails with error: Killed

This error is produced when BWA receives a SIGKILL signal from the operating system. SIGKILL may be sent by the kernel's out-of-memory (OOM) manager if the system's available memory is insufficient. You can check the kernel logs on the system to confirm if the SIGKILL signal was sent by the OOM manager.

To resolve this error, you can reduce BWA's memory usage using the bwt_max_mem environment variable.
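
As an illustration, the limit can be derived from the memory installed on the machine. This is a sketch under stated assumptions: the value format (a whole number followed by G), the 50% fraction, and the helper name mem_cap are all assumptions of this example, so check the manual for the accepted values. The result is rounded down to whole gigabytes, so very small machines yield 0G and need a hand-picked value.

```shell
# Print a memory cap such as "32G": a percentage of a total given in kB.
# $1: total memory in kB   $2: percentage to allow (default 50)
mem_cap() {
    total_kb=$1
    pct=${2:-50}
    echo "$(( total_kb * pct / 100 / 1048576 ))G"
}

# On Linux, MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    bwt_max_mem=$(mem_cap "$total_kb" 50)
    export bwt_max_mem
    echo "bwt_max_mem=$bwt_max_mem"
fi
```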