pop-variant-calling

This best practice shows how to use the PopGenomics CLI on the Volcengine platform to batch-submit germline variant calling tasks.

1. Get the workflow inputs and outputs json

First, get the inputs and outputs JSON files of the workflow with this command:

$ popgen vol get-template -w PopGenomics -n germline-variant-calling-gpu

You will get two JSON files: one for the workflow inputs and one for the workflow outputs.

The outputs JSON file can be used as-is, but the inputs JSON file needs to be filled in with the parameter values for the workflow.

The template inputs JSON file looks like this (the # lines are explanatory annotations; standard JSON does not allow comments):

{
    # whether to split the GVCF and VCF files by chromosome
    "germline_variant_calling_gpu.if_split_variant": "Boolean",
    # the TOS path to the fastq1 file
    "germline_variant_calling_gpu.inputFASTQ_1": "File",
    # the TOS path to the fastq2 file
    "germline_variant_calling_gpu.inputFASTQ_2": "File",
    # the TOS path to the tar file of the reference genome index
    "germline_variant_calling_gpu.inputRefTarball": "File",
    # the TOS path to the tar file of the interval files used to split the GVCF and VCF
    "germline_variant_calling_gpu.interval_tar": "File",
    # the sample name
    "germline_variant_calling_gpu.sample_name": "String"
}

This is the WDL inputs JSON for the germline-variant-calling-gpu workflow.

Of these 6 parameters, if_split_variant, inputRefTarball and interval_tar are the same for every sample, while inputFASTQ_1, inputFASTQ_2 and sample_name differ per sample, so change the inputs JSON file to this:

{
    "germline_variant_calling_gpu.if_split_variant": true,
    "germline_variant_calling_gpu.inputFASTQ_1": "this.fastq1",
    "germline_variant_calling_gpu.inputFASTQ_2": "this.fastq2",
    "germline_variant_calling_gpu.inputRefTarball": "s3://bucket_name/path/to/ref.fa.tar",
    "germline_variant_calling_gpu.interval_tar": "s3://bucket_name/path/to/ref.interval.tar",
    "germline_variant_calling_gpu.sample_name": "this.sample_name"
}

The 3 constant parameters are written directly in the JSON, while the 3 per-sample parameters are read from the data model: the prefix this. followed by a column name tells the workflow to take the value from that column of the data model.
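As a minimal sketch of the split described above, the following Python snippet assembles the inputs JSON from two groups: constants written as literal values, and per-sample parameters written as this.<column> references. The helper name build_inputs and its arguments are hypothetical and not part of the PopGenomics CLI; the parameter names and paths are the placeholders used in this guide.

```python
import json

# Workflow prefix used by every key in the inputs JSON of this guide.
WORKFLOW = "germline_variant_calling_gpu"


def build_inputs(constants, per_sample_columns, workflow=WORKFLOW):
    """Hypothetical helper: build the inputs mapping.

    constants: parameter name -> literal value shared by all samples.
    per_sample_columns: parameter name -> data model column name.
    """
    inputs = {}
    for param, value in constants.items():
        inputs[f"{workflow}.{param}"] = value
    for param, column in per_sample_columns.items():
        # The "this." prefix marks a value read from the data model.
        inputs[f"{workflow}.{param}"] = f"this.{column}"
    return inputs


inputs = build_inputs(
    constants={
        "if_split_variant": True,
        "inputRefTarball": "s3://bucket_name/path/to/ref.fa.tar",
        "interval_tar": "s3://bucket_name/path/to/ref.interval.tar",
    },
    per_sample_columns={
        "inputFASTQ_1": "fastq1",
        "inputFASTQ_2": "fastq2",
        "sample_name": "sample_name",
    },
)

# json.dumps turns Python True into JSON true.
print(json.dumps(inputs, indent=4, sort_keys=True))
```

Save the printed text as germline-variant-calling-gpu.inputs.json to use it in the later steps.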

2. Data model preparation

We assume you have already uploaded your raw FASTQ files to Volcengine TOS (Tinder Object Storage) in the Beijing region; if not, you can follow the tosutil guide on uploading data. Then collect the TOS paths of all the FASTQ files (changing the path prefix from tos:// to s3://) and organize them into a metadata table in CSV format like this:

sample_name,fastq1,fastq2
sample1,s3://bucket_name/path/to/sample1_1.fastq.gz,s3://bucket_name/path/to/sample1_2.fastq.gz
sample2,s3://bucket_name/path/to/sample2_1.fastq.gz,s3://bucket_name/path/to/sample2_2.fastq.gz
sample3,s3://bucket_name/path/to/sample3_1.fastq.gz,s3://bucket_name/path/to/sample3_2.fastq.gz
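For many samples it may be easier to generate this table than to type it. The sketch below is one possible way to do that in Python: it rewrites the tos:// prefix to s3:// and emits comma-separated text with a header line. The function names tos_to_s3 and build_metadata_csv are hypothetical, and the paths are the placeholders used in this guide.

```python
import csv
import io


def tos_to_s3(path):
    """Rewrite a tos:// URI as s3://; leave any other path unchanged."""
    prefix = "tos://"
    if path.startswith(prefix):
        return "s3://" + path[len(prefix):]
    return path


def build_metadata_csv(samples):
    """Hypothetical helper: samples is an iterable of
    (sample_name, fastq1, fastq2) tuples; returns CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sample_name", "fastq1", "fastq2"])
    for name, fq1, fq2 in samples:
        writer.writerow([name, tos_to_s3(fq1), tos_to_s3(fq2)])
    return buf.getvalue()


# Placeholder TOS paths, one pair of FASTQ files per sample.
samples = [
    (f"sample{i}",
     f"tos://bucket_name/path/to/sample{i}_1.fastq.gz",
     f"tos://bucket_name/path/to/sample{i}_2.fastq.gz")
    for i in (1, 2, 3)
]

csv_text = build_metadata_csv(samples)
print(csv_text, end="")
```

Save the output as meta_data.csv; the column names here match the this.fastq1, this.fastq2 and this.sample_name references in the inputs JSON.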

The rules for organizing the metadata CSV file are:

The CSV file must have a header line.
The column names must include the suffix of every value in the inputs JSON file that carries the this. prefix. For example, if the inputs JSON file contains the value this.sample_name, the CSV file must have a column named sample_name.
The values in the first column of the table must be unique.
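These rules can be checked locally before uploading. The following is a rough sketch, not the check that PopGenomics CLI itself performs: the function check_metadata is hypothetical, and it only covers the three rules listed above, assuming a comma-separated file.

```python
import csv
import io


def check_metadata(csv_text, inputs):
    """Hypothetical pre-check: return a list of problems (empty if none)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["CSV is empty: a header line is required"]
    header, body = rows[0], rows[1:]
    problems = []

    # Every "this.<column>" value in the inputs JSON needs a matching column.
    # A file without a header line will normally fail this check too.
    needed = {
        value[len("this."):]
        for value in inputs.values()
        if isinstance(value, str) and value.startswith("this.")
    }
    for column in sorted(needed - set(header)):
        problems.append(f"missing column: {column}")

    # Values in the first column must be unique.
    seen = set()
    for row in body:
        if not row:
            continue  # skip blank lines
        if row[0] in seen:
            problems.append(f"duplicate first-column value: {row[0]}")
        seen.add(row[0])
    return problems


inputs = {
    "germline_variant_calling_gpu.if_split_variant": True,
    "germline_variant_calling_gpu.inputFASTQ_1": "this.fastq1",
    "germline_variant_calling_gpu.inputFASTQ_2": "this.fastq2",
    "germline_variant_calling_gpu.sample_name": "this.sample_name",
}
good = "sample_name,fastq1,fastq2\nsample1,a_1.fq.gz,a_2.fq.gz\n"
bad = "sample_name,fastq1\nsample1,a_1.fq.gz\nsample1,b_1.fq.gz\n"

print(check_metadata(good, inputs))
print(check_metadata(bad, inputs))
```

The second call reports both the missing fastq2 column and the repeated sample1 value, so all problems can be fixed in one pass.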

3. Create data model

Once you have prepared the data model CSV file for a specific workflow, create a new data model to supply the input parameters and store the results of the workflow.

# Create a new data model whose name is 'meta_data'
$ popgen vol create-data-model -w PopGenomics -n germline-variant-calling-gpu -i germline-variant-calling-gpu.inputs.json -m meta_data.csv

PopGenomics CLI checks whether meta_data.csv matches the inputs JSON and the workflow inputs. If there is no error, the data model is uploaded successfully. The data model name is the file name of the CSV file without its extension; in this example it is meta_data. Record the data model name, because you need it to submit analysis tasks with this data model.

4. Submit the analysis tasks

With the data model in place, use the create-submission sub-command to submit the analysis tasks and poll the task status:

$ popgen vol create-submission -w PopGenomics -n germline-variant-calling-gpu -i germline-variant-calling-gpu.inputs.json -o germline-variant-calling-gpu.outputs.json -m meta_data

In this command, the outputs JSON is used to receive the results of the workflow and write them into the data model as new columns. For example, if the outputs JSON has a parameter outputBAM, then when a workflow run succeeds, the BAM file path is written to a new column outputBAM in the data model meta_data. Be careful: if you do not pass the outputs JSON (-o) on the command line, the results will not be written to the data model.

If you do not set the -l parameter, the program runs all the samples in the data model by default. If you only want to run some of the samples, or retry the failed ones, pass a sample list: a file containing values from the first column of the CSV file, one per line, without a header line. For example:

sample1
sample3
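A sample list like this can be produced from the metadata CSV so that every name is known to exist in the data model. The snippet below is a small sketch under that assumption; make_sample_list is a hypothetical helper, and it expects comma-separated metadata with a header line.

```python
import csv
import io


def make_sample_list(csv_text, wanted):
    """Hypothetical helper: return sample-list text (one name per line,
    no header) after checking each name against the CSV's first column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    known = {row[0] for row in rows[1:] if row}  # rows[0] is the header
    unknown = [name for name in wanted if name not in known]
    if unknown:
        raise ValueError("not in data model: " + ", ".join(unknown))
    return "".join(f"{name}\n" for name in wanted)


csv_text = (
    "sample_name,fastq1,fastq2\n"
    "sample1,s3://bucket_name/path/to/sample1_1.fastq.gz,s3://bucket_name/path/to/sample1_2.fastq.gz\n"
    "sample2,s3://bucket_name/path/to/sample2_1.fastq.gz,s3://bucket_name/path/to/sample2_2.fastq.gz\n"
    "sample3,s3://bucket_name/path/to/sample3_1.fastq.gz,s3://bucket_name/path/to/sample3_2.fastq.gz\n"
)

list_text = make_sample_list(csv_text, ["sample1", "sample3"])
print(list_text, end="")
```

Write the resulting text to a file such as samples.list and pass that file with -l.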

You can also use the --webhook parameter to send the polling messages to a Feishu bot, so the complete command looks like this:

$ popgen vol create-submission \
    -w PopGenomics \
    -n germline-variant-calling-gpu \
    -i germline-variant-calling-gpu.inputs.json \
    -o germline-variant-calling-gpu.outputs.json \
    -m meta_data \
    -l samples.list \
    --webhook https://open.feishu.cn/open-apis/bot/v2/hook/********

If the analysis tasks are submitted successfully, you will get a submission ID, which you need in order to check the submission status.

5. Check the running status

While the batch submission is running, the CLI polls the number of succeeded, failed and running tasks. You can also use the get-submission sub-command to get this information if the polling was terminated by accident.

To see the status of each task in detail, use the get-runs sub-command:

$ popgen vol get-runs -w PopGenomics -s submission_id -o output.csv

6. Get the results

When the submission is finished, you can use the get-data-model sub-command to get the results

$ popgen vol get-data-model -w PopGenomics -m meta_data

Important information

You need to record the data model name to submit tasks and get the results.
You need to record the submission ID to check the running status.