Cancer Risk Predictor

Path containing FASTQ files
Path to reference genome FASTA files
Base name for the jobs
Number of threads for processing
Location of the runtime environment

Detailed Guide

Workflow Overview

This tool triggers a comprehensive genomic analysis pipeline for cancer risk prediction. The workflow includes:

  • Quality Control: FastQC for initial quality assessment
  • Read Trimming: fastp 1.1.0 for adapter trimming and quality filtering
  • Alignment: BWA-MEM2 v2.3 for mapping reads to reference genome
  • Variant Calling: GATK 4.6.2.0 best practices pipeline
  • Analysis: Variant annotation and risk prediction
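
This page does not show the commands the pipeline runs. As a rough illustration only, the sketch below builds (without running) typical GATK 4 commands for the variant calling stage. The output names are chosen to match the files listed under Example Output. The function name gatk_stage_commands, the choice of tools (MarkDuplicates, BaseRecalibrator, ApplyBQSR, HaplotypeCaller in GVCF mode) and their options are assumptions inferred from those file names, not the tool's confirmed implementation.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def gatk_stage_commands(sample: str, bam_dir: Path, out_dir: Path, ref_dir: Path) -> list[list[str]]:
    """Build, but do not run, assumed GATK commands for one sample."""
    ref = ref_dir / "hg38.fa"
    known_sites = [
        ref_dir / "dbSNP.hg38.vcf.gz",
        ref_dir / "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    ]
    stem = f"{sample}_RG.sorted"
    in_bam = bam_dir / f"{stem}.bam"
    dedup = out_dir / f"{stem}.dedup.bam"
    metrics = out_dir / f"{stem}.dup_metrics.txt"
    table = out_dir / f"{stem}.recal_data.table"
    recal = out_dir / f"{stem}.recal.bam"
    gvcf = out_dir / f"{stem}.g.vcf.gz"

    # Base quality recalibration uses both known-variant databases.
    bqsr = ["gatk", "BaseRecalibrator", "-R", str(ref), "-I", str(dedup), "-O", str(table)]
    for vcf in known_sites:
        bqsr += ["--known-sites", str(vcf)]

    return [
        ["gatk", "MarkDuplicates", "-I", str(in_bam), "-O", str(dedup),
         "-M", str(metrics), "--CREATE_INDEX", "true"],
        bqsr,
        ["gatk", "ApplyBQSR", "-R", str(ref), "-I", str(dedup),
         "--bqsr-recal-file", str(table), "-O", str(recal)],
        ["gatk", "HaplotypeCaller", "-R", str(ref), "-I", str(recal),
         "-O", str(gvcf), "-ERC", "GVCF"],
    ]


if __name__ == "__main__":
    base = Path("/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/vcf/sample1")
    for cmd in gatk_stage_commands(
        "sample1",
        base / "rg_added_bams",
        base / "sample1_RG.sorted",
        Path("/pool/sun/ghrunner/lyx/ref/hg38"),
    ):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without GATK installed.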

Input Requirements

FASTQ File Format: Files must be gzip-compressed and use the .fastq.gz extension (.fq.gz is not supported). Both single-end and paired-end reads are supported.

Directory Structure: All FASTQ files must be placed in a root folder named fastq. Each sample's files must be stored in a separate subdirectory under fastq (required for parallel processing).

File Naming Convention: In each sample subdirectory, paired-end read files should end in R1.fastq.gz and R2.fastq.gz, for example sample1.R1.fastq.gz and sample1.R2.fastq.gz. Example structure:

  fastq/sample1/sample1.R1.fastq.gz
  fastq/sample1/sample1.R2.fastq.gz
  fastq/sample2/sample2.R1.fastq.gz
  fastq/sample2/sample2.R2.fastq.gz
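
The layout rules above can be checked before submitting a job. The following is a minimal sketch, not part of the tool: the function name check_fastq_layout and its messages are hypothetical. It assumes that a file ending in .R2.fastq.gz needs a matching .R1.fastq.gz mate, and it does not flag an R1 file without an R2, because single-end reads are supported.

```python
from __future__ import annotations

from pathlib import Path


def check_fastq_layout(fastq_root: Path) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    problems: list[str] = []
    if fastq_root.name != "fastq":
        problems.append(f"root folder must be named 'fastq', got '{fastq_root.name}'")
    if not fastq_root.is_dir():
        problems.append(f"{fastq_root} is not a directory")
        return problems

    for entry in sorted(fastq_root.iterdir()):
        if not entry.is_dir():
            # Files directly under fastq/ break the one-folder-per-sample rule.
            problems.append(f"{entry.name}: must be inside a per-sample subdirectory")
            continue
        names = sorted(p.name for p in entry.iterdir() if p.is_file())
        if not names:
            problems.append(f"{entry.name}/: no files found")
        for name in names:
            if name.endswith(".fq.gz"):
                problems.append(f"{entry.name}/{name}: .fq.gz is not supported, use .fastq.gz")
            elif not name.endswith(".fastq.gz"):
                problems.append(f"{entry.name}/{name}: not a .fastq.gz file")
            elif name.endswith(".R2.fastq.gz"):
                mate = name[: -len(".R2.fastq.gz")] + ".R1.fastq.gz"
                if mate not in names:
                    problems.append(f"{entry.name}/{name}: missing mate {mate}")
    return problems


if __name__ == "__main__":
    for problem in check_fastq_layout(Path("fastq")):
        print(problem)
```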
                        

Folder Permission: The root fastq folder must be set to permissions 757 or 777 to allow read and write access for the Docker container.
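
To confirm the folder permission before a run, something like the following could be used. It is a small sketch with hypothetical helper names (folder_mode, permission_ok) that accepts exactly the two modes named above.

```python
import stat
from pathlib import Path

# The two permission modes this page names as acceptable.
ALLOWED_MODES = {0o757, 0o777}


def folder_mode(folder: Path) -> int:
    """Return the rwx permission bits of a folder, e.g. 0o757."""
    return stat.S_IMODE(folder.stat().st_mode) & 0o777


def permission_ok(mode: int) -> bool:
    """True if the mode is one of the permissions the Docker container needs."""
    return mode in ALLOWED_MODES


if __name__ == "__main__":
    root = Path("fastq")
    mode = folder_mode(root)
    if permission_ok(mode):
        print(f"{root}: permissions {mode:o} are fine")
    else:
        print(f"{root}: permissions are {mode:o}; run 'chmod 757 {root}'")
```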

Reference Genome

The reference directory (/mnt/d1/pool/sun/ghrunner/lyx/ref/hg38/) must contain the following hg38 files for the fastq-fastp-bwa-gatk pipeline. Each data file must be accompanied by its index files, which BWA and GATK need in order to run:

  1. FASTA sequence + BWA/GATK indexes (needed for genome alignment)
    hg38.fa, hg38.fa.amb, hg38.fa.ann, hg38.fa.pac, hg38.fa.fai
  2. Sequence dictionary required by GATK
    hg38.dict (or Homo_sapiens_assembly38.dict; either name is accepted)
  3. Known variant databases, each as a compressed VCF plus its tabix index (.tbi), used for GATK base quality recalibration and variant filtering
    • dbSNP database: dbSNP.hg38.vcf.gz + dbSNP.hg38.vcf.gz.tbi
    • Mills and 1000G gold standard indels: Mills_and_1000G_gold_standard.indels.hg38.vcf.gz + Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
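
A quick way to see whether a reference directory is complete is to compare it against the file list above. The sketch below does only that; the function name missing_reference_files is hypothetical, and it checks only the files named on this page.

```python
from __future__ import annotations

from pathlib import Path

# Files that must all be present, as listed on this page.
REQUIRED_FILES = [
    "hg38.fa",
    "hg38.fa.amb",
    "hg38.fa.ann",
    "hg38.fa.pac",
    "hg38.fa.fai",
    "dbSNP.hg38.vcf.gz",
    "dbSNP.hg38.vcf.gz.tbi",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
]
# Either one of these dictionary names is enough.
DICT_ALTERNATIVES = ["hg38.dict", "Homo_sapiens_assembly38.dict"]


def missing_reference_files(ref_dir: Path) -> list[str]:
    """Return the names of required reference files that are absent from ref_dir."""
    missing = [name for name in REQUIRED_FILES if not (ref_dir / name).is_file()]
    if not any((ref_dir / name).is_file() for name in DICT_ALTERNATIVES):
        missing.append(" or ".join(DICT_ALTERNATIVES))
    return missing


if __name__ == "__main__":
    for name in missing_reference_files(Path("/mnt/d1/pool/sun/ghrunner/lyx/ref/hg38")):
        print(f"missing: {name}")
```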

Frequently Asked Questions

Authentication Issues

Q: My Secret Key is not working. What should I do?
A: Your key may have expired. Please use the Contact Support link in the footer to get help.

Common Errors

Q: "Workflow not found" error
A: Check that the workflow file gatk.yaml exists in the repository and that you have permission to access it.

Q: "Path does not exist" error
A: Verify that all input paths exist and are accessible from the workflow environment. Use absolute paths, and check that the input format complies with the requirements above.

Q: Task fails at a certain step
A: Every file in a batch must complete the current step successfully before the batch can proceed to the next step. Manually remove the files that caused the error.

Q: Where can I find the results?
A: Each step automatically writes its intermediate output files under the input path. To run a single step on its own, select the corresponding individual analysis tool on the home page.

Q: Can I enter any thread number?
A: Yes. If the thread number exceeds the upper limit, the system handles it automatically.

Example Configuration

Example Input Path

# Typical project structure
/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/fastq/
├── sample1/sample1.R1.fastq.gz
├── sample1/sample1.R2.fastq.gz
├── sample2/sample2.R1.fastq.gz
└── sample2/sample2.R2.fastq.gz

Example Output

# Main output directory structure
/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/vcf/
├── sample1/
│   ├── sample1_RG.sorted/
│   └── rg_added_bams/
├── sample2/
│   ├── sample2_RG.sorted/
│   └── rg_added_bams/
└── DNA-25M-1-test/
    ├── DNA-25M-1-test_RG.sorted/
    │   ├── DNA-25M-1-test_RG.sorted.dedup.bai
    │   ├── DNA-25M-1-test_RG.sorted.dedup.bam
    │   ├── DNA-25M-1-test_RG.sorted.dup_metrics.txt
    │   ├── DNA-25M-1-test_RG.sorted.g.vcf.gz
    │   ├── DNA-25M-1-test_RG.sorted.g.vcf.gz.tbi
    │   ├── DNA-25M-1-test_RG.sorted.recal.bai
    │   ├── DNA-25M-1-test_RG.sorted.recal.bam
    │   └── DNA-25M-1-test_RG.sorted.recal_data.table
    └── rg_added_bams/
        ├── DNA-25M-1-test_RG.sorted.bam
        └── DNA-25M-1-test_RG.sorted.bam.bai

# Intermediate steps output note
The intermediate steps (quality control, trimming, genome alignment) also write their results in a similar directory structure.
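
Once a run has finished, the per-sample gVCF files can be collected from the vcf/ tree shown above. The sketch below assumes the <sample>/<sample>_RG.sorted/<sample>_RG.sorted.g.vcf.gz layout from the example and treats a sample as complete only when the .tbi index is also present. The function names find_gvcfs and samples_without_gvcf are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def find_gvcfs(vcf_root: Path) -> dict[str, Path]:
    """Map sample name to gVCF path for samples that have both gVCF and .tbi index."""
    found: dict[str, Path] = {}
    for gvcf in sorted(vcf_root.glob("*/*_RG.sorted/*_RG.sorted.g.vcf.gz")):
        index = gvcf.with_name(gvcf.name + ".tbi")
        if index.is_file():
            # parents[1] is the per-sample folder directly under vcf/.
            found[gvcf.parents[1].name] = gvcf
    return found


def samples_without_gvcf(vcf_root: Path) -> list[str]:
    """List sample folders that do not yet contain an indexed gVCF."""
    done = find_gvcfs(vcf_root)
    return sorted(p.name for p in vcf_root.iterdir() if p.is_dir() and p.name not in done)


if __name__ == "__main__":
    root = Path("/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/vcf")
    for sample, path in find_gvcfs(root).items():
        print(f"{sample}\t{path}")
    for sample in samples_without_gvcf(root):
        print(f"{sample}\tno indexed gVCF yet")
```

Samples reported without a gVCF are the ones to inspect first when a task stops at a certain step.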

Complete Input Parameters Example

# All parameters filled in
Input path: /pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test
Reference path: /pool/sun/ghrunner/lyx/ref/hg38
Job name: test_run
Thread number: 64
Run location: pipeline
Token: [your Secret Key]
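
The parameters above can be sanity-checked before submission. The sketch below is illustrative only: the class RunParameters and its field names are hypothetical and do not correspond to a documented API of this tool. It checks that both paths are absolute, as the FAQ recommends, and that the remaining fields are filled in; it does not check that the paths exist.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RunParameters:
    """Hypothetical container mirroring the form fields on this page."""
    input_path: str
    reference_path: str
    job_name: str
    threads: int
    run_location: str
    token: str

    def problems(self) -> list[str]:
        """Return a list of problems found; an empty list means none."""
        found: list[str] = []
        paths = (("Input path", self.input_path), ("Reference path", self.reference_path))
        for label, value in paths:
            if not PurePosixPath(value).is_absolute():
                found.append(f"{label} must be an absolute path: {value!r}")
        if self.threads < 1:
            found.append("Thread number must be at least 1")
        for label, value in (("Job name", self.job_name), ("Token", self.token)):
            if not value.strip():
                found.append(f"{label} must not be empty")
        return found


if __name__ == "__main__":
    params = RunParameters(
        input_path="/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test",
        reference_path="/pool/sun/ghrunner/lyx/ref/hg38",
        job_name="test_run",
        threads=64,
        run_location="pipeline",
        token="[your Secret Key]",
    )
    print(params.problems() or "no problems found")
```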