Cancer Risk Predictor
Detailed Guide
Workflow Overview
This tool triggers a comprehensive genomic analysis pipeline for cancer risk prediction. The workflow includes:
- Quality Control: FastQC for initial quality assessment
- Read Trimming: fastp 1.1.0 for adapter trimming and quality filtering
- Alignment: BWA-MEM2 v2.3 for mapping reads to the reference genome
- Variant Calling: GATK 4.6.2.0 best practices pipeline
- Analysis: Variant annotation and risk prediction
Input Requirements
FASTQ File Format: Files must be gzip-compressed and use the .fastq.gz extension (.fq.gz is not supported). Both single-end and paired-end reads are supported.
Directory Structure: All FASTQ files must be placed in a root folder named fastq. Each sample's files must be stored in a separate subdirectory under fastq (required for parallel processing).
File Naming Convention: Paired-end read files in each sample subdirectory should end in R1.fastq.gz and R2.fastq.gz. Example structure:
fastq/sample1/sample1.R1.fastq.gz
fastq/sample1/sample1.R2.fastq.gz
fastq/sample2/sample2.R1.fastq.gz
fastq/sample2/sample2.R2.fastq.gz
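The layout rules above can be checked before submitting a job. The following is a minimal sketch, not part of the tool itself: the function name check_fastq_layout is hypothetical, and it assumes that paired files are matched by the text preceding R1.fastq.gz / R2.fastq.gz, as in the example structure.

```python
from pathlib import Path


def check_fastq_layout(fastq_root):
    """Return a list of human-readable problems found under fastq_root."""
    root = Path(fastq_root)
    problems = []
    if root.name != "fastq":
        problems.append(f"root folder is named {root.name!r}, expected 'fastq'")
    if not root.is_dir():
        problems.append(f"{root} is not a directory")
        return problems
    for entry in sorted(root.iterdir()):
        if entry.is_file():
            # Files must live inside a per-sample subdirectory.
            problems.append(f"{entry.name}: file sits directly in fastq/")
            continue
        files = sorted(p.name for p in entry.iterdir() if p.is_file())
        if not files:
            problems.append(f"{entry.name}: no files found")
        for name in files:
            if not name.endswith(".fastq.gz"):
                problems.append(f"{entry.name}/{name}: does not end in .fastq.gz")
        # Paired-end check: each R1 file needs a matching R2 file and vice versa.
        r1 = {n[: -len("R1.fastq.gz")] for n in files if n.endswith("R1.fastq.gz")}
        r2 = {n[: -len("R2.fastq.gz")] for n in files if n.endswith("R2.fastq.gz")}
        for prefix in sorted(r1 ^ r2):
            problems.append(f"{entry.name}: {prefix}R1/R2 pair is incomplete")
    return problems
```

An empty list means no problems were found. Single-end samples pass the pairing check as long as their file names do not end in R1.fastq.gz or R2.fastq.gz.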
Folder Permission: The root fastq folder must have permissions 757 or 777 so that the Docker container can read from and write to it.
Reference Genome
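As an illustration of the permission requirement, the sketch below reads the permission bits of a folder and compares them with the two accepted values. The helper name has_container_access is hypothetical; only the values 757 and 777 come from this guide.

```python
import os
import stat


def has_container_access(path):
    """Return True if the permission bits of path are exactly 757 or 777."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode in (0o757, 0o777)


# To set the permission from Python instead of the shell:
# os.chmod("/path/to/fastq", 0o757)
```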
The reference directory (/mnt/d1/pool/sun/ghrunner/lyx/ref/hg38/) must contain the following hg38 files for the fastq-fastp-bwa-gatk pipeline, together with their index files (required by BWA and GATK):
- FASTA sequence + BWA/GATK indexes (core for genome alignment): hg38.fa, hg38.fa.amb, hg38.fa.ann, hg38.fa.pac, hg38.fa.fai
- GATK required sequence dictionary: hg38.dict (or Homo_sapiens_assembly38.dict; either is acceptable)
- Known variant databases (compressed VCF + tabix index .tbi), used for GATK base quality recalibration and variant filtering:
  - dbSNP database: dbSNP.hg38.vcf.gz + dbSNP.hg38.vcf.gz.tbi
  - 1000G gold standard indels: Mills_and_1000G_gold_standard.indels.hg38.vcf.gz + Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
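A quick way to confirm that a reference directory is complete is to test for each file named in the list above. This is a minimal sketch; the function name missing_reference_files is hypothetical, and the file names are copied from this guide.

```python
from pathlib import Path

# Files that must all be present, as listed in this guide.
REQUIRED_FILES = [
    "hg38.fa",
    "hg38.fa.amb",
    "hg38.fa.ann",
    "hg38.fa.pac",
    "hg38.fa.fai",
    "dbSNP.hg38.vcf.gz",
    "dbSNP.hg38.vcf.gz.tbi",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
    "Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
]
# Either dictionary name is acceptable; one of them must exist.
DICT_ALTERNATIVES = ["hg38.dict", "Homo_sapiens_assembly38.dict"]


def missing_reference_files(ref_dir):
    """Return the names of required reference files absent from ref_dir."""
    ref = Path(ref_dir)
    missing = [name for name in REQUIRED_FILES if not (ref / name).is_file()]
    if not any((ref / name).is_file() for name in DICT_ALTERNATIVES):
        missing.append(" or ".join(DICT_ALTERNATIVES))
    return missing
```

An empty result means every file named in this section was found.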
Frequently Asked Questions
Authentication Issues
Q: My Secret Key is not working, what should I do?
A: Your key may have expired. Please use the Contact Support link in the footer to request help.
Common Errors
Q: "Workflow not found" error
A: Check that the workflow file gatk.yaml exists in the repository and that you have permission to access it.
Q: "Path does not exist" error
A: Verify that all input paths exist and are accessible from the workflow environment. Use absolute paths, and check that the input format complies with the input requirements.
Q: Task fails at a certain step
A: Every file in a batch must complete the current step successfully before the batch can proceed to the next step. Please manually remove the files that failed.
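The batch rule can be pictured with a small model. This is only an illustrative sketch of the behavior described in the answer, not the pipeline's actual implementation; run_batch and the step callables are hypothetical.

```python
def run_batch(samples, steps):
    """Run each step on every sample; stop at the first step with failures.

    steps is a list of (name, function) pairs, where function(sample)
    returns True on success. Returns (failed_step_name, failed_samples),
    or (None, []) when every step succeeded for every sample.
    """
    for step_name, step_fn in steps:
        failed = [sample for sample in samples if not step_fn(sample)]
        if failed:
            # One failing sample blocks the whole batch from moving on.
            return step_name, failed
    return None, []
```

In this model, removing the failed samples from the batch and running again lets the remaining samples proceed.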
Q: Where can I find the results?
A: Intermediate output files for each step are created automatically under the input path. To run a single step independently, select the corresponding individual analysis tool on the home page.
Q: Can I enter any thread number?
A: Yes. The system automatically handles thread numbers that exceed the upper limit.
Example Configuration
Example Input Path
/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/fastq/
├── sample1/sample1.R1.fastq.gz
├── sample1/sample1.R2.fastq.gz
├── sample2/sample2.R1.fastq.gz
└── sample2/sample2.R2.fastq.gz
Example Output
/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test/vcf/
├── sample1/
│ ├── sample1_RG.sorted/
│ └── rg_added_bams/
├── DNA-25M-1-test/
│ ├── DNA-25M-1-test_RG.sorted/
│ │ ├── DNA-25M-1-test_RG.sorted.dedup.bai
│ │ ├── DNA-25M-1-test_RG.sorted.dedup.bam
│ │ ├── DNA-25M-1-test_RG.sorted.dup_metrics.txt
│ │ ├── DNA-25M-1-test_RG.sorted.g.vcf.gz
│ │ ├── DNA-25M-1-test_RG.sorted.g.vcf.gz.tbi
│ │ ├── DNA-25M-1-test_RG.sorted.recal.bai
│ │ ├── DNA-25M-1-test_RG.sorted.recal.bam
│ │ └── DNA-25M-1-test_RG.sorted.recal_data.table
│ └── rg_added_bams/
│   ├── DNA-25M-1-test_RG.sorted.bam
│   └── DNA-25M-1-test_RG.sorted.bam.bai
└── sample2/
  ├── sample2_RG.sorted/
  └── rg_added_bams/
# Intermediate steps output note
The intermediate steps (quality control, trimming, genome alignment) also write their output in a similar directory structure.
Complete Input Parameters Example
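The naming pattern in the example output can be expressed as a small helper that lists the files to expect for one sample. This is a sketch inferred only from the DNA-25M-1-test entry in the example tree; the function name expected_outputs is hypothetical, and other runs may produce additional files.

```python
from pathlib import PurePosixPath

# Suffixes seen under <sample>_RG.sorted/ in the example output tree.
SORTED_SUFFIXES = [
    "dedup.bai",
    "dedup.bam",
    "dup_metrics.txt",
    "g.vcf.gz",
    "g.vcf.gz.tbi",
    "recal.bai",
    "recal.bam",
    "recal_data.table",
]


def expected_outputs(input_path, sample):
    """List the output paths the example tree shows for one sample."""
    base = PurePosixPath(input_path) / "vcf" / sample
    stem = f"{sample}_RG.sorted"
    paths = [base / stem / f"{stem}.{suffix}" for suffix in SORTED_SUFFIXES]
    paths.append(base / "rg_added_bams" / f"{stem}.bam")
    paths.append(base / "rg_added_bams" / f"{stem}.bam.bai")
    return [str(p) for p in paths]
```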
Input path: /pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test
Reference path: /pool/sun/ghrunner/lyx/ref/hg38
Job name: test_run
Thread number: 64
Run location: pipeline
Token: [your Secret Key]
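The parameters above can be sanity-checked before submission. The sketch below is an assumption-laden illustration: the class RunParameters and its field names are hypothetical, and the checks only reflect advice given in this guide (absolute paths, a non-empty Secret Key) plus a basic positive-integer check on the thread number.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class RunParameters:
    input_path: str
    reference_path: str
    job_name: str
    thread_number: int
    run_location: str
    token: str

    def problems(self):
        """Return a list of issues found in the parameter values."""
        issues = []
        for label, value in (
            ("input path", self.input_path),
            ("reference path", self.reference_path),
        ):
            if not PurePosixPath(value).is_absolute():
                issues.append(f"{label} must be an absolute path")
        if self.thread_number < 1:
            issues.append("thread number must be a positive integer")
        if not self.token:
            issues.append("token (Secret Key) is empty")
        return issues


params = RunParameters(
    input_path="/pool/sun/ghrunner/lyx/20250729_GATK_pipeline_test",
    reference_path="/pool/sun/ghrunner/lyx/ref/hg38",
    job_name="test_run",
    thread_number=64,
    run_location="pipeline",
    token="[your Secret Key]",
)
print(params.problems())
```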