Project Description YAML file

Cookiecutter accepts a YAML file as the configuration file for creating the project template. This YAML file is built from the following parameters:

{
    "author_name": "Roberto Vera Alvarez",
    "email": "veraalva@ncbi.nlm.nih.gov",
    "project_name": "my_ngs_project",
    "dataset_name": "my_dataset_name",
    "is_data_in_SRA": "y" or "n",
    "ngs_data_type": ["RNA-Seq", "ChIP-Seq", "ChIP-exo"],
    "sequencing_technology": ["single-end", "paired-end"],
    "create_demo": "y" or "n",
    "number_spots": "5000000",
    "organism": "human",
    "genome_dir": "/gfs/data/genomes/igenomes/Homo_sapiens/UCSC/hg19",
    "genome_name": "hg19",
    "aligner_index_dir": "{{ cookiecutter.genome_dir }}/ALIGNER",
    "genome_fasta": "{{ cookiecutter.genome_dir }}/genome.fa",
    "genome_gtf": "{{ cookiecutter.genome_dir }}/genome.gtf",
    "genome_gff": "{{ cookiecutter.genome_dir }}/genome.gff",
    "genome_gff3": "{{ cookiecutter.genome_dir }}/genome.gff3",
    "genome_bed": "{{ cookiecutter.genome_dir }}/genome.bed",
    "genome_chromsizes": "{{ cookiecutter.genome_dir }}/genome.sizes",
    "genome_mappable_size": "hg19",
    "genome_blacklist": "{{ cookiecutter.genome_dir }}/hg19-blacklist.bed",
    "fold_change": "2.0",
    "fdr": "0.05",
    "use_docker": "y" or "n",
    "pull_images": "y" or "n",
    "use_conda": "y" or "n",
    "cwl_runner": "cwl-runner",
    "cwl_workflow_repo": "https://github.com/ncbi/cwl-ngs-workflows-cbb",
    "create_virtualenv": "y" or "n",
    "use_gnu_parallel": "y" or "n",
    "max_number_threads": "16"
}
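
As a minimal sketch of how these parameters could be turned into a YAML config file, the Python snippet below renders a flat dictionary under a top-level default_context key, which is the key Cookiecutter reads default values from in a user config file. The helper name to_cookiecutter_yaml and the subset of parameters shown are illustrative assumptions, not part of this project; it also assumes every value is a plain string.

```python
import json


def to_cookiecutter_yaml(params):
    """Render a flat dict of string parameters as YAML text.

    Hypothetical helper for illustration: values go under a top-level
    `default_context` key, assumed here to be the layout the project expects.
    """
    lines = ["default_context:"]
    for key, value in params.items():
        # json.dumps produces a double-quoted string, which is also a valid
        # YAML scalar and keeps Jinja expressions such as "{{ ... }}" intact.
        lines.append(f"  {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


# Illustrative subset of the parameters listed above.
params = {
    "project_name": "my_ngs_project",
    "dataset_name": "my_dataset_name",
    "is_data_in_SRA": "y",
    "ngs_data_type": "RNA-Seq",
    "sequencing_technology": "paired-end",
    "genome_dir": "/gfs/data/genomes/igenomes/Homo_sapiens/UCSC/hg19",
    "genome_name": "hg19",
    "genome_fasta": "{{ cookiecutter.genome_dir }}/genome.fa",
}

yaml_text = to_cookiecutter_yaml(params)
print(yaml_text)
```

The resulting text can be saved to a file and, if the assumed layout matches, passed to Cookiecutter through its --config-file option. Choice parameters such as ngs_data_type take a single selected value in the config file rather than the full list of options.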

Parameters

  • author_name: Project author name

  • email: Author's email

  • project_name: Name of the project, with no spaces or special characters. This will be used as the project folder's name.

  • dataset_name: Name of the dataset to process, with no spaces or special characters. This will be used as the folder name to group the data. Folders will be created at data/{{dataset_name}} and results/{{dataset_name}}.

  • is_data_in_SRA: If the data is in the SRA, set this to y. A CWL workflow that downloads the data from the SRA database to the folder data/{{dataset_name}} and runs FastQC on it will be included in the 01 - Pre-processing QC.ipynb notebook.

    If this option is set to n, the fastq files should be copied to the folder data/{{dataset_name}}/

  • ngs_data_type: Select one of the available technologies to process:
    1. RNA-Seq

    2. ChIP-Seq

    3. ChIP-exo

  • sequencing_technology: Select one of the available sequencing technologies in your data:
    1. single-end

    2. paired-end

    Mixed datasets with single and paired-end samples should be processed independently.

  • create_demo: If the data is downloaded from the SRA and this option is set to y, only the number of spots specified in number_spots will be downloaded. This is useful for testing the workflow.

  • number_spots: Number of spots to download from the SRA database. It is ignored if create_demo is set to n.

  • organism: Organism to process, e.g. human. This is used to link the selected genes to the NCBI gene database.

  • genome_dir: Absolute path to the directory with the genome annotation to be used by the workflow.

  • genome_name: Genome name, e.g. hg38 or mm10.

  • aligner_index_dir: Absolute path to the directory with the aligner indexes.

  • genome_fasta: Absolute path to the genome FASTA file.

  • genome_gtf: Absolute path to the genome GTF file.

  • genome_gff: Absolute path to the genome GFF file.

  • genome_gff3: Absolute path to the genome GFF3 file.

  • genome_bed: Absolute path to the genome BED file. Not all of these annotation files are required to exist; which ones are needed depends on the workflow executed.

  • genome_chromsizes: Genome chromosome sizes file, e.g. hg19.chrom.sizes.

  • genome_mappable_size: Genome mappable size used by MACS. For human, it can be hg38; for other genomes, it is a number.

  • genome_blacklist: Genome blacklist file.

  • fold_change: A real number used as fold change value, e.g. 2.0.

  • fdr: Adjusted P-Value to be used, e.g. 0.05.

  • use_docker: Set this to y if you will be using Docker.

  • pull_images: Set this to y if you want to pull the required Docker images during the project structure creation.

  • use_conda: Set this to y if you want to use Conda. The environments required by the ngs_data_type to process will be installed during the project structure creation.

  • cwl_runner: Absolute path to the cwl-runner.

  • cwl_workflow_repo: Always use: https://github.com/ncbi/cwl-ngs-workflows-cbb. This repo will be cloned in the bin folder.

  • create_virtualenv: Set this to y if you are using neither Docker nor Conda, to create a Python virtual environment in a folder named venv.

  • use_gnu_parallel: Use GNU Parallel for parallel execution of the jobs.

  • max_number_threads: Number of threads available on the host.
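
The constraints described in the parameter list above can be checked before the config file is written. The following is a rough sketch under stated assumptions: the function check_params is hypothetical and not part of this project, "no spaces or special characters" is interpreted here as letters, digits, underscores, and hyphens only, and all values are assumed to be strings as in the listing.

```python
import re

# Parameters documented as y/n flags.
YES_NO_KEYS = (
    "is_data_in_SRA", "create_demo", "use_docker", "pull_images",
    "use_conda", "create_virtualenv", "use_gnu_parallel",
)
# Parameters documented with a fixed set of options.
CHOICES = {
    "ngs_data_type": ("RNA-Seq", "ChIP-Seq", "ChIP-exo"),
    "sequencing_technology": ("single-end", "paired-end"),
}
NAME_KEYS = ("project_name", "dataset_name")
INT_KEYS = ("number_spots", "max_number_threads")
FLOAT_KEYS = ("fold_change", "fdr")


def check_params(params):
    """Return a list of problems found in a dict of string parameters."""
    errors = []
    for key in YES_NO_KEYS:
        if key in params and params[key] not in ("y", "n"):
            errors.append(f"{key}: expected 'y' or 'n'")
    for key, allowed in CHOICES.items():
        if key in params and params[key] not in allowed:
            errors.append(f"{key}: expected one of {', '.join(allowed)}")
    for key in NAME_KEYS:
        # Assumed definition of "no spaces or special characters".
        if key in params and not re.fullmatch(r"[A-Za-z0-9_-]+", params[key]):
            errors.append(f"{key}: spaces and special characters are not allowed")
    for key in INT_KEYS:
        if key in params and not params[key].isdigit():
            errors.append(f"{key}: expected a whole number")
    for key in FLOAT_KEYS:
        if key in params:
            try:
                float(params[key])
            except ValueError:
                errors.append(f"{key}: expected a real number")
    return errors


good = {"project_name": "my_ngs_project", "use_docker": "y",
        "ngs_data_type": "RNA-Seq", "max_number_threads": "16", "fdr": "0.05"}
bad = {"project_name": "my project", "use_docker": "yes", "number_spots": "5e6"}

print(check_params(good))
for problem in check_params(bad):
    print(problem)
```

Only keys present in the dictionary are checked, so the sketch does not enforce which parameters are mandatory; path parameters such as genome_dir are left unchecked because the files they point to depend on the workflow executed.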