Project Description YAML file
=============================

Cookiecutter accepts a YAML file as a configuration file for creating the project from the template. This YAML file
is built from the following parameters:

.. code-block:: json
    :linenos:

    {
        "author_name": "Roberto Vera Alvarez",
        "email": "veraalva@ncbi.nlm.nih.gov",
        "project_name": "my_ngs_project",
        "dataset_name": "my_dataset_name",
        "is_data_in_SRA": ["y", "n"],
        "ngs_data_type": ["RNA-Seq", "ChIP-Seq", "ChIP-exo"],
        "sequencing_technology": ["single-end", "paired-end"],
        "create_demo": ["y", "n"],
        "number_spots": "5000000",
        "organism": "human",
        "genome_dir": "/gfs/data/genomes/igenomes/Homo_sapiens/UCSC/hg19",
        "genome_name": "hg19",
        "aligner_index_dir": "{{ cookiecutter.genome_dir }}/ALIGNER",
        "genome_fasta": "{{ cookiecutter.genome_dir }}/genome.fa",
        "genome_gtf": "{{ cookiecutter.genome_dir }}/genome.gtf",
        "genome_gff": "{{ cookiecutter.genome_dir }}/genome.gff",
        "genome_gff3": "{{ cookiecutter.genome_dir }}/genome.gff3",
        "genome_bed": "{{ cookiecutter.genome_dir }}/genome.bed",
        "genome_chromsizes": "{{ cookiecutter.genome_dir }}/genome.sizes",
        "genome_mappable_size": "hg19",
        "genome_blacklist": "{{ cookiecutter.genome_dir }}/hg19-blacklist.bed",
        "fold_change": "2.0",
        "fdr": "0.05",
        "use_docker": ["y", "n"],
        "pull_images": ["y", "n"],
        "use_conda": ["y", "n"],
        "cwl_runner": "cwl-runner",
        "cwl_workflow_repo": "https://github.com/ncbi/cwl-ngs-workflows-cbb",
        "create_virtualenv": ["y", "n"],
        "use_gnu_parallel": ["y", "n"],
        "max_number_threads": "16"
    }
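
As a rough illustration of how such a YAML file could be assembled, the Python sketch below writes a subset of the
parameters above under a ``default_context`` key, which is the key Cookiecutter's user configuration file uses to
override template defaults. The helper name ``to_config_yaml`` and the chosen subset of parameters are assumptions made
for this example, not part of the template itself.

.. code-block:: python

    def to_config_yaml(params):
        """Render a flat dict of string parameters as a Cookiecutter config YAML."""
        lines = ["default_context:"]
        for key, value in params.items():
            # Quote every value so that "y", "n" and numbers stay strings in YAML.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'    {key}: "{escaped}"')
        return "\n".join(lines) + "\n"


    # Hypothetical subset of the parameters; add the remaining keys as needed.
    params = {
        "author_name": "Roberto Vera Alvarez",
        "project_name": "my_ngs_project",
        "dataset_name": "my_dataset_name",
        "is_data_in_SRA": "y",
        "ngs_data_type": "RNA-Seq",
        "sequencing_technology": "paired-end",
    }

    config_text = to_config_yaml(params)
    print(config_text)

The resulting text can be saved to a file and passed to Cookiecutter through its ``--config-file`` option, typically
together with ``--no-input``.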


.. topic:: Parameters

    * **author_name**: Project author name
    * **email**: Author's email
    * **project_name**: Name of the project, with no spaces or special characters. This will be used as the project
      folder's name.
    * **dataset_name**: Name of the dataset to process, with no spaces or special characters. This will be used as the
      folder name to group the data. Folders named after it will be created as **data/{{dataset_name}}** and
      **results/{{dataset_name}}**.
    * **is_data_in_SRA**: If the data is in the SRA, set this to **y**. A CWL workflow that downloads the data from the
      SRA database to the folder **data/{{dataset_name}}** and executes FastQC on it will be included in the
      **01 - Pre-processing QC.ipynb** notebook.

      If this option is set to **n**, the fastq files should be copied to the folder **data/{{dataset_name}}/**.
    * **ngs_data_type**: Select one of the available technologies to process:

      1. RNA-Seq
      2. ChIP-Seq
      3. ChIP-exo

    * **sequencing_technology**: Select the sequencing technology used in your data:

      1. single-end
      2. paired-end

      Mixed datasets with single-end and paired-end samples should be processed independently.
    * **create_demo**: If the data is downloaded from the SRA and this option is set to **y**, only the number of
      spots specified in **number_spots** will be downloaded. Useful for testing the workflow.
    * **number_spots**: Number of spots to download from the SRA database. It is ignored if **create_demo** is set
      to **n**.
    * **organism**: Organism to process, e.g. human. This is used to link the selected genes to the NCBI gene database.
    * **genome_dir**: Absolute path to the directory with the genome annotation to be used by the workflow.
    * **genome_name**: Genome name, e.g. hg38 or mm10.
    * **aligner_index_dir**: Absolute path to the directory with the aligner indexes.
    * **genome_fasta**: Absolute path to the genome FASTA file.
    * **genome_gtf**: Absolute path to the genome GTF file.
    * **genome_gff**: Absolute path to the genome GFF file.
    * **genome_gff3**: Absolute path to the genome GFF3 file.
    * **genome_bed**: Absolute path to the genome BED file.

      Not all of these annotation files need to exist; which ones are required depends on the workflow executed.
    * **genome_chromsizes**: Genome chromosome sizes file, such as `hg19.chrom.sizes`_.
    * **genome_mappable_size**: Genome mappable size used by MACS. For human it can be hg38; for other genomes
      it is a number.
    * **genome_blacklist**: Genome blacklist file.
    * **fold_change**: A real number used as fold change value, e.g. 2.0.
    * **fdr**: Adjusted p-value threshold to be used, e.g. 0.05.
    * **use_docker**: Set this to **y** if you will be using Docker.
    * **pull_images**: Set this to **y** if you want to pull the required Docker images during the project structure
      creation.
    * **use_conda**: Set this to **y** if you want to use Conda. The Conda environments required by the selected
      **ngs_data_type** will be installed during the project structure creation.
    * **cwl_runner**: Absolute path to the cwl-runner.
    * **cwl_workflow_repo**: Always use: https://github.com/ncbi/cwl-ngs-workflows-cbb. This repo will be cloned in the
      **bin** folder.
    * **create_virtualenv**: Set this to **y** if you are using neither Docker nor Conda, so that a Python virtual
      environment is created in the **venv** folder.
    * **use_gnu_parallel**: Set this to **y** to use `GNU Parallel`_ for parallel execution of the jobs.
    * **max_number_threads**: Number of threads available on the host.
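
The constraints described in the parameter list can be checked before running Cookiecutter. The sketch below shows one
possible way to do that in Python. The function name ``validate_params`` and the exact rule for "no spaces or special
characters" (letters, digits, underscores and hyphens only) are assumptions for illustration, not rules taken from the
template.

.. code-block:: python

    import re

    # Assumed definition of a name without spaces or special characters.
    NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")
    YES_NO_KEYS = [
        "is_data_in_SRA", "create_demo", "use_docker", "pull_images",
        "use_conda", "create_virtualenv", "use_gnu_parallel",
    ]
    CHOICES = {
        "ngs_data_type": ["RNA-Seq", "ChIP-Seq", "ChIP-exo"],
        "sequencing_technology": ["single-end", "paired-end"],
    }


    def validate_params(params):
        """Return a list of human-readable problems found in a dict of string parameters."""
        problems = []
        for key in ("project_name", "dataset_name"):
            if key in params and not NAME_PATTERN.fullmatch(params[key]):
                problems.append(f"{key} contains spaces or special characters")
        for key in YES_NO_KEYS:
            if key in params and params[key] not in ("y", "n"):
                problems.append(f"{key} must be 'y' or 'n'")
        for key, allowed in CHOICES.items():
            if key in params and params[key] not in allowed:
                problems.append(f"{key} must be one of {allowed}")
        for key in ("fold_change", "fdr"):
            if key in params:
                try:
                    float(params[key])
                except ValueError:
                    problems.append(f"{key} must be a number")
        return problems


    problems = validate_params({
        "project_name": "my ngs project",  # invalid: contains spaces
        "dataset_name": "my_dataset_name",
        "ngs_data_type": "RNA-Seq",
        "use_docker": "yes",               # invalid: not 'y' or 'n'
        "fdr": "0.05",
    })
    print(problems)

Only the keys present in the dictionary are checked, so the same function can be applied to a partial set of
parameters.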


.. _hg19.chrom.sizes: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
.. _GNU Parallel: https://www.gnu.org/software/parallel/