cfsan_bettercallsal: 0.6.1/readme/bettercallsal

annotate 0.6.1/readme/bettercallsal_db.md @ 16:b90e5a7a3d4f

"planemo upload"

author	kkonganti
date	Thu, 07 Sep 2023 15:22:10 -0400
parents	749faef1caa9

# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (for example, `latest_snps` or `PDG000000002.2727`) and then creates a `mash sketch` based on the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value in the `computed_serotype` column of the metadata file (`per_computed_serotype`).
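
The `per_computed_serotype` strategy can be pictured as capping how many accessions are kept per serotype value. The following is a minimal sketch of that idea only, not the workflow's actual implementation: the two-column layout, the accession names, and the function name `cap_per_serotype` are all hypothetical.

```bash
#!/usr/bin/env bash
# Toy illustration only: keep at most N rows per serotype value.
# Assumed (hypothetical) layout: column 1 = accession, column 2 = serotype.
cap_per_serotype() {
    max_per_group="$1"
    # seen[] counts rows per serotype; a row is printed while the count <= N.
    awk -F'\t' -v n="$max_per_group" '++seen[$2] <= n'
}

# Made-up example rows; the real PDG metadata file has many more columns.
printf 'GCA_A\tAgona\nGCA_B\tAgona\nGCA_C\tAgona\nGCA_D\tDublin\n' \
    | cap_per_serotype 2
```

With a cap of 2, the third `Agona` row is dropped while the single `Dublin` row is kept.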

The `bettercallsal_db` workflow should finish within an hour with a stable internet connection.

\


## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

\


Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727`.

```bash
cpipes \
      --pipeline bettercallsal_db \
      --pdg_release PDG000000002.2727 \
      --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727
```
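
Because the example stores the database under a directory named after the release, a small wrapper can keep the release identifier and the output path in sync. This is a sketch under assumptions: `DB_BASE` and `build_db_cmd` are hypothetical names, and only the flags shown in this README are used. It prints the command for review instead of running it.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: builds the bettercallsal_db command line for a release.
# DB_BASE is an assumed base directory; set it to an absolute path on your system.
DB_BASE="${DB_BASE:-/data/Kranti_Konganti/bettercallsal_db}"

build_db_cmd() {
    release="$1"
    # The output directory is named after the release identifier.
    printf '%s\n' "cpipes --pipeline bettercallsal_db --pdg_release ${release} --output ${DB_BASE}/${release}"
}

# Print the command for review; run it yourself once it looks right.
build_db_cmd "PDG000000002.2727"
```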

\


Now you can run the `bettercallsal` workflow with the newly created database by passing the database root path with the `--bcs_root_dbdir` option.

```bash
cpipes \
      --pipeline bettercallsal \
      --input /path/to/illumina/fastq/dir \
      --output /path/to/output \
      --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727
```
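
Before launching `bettercallsal`, it may help to confirm that the directory passed to `--bcs_root_dbdir` exists and is not empty. The check below is a hedged sketch: `check_db_dir` is a hypothetical helper, and it does not validate the database contents, only that something is there.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for the path given to --bcs_root_dbdir.
check_db_dir() {
    db_dir="$1"
    if [ ! -d "$db_dir" ]; then
        echo "Database directory not found: $db_dir" >&2
        return 1
    fi
    # ls -A lists entries (including dotfiles); empty output means an empty dir.
    if [ -z "$(ls -A "$db_dir")" ]; then
        echo "Database directory is empty: $db_dir" >&2
        return 1
    fi
    return 0
}

if check_db_dir "/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2727"; then
    echo "OK: directory can be passed to --bcs_root_dbdir"
fi
```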

\


## Note

The last step of the `bettercallsal_db` workflow, `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by Nextflow. This is intentional for this stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing setting.

\


## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W ~ version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name : bettercallsal
Author : Kranti Konganti
Version : 0.6.1
Center : CFSAN, FDA.
================================================================================

Workflow : bettercallsal_db

Author : Kranti Konganti

Version : 0.6.1


Required :

--output                 : Absolute path to directory where all the
                           pipeline outputs should be stored. Ex: --
                           output /path/to/output

Other options :

--wcomp_serocol          : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wcomp_complete_sero    : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: false

--wcomp_not_null_serovar : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wcomp_i                : Force include this serovar. Ignores --
                           wcomp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wcomp_complete_sero I 4,[5],12:i:-!Agona
                           Default: false

--wcomp_num              : Number of genome accessions to be collected
                           per serotype. Default: false

--wcomp_min_contig_size  : Minimum contig size to consider a genome
                           for indexing. Default: false

--wsnp_serocol           : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wsnp_complete_sero     : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: true

--wsnp_not_null_serovar  : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wsnp_i                 : Force include this serovar. Ignores --
                           wsnp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wsnp_complete_sero I 4,[5],12:i:-!Agona
                           Default: 'I 4,[5],12:i

--wsnp_num               : Number of genome accessions to collect per
                           SNP cluster. Default: false

--mashsketch_run         : Run `mash screen` tool. Default: true

--mashsketch_l           : List input. Lines in each <input> specify
                           paths to sequence files, one per line.
                           Default: true

--mashsketch_I           : <path> ID field for sketch of reads (
                           instead of first sequence ID). Default:
                           false

--mashsketch_C           : <path> Comment for a sketch of reads (
                           instead of first sequence comment). Default
                           : false

--mashsketch_k           : <int> K-mer size. Hashes will be based on
                           strings of this many nucleotides.
                           Canonical nucleotides are used by default (
                           see Alphabet options below). (1-32) Default
                           : 21

--mashsketch_s           : <int> Sketch size. Each sketch will have
                           at most this many non-redundant min-hashes
                           . Default: 1000

--mashsketch_i           : Sketch individual sequences, rather than
                           whole files, e.g. for multi-fastas of
                           single-chromosome genomes or pair-wise gene
                           comparisons. Default: false

--mashsketch_S           : <int> Seed to provide to the hash
                           function. (0-4294967296) [42] Default:
                           false

--mashsketch_w           : <num> Probability threshold for warning
                           about low k-mer size. (0-1) Default: false

--mashsketch_r           : Input is a read set. See Reads options
                           below. Incompatible with --mashsketch_i.
                           Default: false

--mashsketch_b           : <size> Use a Bloom filter of this size (
                           raw bytes or with K/M/G/T) to filter out
                           unique k-mers. This is useful if exact
                           filtering with --mashsketch_m uses too much
                           memory. However, some unique k-mers may
                           pass erroneously, and copies cannot be
                           counted beyond 2. Implies --mashsketch_r.
                           Default: false

--mashsketch_m           : <int> Minimum copies of each k-mer
                           required to pass noise filter for reads.
                           Implies --mashsketch_r. Default: false

--mashsketch_c           : <num> Target coverage. Sketching will
                           conclude if this coverage is reached before
                           the end of the input file (estimated by
                           average k-mer multiplicity). Implies --
                           mashsketch_r. Default: false

--mashsketch_g           : <size> Genome size (raw bases or with K/M/
                           G/T). If specified, will be used for p-
                           value calculation instead of an estimated
                           size from k-mer content. Implies --
                           mashsketch_r. Default: false

--mashsketch_n           : Preserve strand (by default, strand is
                           ignored by using canonical DNA k-mers,
                           which are alphabetical minima of forward-
                           reverse pairs). Implied if an alphabet is
                           specified with --mashsketch_a or --
                           mashsketch_z. Default: false

--mashsketch_a           : Use amino acid alphabet (A-Z, except BJOUXZ
                           ). Implies --mashsketch_n --mashsketch_k 9
                           . Default: false

--mashsketch_z           : <text> Alphabet to base hashes on (case
                           ignored by default; see --mashsketch_Z). K-
                           mers with other characters will be ignored
                           . Implies --mashsketch_n. Default: false

--mashsketch_Z           : Preserve case in k-mers and alphabet (case
                           is ignored by default). Sequence letters
                           whose case is not in the current alphabet
                           will be skipped when sketching. Default:
                           false

Help options :

--help                   : Display this message.

```
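
The help text states that multiple serovars for `--wcomp_i` and `--wsnp_i` are joined with a `!`. Serovar names such as `I 4,[5],12:i:-` contain spaces and characters the shell treats specially, so quoting matters. The sketch below, which assumes a hypothetical helper named `join_serovars`, builds the joined value and prints the resulting option for review rather than running the pipeline.

```bash
#!/usr/bin/env bash
# Join serovar names with "!" as described for --wcomp_i / --wsnp_i.
# join_serovars is a hypothetical helper, not part of cpipes.
join_serovars() {
    # "$*" joins all arguments using the first character of IFS.
    local IFS='!'
    printf '%s\n' "$*"
}

# Single quotes keep the shell from interpreting "[", "!" and spaces.
include_arg="$(join_serovars 'I 4,[5],12:i:-' 'Agona')"
printf '%s\n' "--wsnp_i '${include_arg}'"
```

Single quotes are the safer choice when typing such a value at an interactive bash prompt, where `!` inside double quotes can trigger history expansion.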
