cfsan_bettercallsal: 0.7.0/readme/bettercallsal

annotate 0.7.0/readme/bettercallsal_db.md @ 18:75558ffe3e68

planemo upload

author	kkonganti
date	Mon, 15 Jul 2024 11:01:13 -0400
# bettercallsal_db

`bettercallsal_db` is an end-to-end automated workflow that generates and consolidates the required DB flat files from the [NCBI Pathogens Database for Salmonella](https://ftp.ncbi.nlm.nih.gov/pathogen/Results/Salmonella/). It first downloads the metadata for the provided release identifier (e.g. `latest_snps` or `PDG000000002.2876`) and then creates a `mash sketch` according to the filtering strategy. It generates two types of sketches: one prioritizes genome collection based on SNP clustering (`per_snp_cluster`), and the other collects up to N genome accessions for each value of the `computed_serotype` column in the metadata file (`per_computed_serotype`).

The `bettercallsal_db` workflow should finish within an hour with a stable internet connection.
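
Several options of this workflow (for example `--wcomp_serocol` and `--wcomp_acc_col`) expect 1-based column numbers from the PDG metadata file. The following is a minimal sketch for looking up such a number by column name. It assumes the metadata file is tab-separated with the header on its first line; `column_index` is a hypothetical helper, and the demo header is made up and does not reflect the real PDG column layout.

```bash
# Print the 1-based column number of a named column in a tab-separated file.
# Assumption: the first line of the file is the header row.
column_index() {
    awk -F'\t' -v name="$1" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == name) { print i; found = 1; break }
            }
            exit
        }
        END { exit(found ? 0 : 1) }
    ' "$2"
}

# Demo on a made-up header (NOT the real PDG column layout).
demo_tsv="$(mktemp)"
printf 'target_acc\tserovar\tacc\tcomputed_serotype\n' > "$demo_tsv"
column_index computed_serotype "$demo_tsv"   # prints 4
rm -f "$demo_tsv"
```

The function returns a non-zero exit status when the column name is not found, so it can be used in an `if` test before passing the number to the workflow.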

## Workflow Usage

```bash
cpipes --pipeline bettercallsal_db [options]
```

Example: Run the `bettercallsal_db` pipeline and store output at `/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876`.

```bash
cpipes \
    --pipeline bettercallsal_db \
    --pdg_release PDG000000002.2876 \
    --output /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```

You can now run the `bettercallsal` workflow against the newly created database by passing the database root path with the `--bcs_root_dbdir` option.

```bash
cpipes \
    --pipeline bettercallsal \
    --input /path/to/illumina/fastq/dir \
    --output /path/to/output \
    --bcs_root_dbdir /data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876
```
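
Before starting a `bettercallsal` run, it can help to confirm that the path given to `--bcs_root_dbdir` exists and is not empty. A minimal sketch follows; `db_dir_ready` is a hypothetical helper, and it only checks that the directory is populated, not that its contents form a valid database.

```bash
# Return 0 if the given path is a directory that contains at least one entry.
db_dir_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Path taken from the example above; adjust to your own database location.
bcs_db="/data/Kranti_Konganti/bettercallsal_db/PDG000000002.2876"
if db_dir_ready "$bcs_db"; then
    echo "Database directory looks populated: $bcs_db"
else
    echo "Database directory is missing or empty: $bcs_db" >&2
fi
```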

## Note

Please note that the last step of the `bettercallsal_db` workflow, named `SCAFFOLD_GENOMES`, spawns multiple processes and is not cached by Nextflow. This is intentional for this stage of the workflow, to speed up database creation. It is therefore recommended that you run this workflow in a grid computing or similar cloud computing environment.

## `bettercallsal_db` CLI Help

```text
[Kranti_Konganti@my-unix-box ]$ cpipes --pipeline bettercallsal_db --help
N E X T F L O W ~ version 23.04.3
Launching `./bettercallsal/cpipes` [special_brenner] DSL2 - revision: 8da4e11078
================================================================================
             (o)
  ___  _ __   _  _ __    ___  ___
 / __|| '_ \ | || '_ \  / _ \/ __|
| (__ | |_) || || |_) ||  __/\__ \
 \___|| .__/ |_|| .__/  \___||___/
      | |       | |
      |_|       |_|
--------------------------------------------------------------------------------
A collection of modular pipelines at CFSAN, FDA.
--------------------------------------------------------------------------------
Name                     : bettercallsal
Author                   : Kranti Konganti
Version                  : 0.7.0
Center                   : CFSAN, FDA.
================================================================================

Workflow                 : bettercallsal_db

Author                   : Kranti Konganti

Version                  : 0.7.0


Required                 :

--output                 : Absolute path to directory where all the
                           pipeline outputs should be stored. Ex: --
                           output /path/to/output

Other options            :

--wcomp_serocol          : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wcomp_seronamecol      : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "
                           serovar". Default: false

--wcomp_acc_col          : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "acc
                           ". Default: false

--wcomp_target_acc_col   : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "
                           target_acc". Default: false

--wcomp_complete_sero    : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: false

--wcomp_not_null_serovar : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wcomp_i                : Force include this serovar. Ignores --
                           wcomp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wcomp_complete_sero I 4,[5],12:i:-!Agona
                           Default: false

--wcomp_num              : Number of genome accessions to be collected
                           per serotype. Default: false

--wcomp_min_contig_size  : Minimum contig size to consider a genome
                           for indexing. Default: false

--wsnp_serocol           : Column number (non 0-based index) of the
                           PDG metadata file by which the serotypes
                           are collected. Default: false

--wsnp_seronamecol       : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "
                           serovar". Default: false

--wsnp_acc_col           : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "acc
                           ". Default: false

--wsnp_target_acc_col    : Column number (non 0-based index) of the
                           PDG metadata file whose column name is "
                           target_acc". Default: false

--wsnp_complete_sero     : Skip indexing serotypes when the serotype
                           name in the column number 49 (non 0-based)
                           of PDG metadata file consists a "-". For
                           example, if an accession has a serotype=
                           string as such in column number 49 (non 0-
                           based): "serotype=- 13:z4,z23:-" then, the
                           indexing of that accession is skipped.
                           Default: true

--wsnp_not_null_serovar  : Only index the computed_serotype column i.e
                           . column number 49 (non 0-based), if the
                           serovar column is not NULL. Default: false

--wsnp_i                 : Force include this serovar. Ignores --
                           wsnp_complete_sero for only this serovar.
                           Mention multiple serovars separated by a
                           ! (Exclamation mark). Ex: --
                           wsnp_complete_sero I 4,[5],12:i:-!Agona
                           Default: 'I 4,[5],12:i

--wsnp_num               : Number of genome accessions to collect per
                           SNP cluster. Default: false

--mashsketch_run         : Run `mash screen` tool. Default: true

--mashsketch_l           : List input. Lines in each <input> specify
                           paths to sequence files, one per line.
                           Default: true

--mashsketch_I           : <path> ID field for sketch of reads (
                           instead of first sequence ID). Default:
                           false

--mashsketch_C           : <path> Comment for a sketch of reads (
                           instead of first sequence comment). Default
                           : false

--mashsketch_k           : <int> K-mer size. Hashes will be based on
                           strings of this many nucleotides.
                           Canonical nucleotides are used by default (
                           see Alphabet options below). (1-32) Default
                           : 21

--mashsketch_s           : <int> Sketch size. Each sketch will have
                           at most this many non-redundant min-hashes
                           . Default: 1000

--mashsketch_i           : Sketch individual sequences, rather than
                           whole files, e.g. for multi-fastas of
                           single-chromosome genomes or pair-wise gene
                           comparisons. Default: false

--mashsketch_S           : <int> Seed to provide to the hash
                           function. (0-4294967296) [42] Default:
                           false

--mashsketch_w           : <num> Probability threshold for warning
                           about low k-mer size. (0-1) Default: false

--mashsketch_r           : Input is a read set. See Reads options
                           below. Incompatible with --mashsketch_i.
                           Default: false

--mashsketch_b           : <size> Use a Bloom filter of this size (
                           raw bytes or with K/M/G/T) to filter out
                           unique k-mers. This is useful if exact
                           filtering with --mashsketch_m uses too much
                           memory. However, some unique k-mers may
                           pass erroneously, and copies cannot be
                           counted beyond 2. Implies --mashsketch_r.
                           Default: false

--mashsketch_m           : <int> Minimum copies of each k-mer
                           required to pass noise filter for reads.
                           Implies --mashsketch_r. Default: false

--mashsketch_c           : <num> Target coverage. Sketching will
                           conclude if this coverage is reached before
                           the end of the input file (estimated by
                           average k-mer multiplicity). Implies --
                           mashsketch_r. Default: false

--mashsketch_g           : <size> Genome size (raw bases or with K/M/
                           G/T). If specified, will be used for p-
                           value calculation instead of an estimated
                           size from k-mer content. Implies --
                           mashsketch_r. Default: false

--mashsketch_n           : Preserve strand (by default, strand is
                           ignored by using canonical DNA k-mers,
                           which are alphabetical minima of forward-
                           reverse pairs). Implied if an alphabet is
                           specified with --mashsketch_a or --
                           mashsketch_z. Default: false

--mashsketch_a           : Use amino acid alphabet (A-Z, except BJOUXZ
                           ). Implies --mashsketch_n --mashsketch_k 9
                           . Default: false

--mashsketch_z           : <text> Alphabet to base hashes on (case
                           ignored by default; see --mashsketch_Z). K-
                           mers with other characters will be ignored
                           . Implies --mashsketch_n. Default: false

--mashsketch_Z           : Preserve case in k-mers and alphabet (case
                           is ignored by default). Sequence letters
                           whose case is not in the current alphabet
                           will be skipped when sketching. Default:
                           false

Help options             :

--help                   : Display this message.

```
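
The help text above describes passing multiple serovars to `--wcomp_i` or `--wsnp_i` as a single `!`-separated value. Such a value contains characters that are special to the shell (`!`, `[`, `]`, and spaces), so single-quoting each serovar is the safest way to type it. The sketch below builds the joined string; `join_serovars` is a hypothetical helper and not part of `cpipes`.

```bash
# Join all arguments into one string, using "!" as the separator.
join_serovars() {
    local IFS='!'
    printf '%s\n' "$*"
}

# Single quotes keep the shell from interpreting "!", "[", "]" and spaces.
serovars="$(join_serovars 'I 4,[5],12:i:-' 'Agona')"
printf '%s\n' "$serovars"   # prints: I 4,[5],12:i:-!Agona
```

The result can then be passed as one double-quoted argument, for example `--wcomp_i "$serovars"`, so that the embedded space does not split it into two words.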
