csp2: CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/docs/ToolDescriptions.txt annotate

annotate CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/docs/ToolDescriptions.txt @ 68:5028fdace37b

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d

author	jpayne
date	Tue, 18 Mar 2025 16:23:26 -0400

rev	line source
jpayne@68	1 Concise descriptions of BBTools.
jpayne@68	2 For complete documentation of a specific tool, please see its shellscript, and its guide if available.
jpayne@68	3
jpayne@68	4
jpayne@68	5
jpayne@68	6 Note on threads:
jpayne@68	7
jpayne@68	8 Virtually all BBTools are multithreaded. If a description indicates that a tool is singlethreaded, that generally means there is only 1 worker thread. File input and output are usually in separate threads, so a "singlethreaded" program like ReformatReads may be observed using over 250% of the resources of a single core (in other words, 2.5 threads on average, with 1 input file and 1 output file). Programs listed as multithreaded, on the other hand, will automatically use all available threads (meaning the number of logical processors) unless restricted. Most multithreaded tools scale near-linearly with the number of cores up to at least 32.
jpayne@68	9
jpayne@68	10 Note on memory:
jpayne@68	11
jpayne@68	12 The memory usage classification of "low" or "high" is based on assumptions; with the exception of AssemblyStats (which uses a fixed amount of memory), the actual amount of memory needed varies based on the parameters and input files. While all programs can be forced to use a specific amount of memory with the -Xmx flag, the tools classified as low memory will try to grab only a small amount of memory by default when run via the shellscript, while the ones listed as high memory will try to grab all available memory.
jpayne@68	13
jpayne@68	14
jpayne@68	15 Alignment and Coverage-Related
jpayne@68	16
jpayne@68	17 Name: align2.BBMap
jpayne@68	18 Shellscript: bbmap.sh, removehuman.sh, removehuman2.sh, mapnt.sh
jpayne@68	19 Description: Fast and accurate splice-aware read aligner for DNA and RNA. Finds optimal global alignments. Maximum read length is 600bp.
jpayne@68	20 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference; roughly 6 bytes per base, or 3 bytes per base with the flag "usemodulo".
jpayne@68	21 Additional Shellscripts: removehuman.sh calls BBMap with a prebuilt index and parameters designed to remove human contamination with zero false-positives; removehuman2.sh is designed to minimize false-negatives at the expense of allowing some false-positives. mapnt.sh calls BBMap with a prebuilt index and parameters designed to allow mapping to nt while running on a 120GB node. All of these are designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
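
The rule of thumb in the BBMap notes above (roughly 6 bytes per reference base, or 3 with "usemodulo") can be turned into a quick estimate. The sketch below is only that arithmetic; the function name and the example reference size are hypothetical, and real usage also depends on thread count and other parameters.

```python
def estimate_bbmap_index_bytes(reference_bases: int, usemodulo: bool = False) -> int:
    """Rough index size from the rule of thumb: 6 bytes/base, or 3 with usemodulo."""
    if reference_bases < 0:
        raise ValueError("reference_bases must be non-negative")
    bytes_per_base = 3 if usemodulo else 6
    return reference_bases * bytes_per_base


# Hypothetical 3.1 Gbp reference, used only to illustrate the arithmetic.
reference_bases = 3_100_000_000
for flag in (False, True):
    gigabytes = estimate_bbmap_index_bytes(reference_bases, usemodulo=flag) / 1e9
    print(f"usemodulo={flag}: about {gigabytes:.1f} GB")
```

An estimate like this is only a starting point for choosing an -Xmx value; stats.sh is described below as able to estimate BBMap's memory requirements for an assembly.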
jpayne@68	22
jpayne@68	23 Name: align2.BBMapPacBio
jpayne@68	24 Shellscript: mapPacBio.sh
jpayne@68	25 Description: Version of BBMap for long reads up to 6kbp. Designed for PacBio and Nanopore reads; uses alignment penalties weighted for PacBio's error model.
jpayne@68	26 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.
jpayne@68	27
jpayne@68	28 Name: align2.BBMapPacBioSkimmer
jpayne@68	29 Shellscript: bbmapskimmer.sh
jpayne@68	30 Description: Version of BBMap for mapping reads to all sites above a certain score threshold, rather than finding the single best mapping location. Uses alignment penalties weighted for PacBio's error model, as it was originally created to map Illumina reads to PacBio reads for error-correction.
jpayne@68	31 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.
jpayne@68	32
jpayne@68	33 Name: align2.BBSplit
jpayne@68	34 Shellscript: bbsplit.sh
jpayne@68	35 Description: Uses BBMap to map to multiple references simultaneously, and output one file per reference, containing all the reads that match it better than the other references. Used for metagenomic binning, distinguishing between closely-related organisms, and contamination removal.
jpayne@68	36 Notes: See BBMap.
jpayne@68	37
jpayne@68	38 Name: align2.BBWrap
jpayne@68	39 Shellscript: bbwrap.sh
jpayne@68	40 Description: Allows multiple runs of BBMap on different input files without reloading the reference. Useful when the reference is very large.
jpayne@68	41 Notes: See BBMap.
jpayne@68	42
jpayne@68	43 Name: jgi.CoveragePileup
jpayne@68	44 Shellscript: pileup.sh
jpayne@68	45 Description: Calculates coverage information from an unsorted or sorted sam or bam file. Outputs per-scaffold coverage, per-base coverage, binned coverage, normalized coverage, per-ORF coverage (using PRODIGAL's format), coverage histograms, stranded coverage, physical coverage, FPKMs, and various others.
jpayne@68	46 Notes: Singlethreaded, high memory. TODO: Would not be overly difficult to make a multithreaded version using A_SampleMT, but would require locks or queues.
jpayne@68	47
jpayne@68	48 Name: driver.SummarizeCoverage
jpayne@68	49 Shellscript: summarizescafstats.sh
jpayne@68	50 Description: Summarizes the scafstats output of BBMap for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. See also BBMap, Pileup, SummarizeSealStats.
jpayne@68	51 Notes: Singlethreaded, low memory.
jpayne@68	52
jpayne@68	53 Name: jgi.FilterByCoverage
jpayne@68	54 Shellscript: filterbycoverage.sh
jpayne@68	55 Description: Filters an assembly by contig coverage, to remove contigs below a coverage cutoff, or with fewer than some percent of their bases covered. Uses coverage stats produced by BBMap or Pileup.
jpayne@68	56 Notes: Singlethreaded, low memory.
jpayne@68	57
jpayne@68	58 Name: driver.MergeCoverageOTU
jpayne@68	59 Shellscript: mergeOTUs.sh
jpayne@68	60 Description: Merges coverage stats lines (from Pileup) for the same OTU, according to some custom naming scheme. See also CoveragePileup.
jpayne@68	61 Notes: Singlethreaded, low memory.
jpayne@68	62
jpayne@68	63 Name: jgi.SamToEst
jpayne@68	64 Shellscript: bbest.sh
jpayne@68	65 Description: Calculates EST (expressed sequence tags) capture by an assembly from a sam file. Designed to use BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered
jpayne@68	66 Notes: Singlethreaded, low memory.
jpayne@68	67
jpayne@68	68 Name: assemble.Postfilter
jpayne@68	69 Shellscript: postfilter.sh
jpayne@68	70 Description: Maps reads, then filters an assembly by contig coverage. Intended to reduce misassembly rate of SPAdes by removing suspicious contigs. See also BBMap and FilterByCoverage.
jpayne@68	71 Notes: Multithreaded, high memory.
jpayne@68	72
jpayne@68	73
jpayne@68	74 Kmer Matching
jpayne@68	75
jpayne@68	76 Name: jgi.BBDukF
jpayne@68	77 Shellscript: bbduk.sh
jpayne@68	78 Description: Multipurpose tool for read preprocessing, which does adapter-trimming, quality-trimming, contaminant filtering, entropy filtering, sequence masking, quality score recalibration, format conversion, histogram generation, barcode filtering, gc filtering, kmer cardinality estimation, and many similar tasks.
jpayne@68	79 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 20 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor); if no reference is loaded, little memory is needed.
jpayne@68	80
jpayne@68	81 Name: jgi.BBDuk2
jpayne@68	82 Shellscript: bbduk2.sh
jpayne@68	83 Description: Version of BBDuk that can do multiple kmer-based operations at once - left-trim, right-trim, filter, and mask.
jpayne@68	84 Notes: See BBDuk.
jpayne@68	85
jpayne@68	86 Name: jgi.Seal
jpayne@68	87 Shellscript: seal.sh
jpayne@68	88 Description: Performs high-speed alignment-free sequence quantification or binning, by counting the number of long kmers that match between a read and a set of reference sequences. Designed for RNA-seq versus a transcriptome, metagenomic binning and abundance analysis, quantifying contamination, and similar. Very similar to BBDuk except that Seal associates each kmer with multiple reference sequences instead of just one, so it is superior in situations where multiple reference sequences may share a kmer. Unlike BBSplit, this supports unlimited read length. Can generate per-scaffold coverage, FPKMs when mapping to a transcriptome, and so forth. Also supports taxonomic classification.
jpayne@68	89 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 30 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor).
jpayne@68	90
jpayne@68	91 Name: driver.SummarizeSealStats
jpayne@68	92 Shellscript: summarizeseal.sh
jpayne@68	93 Description: Summarizes the stats output of Seal for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. Also allows filtering of certain libraries to mask some classes of contamination. Because Seal supports arbitrarily-long sequences, this is a better choice than BBMap for evaluating assemblies. See also Seal, SummarizeCoverage.
jpayne@68	94 Notes: Singlethreaded, low memory.
jpayne@68	95
jpayne@68	96
jpayne@68	97 Kmer Counting
jpayne@68	98
jpayne@68	99 Name: jgi.LogLog
jpayne@68	100 Shellscript: loglog.sh
jpayne@68	101 Description: Estimates the number of unique kmers within a dataset to within ~10%.
jpayne@68	102 Notes: Multithreaded, low memory. This can also be done with other programs such as BBDuk by adding the loglog flag.
jpayne@68	103
jpayne@68	104 Name: jgi.KmerCountExact
jpayne@68	105 Shellscript: kmercountexact.sh
jpayne@68	106 Description: Counts kmers in sequence data. Capable of outputting the kmers and their counts as fasta or 2-column tsv, as well as a frequency histogram. No kmer length limits.
jpayne@68	107 Notes: Multithreaded, high memory.
jpayne@68	108
jpayne@68	109 Name: jgi.KmerNormalize (generally referred to as BBNorm)
jpayne@68	110 Shellscript: bbnorm.sh, ecc.sh, khist.sh
jpayne@68	111 Description: Uses a lossy data structure (count-min sketch) to perform kmer-based normalization, error-correction, and/or depth-binning on reads.
jpayne@68	112 Notes: Multithreaded, high memory. BBNorm will never run out of memory; rather, as the amount of data increases, the accuracy decreases. Therefore you should always use all available memory for best accuracy. The error correction by Tadpole is superior, but Tadpole can run out of memory with large datasets.
jpayne@68	113 Additional Shellscripts: KmerNormalize is called by 3 different shellscripts, which differ only in their default parameters (which can be overridden). bbnorm.sh does 2-pass normalization only; ecc.sh does error-correction only; and khist.sh only makes a kmer histogram, without ignoring the low-quality kmers (as is done by ecc and bbnorm). However, if you add the flag "ecc" to bbnorm.sh, it will do error-correction also, and so forth; with the same parameters they are all identical.
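
To illustrate why a count-min sketch never runs out of memory but loses accuracy as data grows, here is a minimal sketch of the data structure. It is not BBNorm's implementation: the class, the hashing scheme, and the table sizes are assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Fixed-size counter table; colliding kmers can only inflate estimates."""

    def __init__(self, width: int, depth: int = 3):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, kmer: str):
        # One independent hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer: str) -> None:
        for row, col in self._cells(kmer):
            self.rows[row][col] += 1

    def count(self, kmer: str) -> int:
        # The minimum across rows is the least-contaminated estimate.
        return min(self.rows[row][col] for row, col in self._cells(kmer))


def kmers(sequence: str, k: int):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sketch = CountMinSketch(width=64)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
print(sketch.count("ACGT"))  # at least 2, the true count
```

Memory is fixed by width and depth regardless of input size. Estimates are never below the true count, and collisions (hence overestimates) become more frequent as more distinct kmers are added, which is why more memory gives better accuracy.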
jpayne@68	114
jpayne@68	115 Name: jgi.CalcUniqueness
jpayne@68	116 Shellscript: bbcountunique.sh
jpayne@68	117 Description: Generates a kmer uniqueness histogram, binned by file position. Designed to analyze library complexity, and determine how much sequencing is needed before reaching saturation. Outputs both single-read uniqueness and pair uniqueness.
jpayne@68	118 Notes: Singlethreaded, high memory (around 100 bytes per read pair).
jpayne@68	119
jpayne@68	120 Name: jgi.SmallKmerFrequency
jpayne@68	121 Shellscript: commonkmers.sh
jpayne@68	122 Description: Prints the most common kmers in a sequence, their counts, and the sequence header. K is limited to 15.
jpayne@68	123 Notes: Singlethreaded, low memory. Memory is proportional to 4^k, and is trivial for short kmers under 10.
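
The 4^k scaling mentioned in the notes above is easy to check numerically. In this sketch the counter width of 4 bytes is an assumption made for illustration, not a documented property of SmallKmerFrequency.

```python
def counter_array_bytes(k: int, bytes_per_counter: int = 4) -> int:
    """Size of a dense array holding one counter per possible kmer of length k."""
    # bytes_per_counter=4 is an assumed counter width, for illustration only.
    return (4 ** k) * bytes_per_counter


for k in (6, 10, 15):
    mebibytes = counter_array_bytes(k) / 2**20
    print(f"k={k}: {4 ** k:,} possible kmers, about {mebibytes:,.2f} MiB")
```

Each additional base multiplies the table size by four, so the cost is negligible for short kmers and grows quickly toward the k=15 limit.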
jpayne@68	124
jpayne@68	125 Name: jgi.KmerCoverage
jpayne@68	126 Shellscript: kmercoverage.sh
jpayne@68	127 Description: Annotates reads with their kmer depth.
jpayne@68	128 Notes: Deprecated. Multithreaded, high memory.
jpayne@68	129
jpayne@68	130 Name: jgi.CallPeaks
jpayne@68	131 Shellscript: callpeaks.sh
jpayne@68	132 Description: Calls peaks from a kmer frequency histogram, such as that from BBNorm or KmerCountExact. Also estimates genome size and other statistics.
jpayne@68	133 Notes: Singlethreaded, low memory. Normally called automatically by programs that make the histogram. The peak-calling logic is not very sophisticated and could be improved.
jpayne@68	134
jpayne@68	135
jpayne@68	136 Assembly
jpayne@68	137
jpayne@68	138 Name: assemble.Tadpole
jpayne@68	139 Shellscript: tadpole.sh
jpayne@68	140 Description: Very fast kmer-based assembler, designed for haploid organisms. Performs well on single cells, viruses, organelles, and in other situations with small genomes and potentially uneven or very high coverage. Also has modes for read error-correction and extension, instead of assembly; Tadpole's error-correction is superior to BBNorm's. No upper limit on kmer length. See also KmerCountExact, KmerCompressor, LogLog, BBMerge, KmerNormalize.
jpayne@68	141 Notes: Multithreaded, high memory. Memory consumption is a strict function of the number of unique input kmers.
jpayne@68	142
jpayne@68	143 Name: assemble.TadpoleWrapper
jpayne@68	144 Shellscript: tadwrapper.sh
jpayne@68	145 Description: Generates multiple assemblies with Tadpole to estimate the optimal kmer length.
jpayne@68	146 Notes: Multithreaded, high memory.
jpayne@68	147
jpayne@68	148 Name: assemble.KmerCompressor
jpayne@68	149 Shellscript: kcompress.sh
jpayne@68	150 Description: Generates a minimal fasta file containing each kmer from the input sequence exactly once. Optionally allows the inclusion only of kmers within a certain depth range. Arbitrary kmer set operations are possible via multiple passes. Very similar to an assembler.
jpayne@68	151 Notes: Multithreaded, high memory. Contains a singlethreaded phase.
jpayne@68	152
jpayne@68	153 Name: jgi.AssemblyStats2
jpayne@68	154 Shellscript: stats.sh
jpayne@68	155 Description: Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc. Also generates per-scaffold length and base content statistics, and can estimate BBMap's memory requirements for an assembly. See also StatsWrapper.
jpayne@68	156 Notes: Singlethreaded, low memory.
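
For readers unfamiliar with the N50 and L50 statistics named above, the following sketch computes them from a list of scaffold lengths. It uses the common convention in which N50 is a length and L50 is a count; how stats.sh labels these values in its own output is not stated here and should be checked against that output.

```python
def n50_l50(lengths):
    """Return (N50, L50) under the common convention.

    N50: length of the scaffold at which the running total, taken from
    longest to shortest, first reaches half of the assembly size.
    L50: number of scaffolds needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffolds given")


# Hypothetical scaffold lengths.
print(n50_l50([80, 70, 50, 40, 30, 20]))
```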
jpayne@68	157
jpayne@68	158 Name: jgi.AssemblyStatsWrapper
jpayne@68	159 Shellscript: statswrapper.sh
jpayne@68	160 Description: Generates stats on multiple assemblies, allowing tab-delimited columns with one assembly per row, and only one header.
jpayne@68	161 Notes: Singlethreaded, low memory.
jpayne@68	162
jpayne@68	163 Name: jgi.CountGC
jpayne@68	164 Shellscript: countgc.sh
jpayne@68	165 Description: Counts GC content of reads or scaffolds.
jpayne@68	166 Notes: Deprecated; superseded by AssemblyStats.
jpayne@68	167
jpayne@68	168 Name: jgi.FungalRelease
jpayne@68	169 Shellscript: fungalrelease.sh
jpayne@68	170 Description: Reformats a fungal assembly for release. Also creates contig and agp files.
jpayne@68	171 Notes: Singlethreaded, low memory.
jpayne@68	172
jpayne@68	173
jpayne@68	174 Taxonomy
jpayne@68	175
jpayne@68	176 Name: tax.FilterByTaxa
jpayne@68	177 Shellscript: filterbytaxa.sh
jpayne@68	178 Description: Filters sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. Relies on NCBI taxdump processed using taxtree.sh and gitable.sh.
jpayne@68	179 Notes: Singlethreaded, low memory.
jpayne@68	180
jpayne@68	181 Name: tax.RenameGiToNcbi
jpayne@68	182 Shellscript: gi2taxid.sh
jpayne@68	183 Description: Renames sequences with gi numbers to NCBI taxa IDs. This allows taxonomy processing without a gi number lookup.
jpayne@68	184 Notes: Singlethreaded, high memory. TODO: Can be made low memory if slightly altered to accept gitable.int1d files.
jpayne@68	185
jpayne@68	186 Name: tax.GiToNcbi
jpayne@68	187 Shellscript: gitable.sh
jpayne@68	188 Description: Condenses gi_taxid_nucl.dmp from NCBI taxdmp to gitable.int1d, a more efficient representation, used by other tools for translating gi numbers to taxIDs. See also TaxTree.
jpayne@68	189 Notes: Singlethreaded, high memory.
jpayne@68	190
jpayne@68	191 Name: tax.SortByTaxa
jpayne@68	192 Shellscript: sortbytaxa.sh
jpayne@68	193 Description: Sorts sequences into taxonomic order by some depth-first traversal of the Tree of Life as defined by NCBI taxdump. Sequences must be labeled with taxonomic identifiers.
jpayne@68	194 Notes: Singlethreaded, high memory.
jpayne@68	195
jpayne@68	196 Name: tax.SplitByTaxa
jpayne@68	197 Shellscript: splitbytaxa.sh
jpayne@68	198 Description: Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name.
jpayne@68	199 Notes: Multithreaded, high memory. If the number of threads is restricted and the sequences are fairly short, regardless of the total number, this may be run using low memory.
jpayne@68	200
jpayne@68	201 Name: tax.PrintTaxonomy
jpayne@68	202 Shellscript: taxonomy.sh
jpayne@68	203 Description: Prints the full taxonomy of a given taxonomic identifier (such as homo_sapiens).
jpayne@68	204 Notes: Singlethreaded, low memory.
jpayne@68	205
jpayne@68	206 Name: tax.TaxTree
jpayne@68	207 Shellscript: taxtree.sh
jpayne@68	208 Description: Creates tree.taxtree from names.dmp and nodes.dmp, which are in the NCBI taxdump. The taxtree file is needed for programs that can deal with taxonomy, like Seal and SortByTaxa.
jpayne@68	209 Notes: Singlethreaded, high memory.
jpayne@68	210
jpayne@68	211 Name: driver.ReduceSilva
jpayne@68	212 Shellscript: reducesilva.sh
jpayne@68	213 Description: Reduces Silva entries down to one entry per specified taxonomic level. Designed to increase the efficiency of operations like mapping, in which having thousands of substrains represented is not helpful.
jpayne@68	214 Notes: Singlethreaded, low memory.
jpayne@68	215
jpayne@68	216
jpayne@68	217 Cross-Contamination
jpayne@68	218
jpayne@68	219 Name: jgi.SynthMDA
jpayne@68	220 Shellscript: synthmda.sh
jpayne@68	221 Description: Generates synthetic reads following an MDA-amplified single cell's coverage distribution. Designed for single-cell assembly and analysis optimization. See also CrossContaminate, RandomReads.
jpayne@68	222 Notes: Singlethreaded, medium memory (needs around 4GB).
jpayne@68	223
jpayne@68	224 Name: jgi.CrossContaminate
jpayne@68	225 Shellscript: crosscontaminate.sh
jpayne@68	226 Description: Generates synthetic cross-contaminated files from clean files. Intended for use with synthetic reads generated by SynthMDA or RandomReads. Designed to evaluate the effects of cross-contamination on assembly, and the efficacy of decontamination methods.
jpayne@68	227 Notes: Singlethreaded, high memory.
jpayne@68	228
jpayne@68	229 Name: jgi.DecontaminateByNormalization
jpayne@68	230 Shellscript: decontaminate.sh, crossblock.sh
jpayne@68	231 Description: Removes contaminant contigs from assemblies of multiplexed libraries via normalization and mapping.
jpayne@68	232 Notes: Multithreaded, high memory. Mostly a wrapper for other programs like BBMap, BBNorm, and FilterByCoverage.
jpayne@68	233
jpayne@68	234
jpayne@68	235 Deduplication and Clustering
jpayne@68	236
jpayne@68	237 Name: jgi.Dedupe
jpayne@68	238 Shellscript: dedupe.sh
jpayne@68	239 Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, fully contained subsequences, or subsequences within some edit distance. Can also find overlapping sequences and group them into clusters based on transitive reachability; for example, clustering full-length 16S PacBio reads by species.
jpayne@68	240 Notes: Multithreaded, high memory. This program has a jni mode which increases speed dramatically if an edit distance is used.
jpayne@68	241
jpayne@68	242 Name: jgi.Dedupe2
jpayne@68	243 Shellscript: dedupe2.sh
jpayne@68	244 Description: Allows more kmer seeds than Dedupe. This will be automatically called by Dedupe if needed.
jpayne@68	245 Notes: See Dedupe.
jpayne@68	246
jpayne@68	247 Name: jgi.DedupeByMapping
jpayne@68	248 Shellscript: dedupebymapping.sh
jpayne@68	249 Description: Removes duplicate reads or read pairs from a sam/bam file based on mapping coordinates. The sam file does not need to be sorted.
jpayne@68	250 Notes: Singlethreaded, high memory.
jpayne@68	251
jpayne@68	252 Name: clump.Clumpify
jpayne@68	253 Shellscript: clumpify.sh
jpayne@68	254 Description: Rearranges unsorted reads into small clumps of reads, such that each clump shares a kmer, and thus probably overlaps. Can also create consensus sequence from these clumps.
jpayne@68	255 Notes: Multithreaded, low or high memory. Memory consumption may be made arbitrarily small by using a user-specified number of temp files for bucket-sorting. By default, it will try to grab all available memory.
jpayne@68	256
jpayne@68	257
jpayne@68	258 Read Merging
jpayne@68	259
jpayne@68	260 Name: jgi.BBMerge
jpayne@68	261 Shellscript: bbmerge.sh, bbmerge-auto.sh
jpayne@68	262 Description: Merges paired reads into single reads by overlap detection. With sufficient coverage, can also merge nonoverlapping reads by kmer extension.
jpayne@68	263 Notes: Multithreaded, low memory. If kmers are used (for extension or error-correction), it will need much more memory, and the shellscript bbmerge-auto.sh should be used, which tries to acquire all available RAM. This program has a jni mode which increases speed by around 20%.
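
Merging by overlap detection can be sketched as follows: reverse-complement the second read, then look for the longest suffix of read 1 that matches a prefix of that mate. This is a simplified illustration, not BBMerge's algorithm; the function names, the min_overlap and max_mismatches parameters, and the example fragment are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    return sequence.translate(COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10, max_mismatches: int = 0):
    """Merge a read pair if read1's tail overlaps the reverse-complemented read2.

    Tries the longest candidate overlap first; returns None if no overlap of
    at least min_overlap bases has max_mismatches or fewer mismatches.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return read1 + mate[overlap:]
    return None


# A hypothetical 30 bp insert sequenced as two 20 bp reads (10 bp overlap).
fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
print(merge_pair(read1, read2) == fragment)
```

A real merger also has to weigh base qualities and guard against spurious overlaps in repetitive sequence, which this sketch ignores.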
jpayne@68	264
jpayne@68	265 Name: jgi.MateReadsMT
jpayne@68	266 Shellscript: bbmergegapped.sh
jpayne@68	267 Description: Uses gapped kmers to merge nonoverlapping reads.
jpayne@68	268 Notes: Deprecated; superseded by BBMerge.
jpayne@68	269
jpayne@68	270
jpayne@68	271 Synthetic Read Generation and Benchmarking
jpayne@68	272
jpayne@68	273 Name: align2.RandomReads3
jpayne@68	274 Shellscript: randomreads.sh
jpayne@68	275 Description: Generates random synthetic reads from a reference genome, annotated with their genomic origin. Allows precise customization of things like insert size and synthetic mutation type, sizes, and rates. Read names are parsed by various other BBTools to grade accuracy.
jpayne@68	276 Notes: Singlethreaded, high memory.
jpayne@68	277
jpayne@68	278 Name: jgi.FakeReads
jpayne@68	279 Shellscript: bbfakereads.sh
jpayne@68	280 Description: Generates fake read pairs from ends of contigs or single reads. Intended for use in generating a fake LMP library for scaffolding, using additional information like another assembly, or very long reads (like PacBio). This can also be accomplished with RandomReads.
jpayne@68	281 Notes: Singlethreaded, low memory.
jpayne@68	282
jpayne@68	283 Name: align2.GradeSamFile
jpayne@68	284 Shellscript: gradesam.sh
jpayne@68	285 Description: Grades the accuracy of an aligner (such as BBMap) by parsing the output. The reads must be single-ended and annotated as though generated by RandomReads.
jpayne@68	286 Notes: Singlethreaded, low memory.
jpayne@68	287
jpayne@68	288 Name: align2.MakeRocCurve
jpayne@68	289 Shellscript: samtoroc.sh
jpayne@68	290 Description: Creates an ROC plot (technically, true-positive versus false-positive) from a sam or bam file of mapped reads. The reads should be single-ended with headers generated by RandomReads.
jpayne@68	291 Notes: Singlethreaded, low memory.
jpayne@68	292
jpayne@68	293 Name: jgi.AddAdapters
jpayne@68	294 Shellscript: addadapters.sh
jpayne@68	295 Description: Randomly adds adapters to a file, or grades a trimmed file. The input is a set of reads, paired or unpaired. The output is those same reads with adapter sequence replacing some of the bases in some reads. For paired reads, adapters are located in the same position in read1 and read2. This is designed for benchmarking adapter-trimming software (such as BBDuk), and evaluating methodology. Adapters can alternately be added by RandomReads, in which case insert size is used to determine where the adapters go.
jpayne@68	296 Notes: Singlethreaded, low memory.
jpayne@68	297
jpayne@68	298 Name: jgi.GradeMergedReads
jpayne@68	299 Shellscript: grademerge.sh
jpayne@68	300 Description: Grades the accuracy of a read-merging program (such as BBMerge) by parsing the output. The reads must be annotated by their insert size. This can be done by generating them with RandomReads and renaming with RenameReads.
jpayne@68	301 Notes: Singlethreaded, low memory.
jpayne@68	302
jpayne@68	303 Name: align2.PrintTime
jpayne@68	304 Shellscript: printtime.sh
jpayne@68	305 Description: Prints time elapsed since last called on the same file.
jpayne@68	306 Notes: Singlethreaded, low memory.
jpayne@68	307
jpayne@68	308
jpayne@68	309 16S, Primers, and Amplicons
jpayne@68	310
jpayne@68	311 Name: jgi.FindPrimers
jpayne@68	312 Shellscript: msa.sh
jpayne@68	313 Description: Aligns a query sequence to reference sequences. Outputs the best matching position per reference sequence. If there are multiple queries, only the best-matching query will be used. Designed to find primer binding sites in a sequence that may contain indels, such as a PacBio read, using a MultiStateAligner.
jpayne@68	314 Notes: Singlethreaded, high memory. TODO: Could easily be made multithreaded using A_SampleMT.
jpayne@68	315
jpayne@68	316 Name: jgi.CutPrimers
jpayne@68	317 Shellscript: cutprimers.sh
jpayne@68	318 Description: Cuts out sequences corresponding to primers identified in sam files. Used in conjunction with FindPrimers (msa.sh).
jpayne@68	319 Notes: Singlethreaded, low memory.
jpayne@68	320
jpayne@68	321 Name: jgi.IdentityMatrix
jpayne@68	322 Shellscript: idmatrix.sh
jpayne@68	323 Description: Generates an identity matrix via all-to-all alignment of sequences in a file. Intended for 16S or other amplicon analysis. See also CorrelateIdentity.
jpayne@68	324 Notes: Multithreaded, high memory. Time complexity is O(N^2) with the number of reads.
jpayne@68	325
jpayne@68	326 Name: driver.CorrelateIdentity
jpayne@68	327 Shellscript: matrixtocolumns.sh
jpayne@68	328 Description: Transforms two matched identity matrices into 2-column format, one row per entry, one column per matrix. Designed for comparing different 16S subregions. See also IdentityMatrix, FindPrimers.
jpayne@68	329 Notes: Singlethreaded, high memory. The actual amount of memory just depends on the matrix sizes.
jpayne@68	330
jpayne@68	331
jpayne@68	332 Barcodes
jpayne@68	333
jpayne@68	334 Name: jgi.CountBarcodes
jpayne@68	335 Shellscript: countbarcodes.sh
jpayne@68	336 Description: Counts the number of reads with each barcode. Assumes read names have the barcode at the end.
jpayne@68	337 Notes: Singlethreaded, low memory.
jpayne@68	338
jpayne@68	339 Name: jgi.CorrelateBarcodes
jpayne@68	340 Shellscript: filterbarcodes.sh
jpayne@68	341 Description: Filters barcodes by quality, and generates quality histograms. See also MergeBarcodes.
jpayne@68	342 Notes: Singlethreaded, low memory.
jpayne@68	343
jpayne@68	344 Name: jgi.MergeBarcodes
jpayne@68	345 Shellscript: mergebarcodes.sh
jpayne@68	346 Description: Concatenates barcodes and barcode quality onto read names. Designed to analyze the effects of barcode quality on library misassignment. See also CorrelateBarcodes.
jpayne@68	347 Notes: Singlethreaded, low memory.
jpayne@68	348
jpayne@68	349 Name: jgi.RemoveBadBarcodes
jpayne@68	350 Shellscript: removebadbarcodes.sh
jpayne@68	351 Description: Removes reads with improper barcodes - either with no barcode, or a barcode containing a degenerate base.
jpayne@68	352 Notes: Singlethreaded, low memory. Mostly a test case for extending BBTool_ST.
jpayne@68	353
jpayne@68	354
jpayne@68	355 Filtering and Demultiplexing
jpayne@68	356
jpayne@68	357 Name: jgi.DemuxByName
jpayne@68	358 Shellscript: demuxbyname.sh
jpayne@68	359 Description: Demultiplexes reads into multiple files based on their name, by matching a suffix or prefix.
jpayne@68	360 Notes: Singlethreaded, low memory.
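
The idea of demultiplexing by a name suffix can be shown in a few lines. This is a conceptual sketch only, not how demuxbyname.sh is implemented or invoked; the read names and barcode suffixes are made up.

```python
from collections import defaultdict


def demux_by_suffix(read_names, suffixes):
    """Group read names by the first listed suffix that each name ends with."""
    bins = defaultdict(list)
    for name in read_names:
        for suffix in suffixes:
            if name.endswith(suffix):
                bins[suffix].append(name)
                break
        else:
            # No suffix matched this name.
            bins["unmatched"].append(name)
    return dict(bins)


# Hypothetical read names with a barcode at the end.
names = ["read1:ACGT", "read2:TTAG", "read3:ACGT", "read4:GGGG"]
for key, members in demux_by_suffix(names, ["ACGT", "TTAG"]).items():
    print(key, members)
```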
jpayne@68	361
jpayne@68	362 Name: jgi.FilterBySequence
jpayne@68	363 Shellscript: filterbysequence.sh
jpayne@68	364 Description: Filters reads by exact sequence match. Allows case-sensitive or insensitive matches, and reverse-complement matches or only forward matches.
jpayne@68	365 Notes: Multithreaded, high memory.
jpayne@68	366
jpayne@68	367 Name: driver.FilterReadsByName
jpayne@68	368 Shellscript: filterbyname.sh
jpayne@68	369 Description: Filters reads by name. Allows substring matching, though that is much slower.
jpayne@68	370 Notes: Singlethreaded, low memory.
jpayne@68	371
jpayne@68	372 Name: jgi.FilterReadsWithSubs
jpayne@68	373 Shellscript: filtersubs.sh
jpayne@68	374 Description: Filters a sam file to select only reads with substitution errors for bases with quality scores in a certain interval. Used for manually examining specific reads that may contain incorrectly calibrated quality scores.
jpayne@68	375 Notes: Singlethreaded, low memory.
jpayne@68	376
jpayne@68	377 Name: jgi.GetReads
jpayne@68	378 Shellscript: getreads.sh
jpayne@68	379 Description: Fetches the reads with specified numeric IDs (unrelated to their names). The first read (or pair) in a file has ID 0, the second read (or pair) has ID 1, etc.
jpayne@68	380 Notes: Singlethreaded, low memory.
jpayne@68	381
jpayne@68	382 Name: driver.EstherFilter
jpayne@68	383 Shellscript: estherfilter.sh
jpayne@68	384 Description: BLASTs queries against reference, and filters out hits with scores less than 'cutoff'.
jpayne@68	385 Notes: All the work is done by blastall, which dictates the performance characteristics.
jpayne@68	386
jpayne@68	387
jpayne@68	388 JGI-Exclusive Preprocessing Wrappers
jpayne@68	389
jpayne@68	390 Name: jgi.BBQC
jpayne@68	391 Shellscript: bbqc.sh
jpayne@68	392 Description: Wrapper for various read preprocessing operations.
jpayne@68	393 Notes: Deprecated; superseded by RQCFilter. Designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
jpayne@68	394
jpayne@68	395 Name: jgi.RQCFilter
jpayne@68	396 Shellscript: rqcfilter.sh
jpayne@68	397 Description: Acts as a wrapper/pipeline for read preprocessing. Performs quality-trimming, artifact removal, linker-trimming, adapter trimming, spike-in removal, vertebrate contaminant removal, microbial contaminant removal, and generates various histogram and statistics files used by RQC.
jpayne@68	398 Notes: Multithreaded, high memory. Currently requires 39500m RAM and thus can run on a 40G node, but it's recommended to submit it exclusive, as all stages are fully multithreaded. Designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
jpayne@68	399
jpayne@68	400
jpayne@68	401 Shredding and Sorting
jpayne@68	402
jpayne@68	403 Name: jgi.Shred
jpayne@68	404 Shellscript: shred.sh
jpayne@68	405 Description: Shreds long sequences into shorter sequences, with overlap length and variable-length options. See also Fuse.
jpayne@68	406 Notes: Singlethreaded, low memory.
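
Shredding with an overlap amounts to sliding a fixed-length window along the sequence with a step of length minus overlap. The sketch below shows that idea under assumed parameter names; it does not reproduce shred.sh's options or its handling of the final piece.

```python
def shred(sequence: str, length: int, overlap: int = 0):
    """Cut a sequence into pieces of at most `length`, each sharing `overlap`
    bases with the previous piece. The last piece may be shorter."""
    if length <= 0 or not 0 <= overlap < length:
        raise ValueError("need length > 0 and 0 <= overlap < length")
    if not sequence:
        return []
    step = length - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(sequence[start:start + length])
        if start + length >= len(sequence):
            break  # this piece reached the end of the sequence
        start += step
    return pieces


print(shred("ABCDEFGHIJ", length=4, overlap=2))
print(shred("ABCDEFGHIJ", length=4))
```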
jpayne@68	407
jpayne@68	408 Name: jgi.FuseSequence
jpayne@68	409 Shellscript: fuse.sh
jpayne@68	410 Description: Fuses sequences together, padding junctions with Ns. Does not support total length greater than 2 billion. Designed for use with Seal or BBDuk to make kmer tracking for a given genome more efficient. See also Shred.
jpayne@68	411 Notes: Singlethreaded, high memory.
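As a rough illustration of fusing with N-padding, the Python sketch below joins sequences with a run of Ns at each junction. The function name fuse and the default pad length are assumptions made for this example and do not reflect fuse.sh's defaults.

```python
def fuse(seqs, pad=10):
    """Join sequences into one, placing `pad` Ns at every junction.

    Hypothetical helper; the pad length of 10 is an arbitrary choice.
    """
    if pad < 0:
        raise ValueError("pad must be non-negative")
    return ("N" * pad).join(seqs)
```

With any pad of at least 1, every window that spans a junction contains an N, so it can be told apart from windows lying entirely inside one original sequence.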
jpayne@68	412
jpayne@68	413 Name: jgi.Shuffle
jpayne@68	414 Shellscript: shuffle.sh
jpayne@68	415 Description: Reorders reads randomly, keeping pairs together. Also supports some sorting operations, like alphabetically by name or by sequence.
jpayne@68	416 Notes: Singlethreaded, high memory. All operations are in-memory.
jpayne@68	417
jpayne@68	418
jpayne@68	419 Non-Sequence-Related
jpayne@68	420
jpayne@68	421 Name: Calcmem - Shellscript Only
jpayne@68	422 Shellscript: calcmem.sh
jpayne@68	423 Description: Calculates available memory for other shellscripts. Designed for Genepool but works fine on many Linux configurations.
jpayne@68	424 Notes: If java is being killed for allocating too much memory, this is the script to fix.
jpayne@68	425
jpayne@68	426 Name: fileIO.TextFile
jpayne@68	427 Shellscript: textfile.sh
jpayne@68	428 Description: Displays contents of a text file, optionally between a start and stop line. Useful mainly in Windows where there are few command-line utilities.
jpayne@68	429 Notes: Singlethreaded, low memory.
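For readers without textfile.sh at hand, the same kind of line-range display can be approximated in Python. This sketch assumes a zero-based start line and an exclusive stop line; whether TextFile numbers lines the same way is not stated here, and the function name show_lines is hypothetical.

```python
from itertools import islice


def show_lines(path, start=0, stop=None):
    """Return lines start..stop-1 of a text file (zero-based, stop exclusive).

    Assumed indexing convention; check the shellscript for TextFile's own.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, start, stop)]
```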
jpayne@68	430
jpayne@68	431 Name: driver.CountSharedLines
jpayne@68	432 Shellscript: countsharedlines.sh
jpayne@68	433 Description: Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared. This is not designed for sequence data, but more for things like sequence names or organism names. To actually filter lines rather than count them, see filterlines.sh.
jpayne@68	434 Notes: Singlethreaded, low memory.
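The report layout described above (one report per 'in1' file, one entry per 'in2' file) can be sketched in Python as follows. The sketch works on in-memory lists instead of files, counts distinct lines after stripping whitespace, and uses a hypothetical function name; CountSharedLines may handle duplicates and whitespace differently.

```python
def count_shared_lines(in1, in2):
    """Count distinct lines shared between each in1 file and each in2 file.

    in1 and in2 map a file name to its list of lines. Returns
    {in1_name: {in2_name: shared_count}}. Hypothetical helper.
    """
    sets2 = {name: {line.strip() for line in lines} for name, lines in in2.items()}
    report = {}
    for name1, lines in in1.items():
        set1 = {line.strip() for line in lines}
        report[name1] = {name2: len(set1 & set2) for name2, set2 in sets2.items()}
    return report
```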
jpayne@68	435
jpayne@68	436 Name: driver.FilterLines
jpayne@68	437 Shellscript: filterlines.sh
jpayne@68	438 Description: Filters lines by exact match or substring. This is not designed for sequence data, but for things like sequence names or organism names.
jpayne@68	439 Notes: Singlethreaded, low memory.
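A minimal Python sketch of filtering lines by exact match or substring is shown below. The function name filter_lines and the parameters substring and include are assumptions for this example, not the flags of filterlines.sh.

```python
def filter_lines(lines, names, substring=False, include=True):
    """Keep (include=True) or drop (include=False) lines that match `names`.

    A line matches if it equals one of the names, or, when substring=True,
    if it contains one of them. Hypothetical helper.
    """
    names = set(names)

    def matches(line):
        if substring:
            return any(name in line for name in names)
        return line in names

    return [line for line in lines if matches(line) == include]
```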
jpayne@68	440
jpayne@68	441
jpayne@68	442 Other Tools
jpayne@68	443
jpayne@68	444 Name: jgi.A_SampleMT
jpayne@68	445 Shellscript: a_sample_mt.sh
jpayne@68	446 Description: Does nothing. Serves as a template for easily making new BBTools by dropping in code.
jpayne@68	447 Notes: Multithreaded, high memory. Be sure to modify the shellscript line " freeRam 4000m 84" as needed. The first value is the amount of memory used if available memory cannot be calculated; the second is the percentage of free memory to use if it can be calculated.
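How the two freeRam values interact can be illustrated with a small Python sketch. It mirrors only the rule stated in the note above (fixed fallback versus a percentage of free memory); the function name is hypothetical and the real logic lives in the shellscripts.

```python
def heap_request_mb(free_mb, fallback_mb=4000, percent=84):
    """Return the amount of memory (in MB) to request.

    free_mb is the detected free memory in MB, or None if it could not
    be calculated. Defaults mirror the sample line "freeRam 4000m 84".
    """
    if free_mb is None:
        return fallback_mb  # detection failed: use the fixed amount
    return free_mb * percent // 100  # otherwise take a share of free memory
```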
jpayne@68	448
jpayne@68	449 Name: jgi.BBMask
jpayne@68	450 Shellscript: bbmask.sh
jpayne@68	451 Description: Masks sequence regions that are low-complexity, contain repeat kmers, or are covered by mapped reads. Used to make masked versions of human, cat, dog, and mouse genomes; these are used for filtering vertebrate contamination from fungal/plant/microbial data without risk of false-positive removals.
jpayne@68	452 Notes: Multithreaded, high memory. Uses around 2 bytes per reference base.
jpayne@68	453
jpayne@68	454 Name: jgi.CalcTrueQuality
jpayne@68	455 Shellscript: calctruequality.sh
jpayne@68	456 Description: Generates matrices used for quality-score recalibration. Requires one or more mapped sam files as input. The actual recalibration is done with another program such as BBDuk.
jpayne@68	457 Notes: Multithreaded, low memory.
jpayne@68	458
jpayne@68	459 Name: jgi.MakeChimeras
jpayne@68	460 Shellscript: makechimeras.sh
jpayne@68	461 Description: Makes chimeric sequences by randomly fusing together nonchimeric sequences. Designed for analyzing chimera removal effectiveness.
jpayne@68	462 Notes: Singlethreaded, low memory.
jpayne@68	463
jpayne@68	464 Name: jgi.PhylipToFasta
jpayne@68	465 Shellscript: phylip2fasta.sh
jpayne@68	466 Description: Transforms interleaved phylip to fasta.
jpayne@68	467 Notes: Singlethreaded, high memory.
jpayne@68	468
jpayne@68	469 Name: jgi.MakeLengthHistogram
jpayne@68	470 Shellscript: readlength.sh
jpayne@68	471 Description: Makes a length histogram of sequences.
jpayne@68	472 Notes: Singlethreaded, low memory. Can also be accomplished with Reformat or BBDuk, but with less flexibility.
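Conceptually, a length histogram is just a count of sequences per length. The Python sketch below shows that idea on an in-memory list; the function name is hypothetical, and it omits the binning and file parsing a real tool would need.

```python
from collections import Counter


def length_histogram(seqs):
    """Return sorted (length, count) pairs for the given sequences."""
    return sorted(Counter(len(seq) for seq in seqs).items())
```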
jpayne@68	473
jpayne@68	474 Name: jgi.ReformatReads
jpayne@68	475 Shellscript: reformat.sh
jpayne@68	476 Description: Reformats sequence data into another format, such as interleaved ASCII-33 fastq to twin-file ASCII-64. Also supports a huge collection of simple optional operations, like trimming, filtering, reverse-complementing, modifying read names, and modifying read sequence.
jpayne@68	477 Notes: Singlethreaded, low memory.
jpayne@68	478
jpayne@68	479 Name: pacbio.RemoveAdapters2
jpayne@68	480 Shellscript: removesmartbell.sh
jpayne@68	481 Description: Detects or removes SMRTbell adapters from PacBio reads, by aligning the adapter using a customized version of the MultiStateAligner.
jpayne@68	482 Notes: Multithreaded, low memory.
jpayne@68	483
jpayne@68	484 Name: jgi.RenameReads
jpayne@68	485 Shellscript: rename.sh
jpayne@68	486 Description: Renames reads according to some specified prefix. Can also rename by insert size or mapping location.
jpayne@68	487 Notes: Singlethreaded, low memory.
jpayne@68	488
jpayne@68	489 Name: jgi.SplitPairsAndSingles
jpayne@68	490 Shellscript: repair.sh, bbsplitpairs.sh
jpayne@68	491 Description: Separates paired reads into files of pairs and singletons by removing reads that are shorter than a min length, or have no mate. Can also reorder arbitrarily-ordered reads in files where the pairing order was desynchronized. See also Reformat's vint flag.
jpayne@68	492 Notes: Singlethreaded, low or high memory. All operations are low-memory except reordering arbitrarily disordered files, which is optional.
jpayne@68	493
jpayne@68	494 Name: jgi.SplitNexteraLMP
jpayne@68	495 Shellscript: splitnextera.sh
jpayne@68	496 Description: Trims and splits Nextera LMP libraries into subsets based on linker orientation: LMP, fragment, unknown, and singleton.
jpayne@68	497 Notes: Singlethreaded, low memory. TODO: Should be reimplemented using A_SampleMT.
jpayne@68	498
jpayne@68	499 Name: jgi.SplitSamFile
jpayne@68	500 Shellscript: splitsam.sh
jpayne@68	501 Description: Splits a sam file into three files: Plus-mapped reads, Minus-mapped reads, and Unmapped.
jpayne@68	502 Notes: Singlethreaded, low memory.
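The three-way split can be expressed with the SAM FLAG field, where bit 0x4 marks an unmapped read and bit 0x10 marks a read mapped to the reverse strand. The Python sketch below is an illustration only: the function name is hypothetical, and skipping header lines is an assumption rather than SplitSamFile's documented behavior.

```python
def split_sam(lines):
    """Split SAM records into (plus, minus, unmapped) lists by FLAG bits."""
    plus, minus, unmapped = [], [], []
    for line in lines:
        if line.startswith("@"):
            continue  # header line; skipped in this sketch (assumption)
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x4:
            unmapped.append(line)
        elif flag & 0x10:
            minus.append(line)
        else:
            plus.append(line)
    return plus, minus, unmapped
```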
jpayne@68	503
jpayne@68	504 Name: fileIO.FileFormat
jpayne@68	505 Shellscript: testformat.sh
jpayne@68	506 Description: Tests the format of a sequence-containing file. Determines format (fasta, fastq, etc), quality encoding, compression type, interleaving, and read length. All BBTools use this to determine how to process a file.
jpayne@68	507 Notes: Singlethreaded, low memory.
jpayne@68	508
jpayne@68	509 Name: jgi.TranslateSixFrames
jpayne@68	510 Shellscript: translate6frames.sh
jpayne@68	511 Description: Translates nucleotide sequences to all 6 amino acid frames, or amino acids to a canonical nucleotide representation.
jpayne@68	512 Notes: Singlethreaded, low memory.
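Six-frame translation means translating the sequence at offsets 0, 1, and 2, then doing the same for its reverse complement. The Python sketch below shows this with the standard genetic code; the function names are hypothetical, unknown codons become X by assumption, and the reverse direction (amino acids to nucleotides) is not covered.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; trailing bases are ignored, unknown codons give X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for forward frames 0-2, then reverse frames 0-2."""
    seq = seq.upper()
    return [translate(strand[offset:])
            for strand in (seq, reverse_complement(seq))
            for offset in range(3)]
```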
jpayne@68	513
jpayne@68	514
jpayne@68	515 Template
jpayne@68	516
jpayne@68	517 Name:
jpayne@68	518 Shellscript:
jpayne@68	519 Description:
jpayne@68	520 Notes:
