comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/docs/ToolDescriptions.txt @ 68:5028fdace37b

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 16:23:26 -0400
67:0e9998148a16 68:5028fdace37b
1 Concise descriptions of BBTools.
2 For complete documentation of a specific tool, please see its shellscript, and its guide if available.
3
4
5
6 Note on threads:
7
8 Virtually all BBTools are multithreaded. If a description indicates that a tool is singlethreaded, that generally means there is only 1 worker thread. File input and output are usually in separate threads, so a "singlethreaded" program like ReformatReads may be observed using over 250% of the resources of a single core (in other words, 2.5 threads on average, with 1 input file and 1 output file). Programs listed as multithreaded, on the other hand, will automatically use all available threads (meaning the number of logical processors) unless restricted. Most multithreaded tools scale near-linearly with the number of cores up to at least 32.
9
10 Note on memory:
11
12 The memory usage classification of "low" or "high" is based on assumptions; with the exception of AssemblyStats (which uses a fixed amount of memory), the actual amount of memory needed varies based on the parameters and input files. While all programs can be forced to use a specific amount of memory with the -Xmx flag, the tools classified as low memory will try to grab only a small amount of memory by default when run via the shellscript, while the ones listed as high memory will try to grab all available memory.
13
14
15 Alignment and Coverage-Related
16
17 Name: align2.BBMap
18 Shellscript: bbmap.sh, removehuman.sh, removehuman2.sh, mapnt.sh
19 Description: Fast and accurate splice-aware read aligner for DNA and RNA. Finds optimal global alignments. Maximum read length is 600bp.
20 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference; roughly 6 bytes per base, or 3 bytes per base with the flag "usemodulo".
21 Additional Shellscripts: removehuman.sh calls BBMap with a prebuilt index and parameters designed to remove human contamination with zero false-positives; removehuman2.sh is designed to minimize false-negatives at the expense of allowing some false-positives. mapnt.sh calls BBMap with a prebuilt index and parameters designed to allow mapping to nt while running on a 120GB node. All of these are designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
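
The per-base figures in the Notes line above can be turned into a back-of-the-envelope estimate. This is only a sketch: the function name is hypothetical, the constants are the approximate ones quoted above, and actual usage may differ with other settings (stats.sh, described later, can report BBMap's own estimate for an assembly).

```python
def estimate_bbmap_index_bytes(reference_bases, usemodulo=False):
    """Approximate index memory from the rule of thumb in the Notes line.

    Hypothetical helper: about 6 bytes per reference base,
    or about 3 bytes per base with the "usemodulo" flag.
    """
    bytes_per_base = 3 if usemodulo else 6
    return reference_bases * bytes_per_base


# A reference of roughly 3.1 Gbp, reported in gigabytes.
reference = 3_100_000_000
print(estimate_bbmap_index_bytes(reference) / 1e9)
print(estimate_bbmap_index_bytes(reference, usemodulo=True) / 1e9)
```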
22
23 Name: align2.BBMapPacBio
24 Shellscript: mapPacBio.sh
25 Description: Version of BBMap for long reads up to 6kbp. Designed for PacBio and Nanopore reads; uses alignment penalties weighted for PacBio's error model.
26 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.
27
28 Name: align2.BBMapPacBioSkimmer
29 Shellscript: bbmapskimmer.sh
30 Description: Version of BBMap for mapping reads to all sites above a certain score threshold, rather than finding the single best mapping location. Uses alignment penalties weighted for PacBio's error model, as it was originally created to map Illumina reads to PacBio reads for error-correction.
31 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.
32
33 Name: align2.BBSplit
34 Shellscript: bbsplit.sh
35 Description: Uses BBMap to map to multiple references simultaneously, and output one file per reference, containing all the reads that match it better than the other references. Used for metagenomic binning, distinguishing between closely-related organisms, and contamination removal.
36 Notes: See BBMap.
37
38 Name: align2.BBWrap
39 Shellscript: bbwrap.sh
40 Description: Allows multiple runs of BBMap on different input files without reloading the reference. Useful when the reference is very large.
41 Notes: See BBMap.
42
43 Name: jgi.CoveragePileup
44 Shellscript: pileup.sh
45 Description: Calculates coverage information from an unsorted or sorted sam or bam file. Outputs per-scaffold coverage, per-base coverage, binned coverage, normalized coverage, per-ORF coverage (using PRODIGAL's format), coverage histograms, stranded coverage, physical coverage, FPKMs, and various others.
46 Notes: Singlethreaded, high memory. TODO: Would not be overly difficult to make a multithreaded version using A_SampleMT, but would require locks or queues.
47
48 Name: driver.SummarizeCoverage
49 Shellscript: summarizescafstats.sh
50 Description: Summarizes the scafstats output of BBMap for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. See also BBMap, Pileup, SummarizeSealStats.
51 Notes: Singlethreaded, low memory.
52
53 Name: jgi.FilterByCoverage
54 Shellscript: filterbycoverage.sh
55 Description: Filters an assembly by contig coverage, to remove contigs below a coverage cutoff, or with fewer than some percent of their bases covered. Uses coverage stats produced by BBMap or Pileup.
56 Notes: Singlethreaded, low memory.
57
58 Name: driver.MergeCoverageOTU
59 Shellscript: mergeOTUs.sh
60 Description: Merges coverage stats lines (from Pileup) for the same OTU, according to some custom naming scheme. See also CoveragePileup.
61 Notes: Singlethreaded, low memory.
62
63 Name: jgi.SamToEst
64 Shellscript: bbest.sh
65 Description: Calculates EST (expressed sequence tags) capture by an assembly from a sam file. Designed to use BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered
66 Notes: Singlethreaded, low memory.
67
68 Name: assemble.Postfilter
69 Shellscript: postfilter.sh
70 Description: Maps reads, then filters an assembly by contig coverage. Intended to reduce misassembly rate of SPAdes by removing suspicious contigs. See also BBMap and FilterByCoverage.
71 Notes: Multithreaded, high memory.
72
73
74 Kmer Matching
75
76 Name: jgi.BBDukF
77 Shellscript: bbduk.sh
78 Description: Multipurpose tool for read preprocessing, which does adapter-trimming, quality-trimming, contaminant filtering, entropy filtering, sequence masking, quality score recalibration, format conversion, histogram generation, barcode filtering, gc filtering, kmer cardinality estimation, and many similar tasks.
79 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 20 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor); if no reference is loaded, little memory is needed.
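
The idea behind kmer-based contaminant filtering can be sketched in a few lines of Python. This is an illustration of the general technique under simplifying assumptions (string kmers held in a Python set, a single shared kmer counts as a match, no hdist or edist); the function names are hypothetical and it is not BBDuk's actual implementation.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_kmers(seq, k):
    """Yield each kmer as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):      # skip kmers containing N etc.
            yield min(kmer, reverse_complement(kmer))


def filter_reads(reads, references, k=5):
    """Split reads into (clean, matched) by shared kmers with the references."""
    ref_kmers = {km for ref in references for km in canonical_kmers(ref, k)}
    clean, matched = [], []
    for read in reads:
        hit = any(km in ref_kmers for km in canonical_kmers(read, k))
        (matched if hit else clean).append(read)
    return clean, matched
```

Because kmers are stored in canonical form, a read is matched regardless of which strand it came from. Every reference kmer occupies memory, which is why the footprint scales with the size of the reference.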
80
81 Name: jgi.BBDuk2
82 Shellscript: bbduk2.sh
83 Description: Version of BBDuk that can do multiple kmer-based operations at once - left-trim, right-trim, filter, and mask.
84 Notes: See BBDuk.
85
86 Name: jgi.Seal
87 Shellscript: seal.sh
88 Description: Performs high-speed alignment-free sequence quantification or binning, by counting the number of long kmers that match between a read and a set of reference sequences. Designed for RNA-seq versus a transcriptome, metagenomic binning and abundance analysis, quantifying contamination, and similar. Very similar to BBDuk except that Seal associates each kmer with multiple reference sequences instead of just one, so it is superior in situations where multiple reference sequences may share a kmer. Unlike BBSplit, this supports unlimited read length. Can generate per-scaffold coverage, FPKMs when mapping to a transcriptome, and so forth. Also supports taxonomic classification.
89 Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 30 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor).
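
The difference described above, where one kmer may be associated with several reference sequences, can be illustrated with a small sketch. It assumes forward-strand kmers only and assigns each read to the reference sharing the most kmers; the function names are hypothetical and this does not reproduce Seal's actual data structures or ambiguity handling.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(references, k):
    """Map each kmer to the set of ALL reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def assign_read(read, index, k):
    """Return the reference sharing the most kmers with the read, or None."""
    hits = Counter()
    for km in kmers(read, k):
        for name in index.get(km, ()):
            hits[name] += 1
    return hits.most_common(1)[0][0] if hits else None
```

A kmer shared by two references contributes a hit to both, so the decision rests on the kmers that distinguish them.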
90
91 Name: driver.SummarizeSealStats
92 Shellscript: summarizeseal.sh
93 Description: Summarizes the stats output of Seal for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. Also allows filtering of certain libraries to mask some classes of contamination. Because Seal supports arbitrarily-long sequences, this is a better choice than BBMap for evaluating assemblies. See also Seal, SummarizeCoverage.
94 Notes: Singlethreaded, low memory.
95
96
97 Kmer Counting
98
99 Name: jgi.LogLog
100 Shellscript: loglog.sh
101 Description: Estimates the number of unique kmers within a dataset to within ~10%.
102 Notes: Multithreaded, low memory. This can also be done with other programs such as BBDuk by adding the loglog flag.
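
To show how the number of unique kmers can be estimated in low memory, here is a sketch of a HyperLogLog-style estimator, a close relative of the LogLog algorithm. The choice of HyperLogLog, the hash function, and the register count are assumptions made for illustration, not a description of what loglog.sh does internally.

```python
import hashlib
import math


def estimate_cardinality(items, p=12):
    """HyperLogLog-style estimate of the number of distinct items.

    Only 2**p small registers are kept, never the items themselves.
    """
    m = 1 << p
    registers = [0] * m
    for item in items:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        bucket = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1   # position of leftmost 1-bit
        registers[bucket] = max(registers[bucket], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:             # small-range correction
        estimate = m * math.log(m / zeros)
    return estimate
```

Repeated items hash to the same register and rank, so duplicates do not change the estimate; memory stays fixed at 2**p registers however large the input is.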
103
104 Name: jgi.KmerCountExact
105 Shellscript: kmercountexact.sh
106 Description: Counts kmers in sequence data. Capable of outputting the kmers and their counts as fasta or 2-column tsv, as well as a frequency histogram. No kmer length limits.
107 Notes: Multithreaded, high memory.
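
The outputs listed above (kmer counts as 2-column tsv, plus a frequency histogram) can be illustrated with a minimal exact counter. This is a sketch that keeps everything in a Python dictionary and counts forward-strand kmers only; the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Exact count of every kmer of length k in the given sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map each depth to the number of distinct kmers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


def to_tsv(counts):
    """Render the counts as 2-column, tab-separated text."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))
```

Exact counting stores every distinct kmer, which is why a tool of this kind is classified as high memory.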
108
109 Name: jgi.KmerNormalize (generally referred to as BBNorm)
110 Shellscript: bbnorm.sh, ecc.sh, khist.sh
111 Description: Uses a lossy data structure (count-min sketch) to perform kmer-based normalization, error-correction, and/or depth-binning on reads.
112 Notes: Multithreaded, high memory. BBNorm will never run out of memory; rather, as the amount of data increases, the accuracy decreases. Therefore you should always use all available memory for best accuracy. The error correction by Tadpole is superior, but Tadpole can run out of memory with large datasets.
113 Additional Shellscripts: KmerNormalize is called by 3 different shellscripts, which differ only in their default parameters (which can be overridden). bbnorm.sh does 2-pass normalization only; ecc.sh does error-correction only; and khist.sh only makes a kmer histogram, without ignoring the low-quality kmers (as is done by ecc and bbnorm). However, if you add the flag "ecc" to bbnorm.sh, it will also do error-correction, and so forth; with the same parameters they are all identical.
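
The trade-off described in the Notes above (fixed memory, with accuracy falling as data grows) is characteristic of a count-min sketch. The class below is a minimal illustration with assumed sizes and an assumed hash function; it is not BBNorm's implementation.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter: estimates can be too high, never too low."""

    def __init__(self, width=1024, depth=3):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One independent hash per row, obtained by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                                     salt=bytes([row])).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._cells(kmer):
            self.rows[row][col] += 1

    def estimate(self, kmer):
        # Collisions only inflate cells, so the minimum is the best estimate.
        return min(self.rows[row][col] for row, col in self._cells(kmer))
```

Memory is width * depth counters regardless of input size. As more distinct kmers are added, collisions become more frequent and the estimates drift upward, which matches the advice to give the program as much memory as possible.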
114
115 Name: jgi.CalcUniqueness
116 Shellscript: bbcountunique.sh
117 Description: Generates a kmer uniqueness histogram, binned by file position. Designed to analyze library complexity, and determine how much sequencing is needed before reaching saturation. Outputs both single-read uniqueness and pair uniqueness.
118 Notes: Singlethreaded, high memory (around 100 bytes per read pair).
119
120 Name: jgi.SmallKmerFrequency
121 Shellscript: commonkmers.sh
122 Description: Prints the most common kmers in a sequence, their counts, and the sequence header. K is limited to 15.
123 Notes: Singlethreaded, low memory. Memory is proportional to 4^k, and is trivial for short kmers under 10.
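
The 4^k memory relationship follows from keeping one counter per possible kmer. The sketch below assumes a 2-bit encoding per base and a dense counter table; the function name and the tie-breaking order are hypothetical choices for illustration.

```python
def most_common_kmers(seq, k, top=3):
    """Count kmers in a dense table of 4**k counters and return the top few."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    counts = [0] * (4 ** k)              # memory grows as 4**k
    mask = (4 ** k) - 1
    value, valid = 0, 0
    for base in seq.upper():
        if base not in code:             # ambiguous base: restart the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code[base]) & mask
        valid += 1
        if valid >= k:
            counts[value] += 1

    def decode(v):
        return "".join("ACGT"[(v >> (2 * (k - 1 - i))) & 3] for i in range(k))

    ranked = sorted(range(len(counts)), key=lambda v: (-counts[v], v))
    return [(decode(v), counts[v]) for v in ranked[:top] if counts[v] > 0]
```

With k=10 the table has about one million entries; with k=15 it has about one billion, which is where a dense table stops being trivial.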
124
125 Name: jgi.KmerCoverage
126 Shellscript: kmercoverage.sh
127 Description: Annotates reads with their kmer depth.
128 Notes: Deprecated. Multithreaded, high memory.
129
130 Name: jgi.CallPeaks
131 Shellscript: callpeaks.sh
132 Description: Calls peaks from a kmer frequency histogram, such as that from BBNorm or KmerCountExact. Also estimates genome size and other statistics.
133 Notes: Singlethreaded, low memory. Normally called automatically by programs that make the histogram. The peak-calling logic is not very sophisticated and could be improved.
134
135
136 Assembly
137
138 Name: assemble.Tadpole
139 Shellscript: tadpole.sh
140 Description: Very fast kmer-based assembler, designed for haploid organisms. Performs well on single cells, viruses, organelles, and in other situations with small genomes and potentially uneven or very high coverage. Also has modes for read error-correction and extension, instead of assembly; Tadpole's error-correction is superior to BBNorm's. No upper limit on kmer length. See also KmerCountExact, KmerCompressor, LogLog, BBMerge, KmerNormalize.
141 Notes: Multithreaded, high memory. Memory consumption is a strict function of the number of unique input kmers.
142
143 Name: assemble.TadpoleWrapper
144 Shellscript: tadwrapper.sh
145 Description: Generates multiple assemblies with Tadpole to estimate the optimal kmer length.
146 Notes: Multithreaded, high memory.
147
148 Name: assemble.KmerCompressor
149 Shellscript: kcompress.sh
150 Description: Generates a minimal fasta file containing each kmer from the input sequence exactly once. Optionally includes only kmers within a certain depth range. Arbitrary kmer set operations are possible via multiple passes. Very similar to an assembler.
151 Notes: Multithreaded, high memory. Contains a singlethreaded phase.
152
153 Name: jgi.AssemblyStats2
154 Shellscript: stats.sh
155 Description: Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc. Also generates per-scaffold length and base content statistics, and can estimate BBMap's memory requirements for an assembly. See also StatsWrapper.
156 Notes: Singlethreaded, low memory.
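
For readers unfamiliar with the statistics named above, here is a minimal sketch of N50, L50, and GC content. It assumes the common convention in which N50 is a length and L50 is a count; the labels printed by stats.sh may follow a different naming convention, so check its output header rather than relying on this sketch.

```python
def n50_l50(lengths):
    """Return (N50, L50): the length at which the running total of the
    longest-first sorted lengths reaches half the assembly size, and the
    number of sequences needed to get there."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    return 0, 0


def gc_content(sequences):
    """Fraction of bases that are G or C across all sequences."""
    total = sum(len(s) for s in sequences)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return gc / total if total else 0.0
```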
157
158 Name: jgi.AssemblyStatsWrapper
159 Shellscript: statswrapper.sh
160 Description: Generates stats on multiple assemblies, allowing tab-delimited columns with one assembly per row, and only one header.
161 Notes: Singlethreaded, low memory.
162
163 Name: jgi.CountGC
164 Shellscript: countgc.sh
165 Description: Counts GC content of reads or scaffolds.
166 Notes: Deprecated; superseded by AssemblyStats.
167
168 Name: jgi.FungalRelease
169 Shellscript: fungalrelease.sh
170 Description: Reformats a fungal assembly for release. Also creates contig and agp files.
171 Notes: Singlethreaded, low memory.
172
173
174 Taxonomy
175
176 Name: tax.FilterByTaxa
177 Shellscript: filterbytaxa.sh
178 Description: Filters sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. Relies on NCBI taxdump processed using taxtree.sh and gitable.sh.
179 Notes: Singlethreaded, low memory.
180
181 Name: tax.RenameGiToNcbi
182 Shellscript: gi2taxid.sh
183 Description: Renames sequences with gi numbers to NCBI taxa IDs. This allows taxonomy processing without a gi number lookup.
184 Notes: Singlethreaded, high memory. TODO: Can be made low memory if slightly altered to accept gitable.int1d files.
185
186 Name: tax.GiToNcbi
187 Shellscript: gitable.sh
188 Description: Condenses gi_taxid_nucl.dmp from NCBI taxdmp to gitable.int1d, a more efficient representation, used by other tools for translating gi numbers to taxIDs. See also TaxTree.
189 Notes: Singlethreaded, high memory.
190
191 Name: tax.SortByTaxa
192 Shellscript: sortbytaxa.sh
193 Description: Sorts sequences into taxonomic order by some depth-first traversal of the Tree of Life as defined by NCBI taxdump. Sequences must be labelled with taxonomic identifiers.
194 Notes: Singlethreaded, high memory.
195
196 Name: tax.SplitByTaxa
197 Shellscript: splitbytaxa.sh
198 Description: Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name.
199 Notes: Multithreaded, high memory. If the number of threads is restricted and the sequences are fairly short, regardless of the total number, this may be run using low memory.
200
201 Name: tax.PrintTaxonomy
202 Shellscript: taxonomy.sh
203 Description: Prints the full taxonomy of a given taxonomic identifier (such as homo_sapiens).
204 Notes: Singlethreaded, low memory.
205
206 Name: tax.TaxTree
207 Shellscript: taxtree.sh
208 Description: Creates tree.taxtree from names.dmp and nodes.dmp, which are part of the NCBI taxdump. The taxtree file is needed by programs that use taxonomy, like Seal and SortByTaxa.
209 Notes: Singlethreaded, high memory.
210
211 Name: driver.ReduceSilva
212 Shellscript: reducesilva.sh
213 Description: Reduces Silva entries down to one entry per specified taxonomic level. Designed to increase the efficiency of operations like mapping, in which having thousands of substrains represented is not helpful.
214 Notes: Singlethreaded, low memory.
215
216
217 Cross-Contamination
218
219 Name: jgi.SynthMDA
220 Shellscript: synthmda.sh
221 Description: Generates synthetic reads following an MDA-amplified single cell's coverage distribution. Designed for single-cell assembly and analysis optimization. See also CrossContaminate, RandomReads.
222 Notes: Singlethreaded, medium memory (needs around 4GB).
223
224 Name: jgi.CrossContaminate
225 Shellscript: crosscontaminate.sh
226 Description: Generates synthetic cross-contaminated files from clean files. Intended for use with synthetic reads generated by SynthMDA or RandomReads. Designed to evaluate the effects of cross-contamination on assembly, and the efficacy of decontamination methods.
227 Notes: Singlethreaded, high memory.
228
229 Name: jgi.DecontaminateByNormalization
230 Shellscript: decontaminate.sh, crossblock.sh
231 Description: Removes contaminant contigs from assemblies of multiplexed libraries via normalization and mapping.
232 Notes: Multithreaded, high memory. Mostly a wrapper for other programs like BBMap, BBNorm, and FilterByCoverage.
233
234
235 Deduplication and Clustering
236
237 Name: jgi.Dedupe
238 Shellscript: dedupe.sh
239 Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, fully contained subsequences, or subsequences within some edit distance. Can also find overlapping sequences and group them into clusters based on transitive reachability; for example, clustering full-length 16S PacBio reads by species.
240 Notes: Multithreaded, high memory. This program has a jni mode which increases speed dramatically if an edit distance is used.
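
The first two duplicate definitions above (exact matches and fully contained subsequences) can be sketched as follows. This is a naive all-against-all illustration with hypothetical names; it treats reverse complements as duplicates, ignores edit distance and clustering, and does not reflect how Dedupe achieves its speed.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dedupe(sequences, absorb_containments=True):
    """Keep one copy of each sequence, processing the longest first.

    With absorb_containments=True, a sequence is also dropped when it
    (or its reverse complement) is a substring of a kept sequence.
    """
    kept = []
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        rc = reverse_complement(seq)
        if absorb_containments:
            duplicate = any(seq in k or rc in k for k in kept)
        else:
            duplicate = any(seq == k or rc == k for k in kept)
        if not duplicate:
            kept.append(seq)
    return kept
```

Processing the longest sequences first guarantees that a contained sequence is always compared against the sequence that could absorb it.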
241
242 Name: jgi.Dedupe2
243 Shellscript: dedupe2.sh
244 Description: Allows more kmer seeds than Dedupe. This will be automatically called by Dedupe if needed.
245 Notes: See Dedupe.
246
247 Name: jgi.DedupeByMapping
248 Shellscript: dedupebymapping.sh
249 Description: Removes duplicate reads or read pairs from a sam/bam file based on mapping coordinates. The sam file does not need to be sorted.
250 Notes: Singlethreaded, high memory.
251
252 Name: clump.Clumpify
253 Shellscript: clumpify.sh
254 Description: Rearranges unsorted reads into small clumps of reads, such that each clump shares a kmer, and thus probably overlaps. Can also create consensus sequence from these clumps.
255 Notes: Multithreaded, low or high memory. Memory consumption may be made arbitrarily small by using a user-specified number of temp files for bucket-sorting. By default, it will try to grab all available memory.
256
257
258 Read Merging
259
260 Name: jgi.BBMerge
261 Shellscript: bbmerge.sh, bbmerge-auto.sh
262 Description: Merges paired reads into single reads by overlap detection. With sufficient coverage, can also merge nonoverlapping reads by kmer extension.
263 Notes: Multithreaded, low memory. If kmers are used (for extension or error-correction), it will need much more memory, and the shellscript bbmerge-auto.sh should be used, which tries to acquire all available RAM. This program has a jni mode which increases speed by around 20%.
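
Merging by overlap detection can be illustrated with a naive sketch: reverse-complement the second read, then look for the longest suffix of read 1 that matches a prefix of it. The function names and thresholds are hypothetical, and the sketch ignores quality scores, inserts shorter than the read length, and kmer extension, all of which a production merger has to handle.

```python
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_pair(read1, read2, min_overlap=4, max_mismatches=0):
    """Merge a pair by the longest suffix/prefix overlap, or return None."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail, head = read1[-overlap:], mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return read1 + mate[overlap:]
    return None


# Two reads drawn from opposite ends of one 15 bp fragment.
fragment = "ACGTTGCATGCCAGT"
r1 = fragment[:10]
r2 = reverse_complement(fragment[5:])
print(merge_pair(r1, r2) == fragment)
```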
264
265 Name: jgi.MateReadsMT
266 Shellscript: bbmergegapped.sh
267 Description: Uses gapped kmers to merge nonoverlapping reads.
268 Notes: Deprecated; superseded by BBMerge.
269
270
271 Synthetic Read Generation and Benchmarking
272
273 Name: align2.RandomReads3
274 Shellscript: randomreads.sh
275 Description: Generates random synthetic reads from a reference genome, annotated with their genomic origin. Allows precise customization of things like insert size and synthetic mutation type, sizes, and rates. Read names are parsed by various other BBTools to grade accuracy.
276 Notes: Singlethreaded, high memory.
277
278 Name: jgi.FakeReads
279 Shellscript: bbfakereads.sh
280 Description: Generates fake read pairs from ends of contigs or single reads. Intended for use in generating a fake LMP library for scaffolding, using additional information like another assembly, or very long reads (like PacBio). This can also be accomplished with RandomReads.
281 Notes: Singlethreaded, low memory.
282
283 Name: align2.GradeSamFile
284 Shellscript: gradesam.sh
285 Description: Grades the accuracy of an aligner (such as BBMap) by parsing the output. The reads must be single-ended and annotated as though generated by RandomReads.
286 Notes: Singlethreaded, low memory.
287
288 Name: align2.MakeRocCurve
289 Shellscript: samtoroc.sh
290 Description: Creates an ROC plot (technically, true-positive versus false-positive) from a sam or bam file of mapped reads. The reads should be single-ended with headers generated by RandomReads.
291 Notes: Singlethreaded, low memory.
292
293 Name: jgi.AddAdapters
294 Shellscript: addadapters.sh
295 Description: Randomly adds adapters to a file, or grades a trimmed file. The input is a set of reads, paired or unpaired. The output is those same reads with adapter sequence replacing some of the bases in some reads. For paired reads, adapters are located in the same position in read1 and read2. This is designed for benchmarking adapter-trimming software (such as BBDuk), and evaluating methodology. Adapters can alternately be added by RandomReads, in which case insert size is used to determine where the adapters go.
296 Notes: Singlethreaded, low memory.
297
298 Name: jgi.GradeMergedReads
299 Shellscript: grademerge.sh
300 Description: Grades the accuracy of a read-merging program (such as BBMerge) by parsing the output. The reads must be annotated by their insert size. This can be done by generating them with RandomReads and renaming with RenameReads.
301 Notes: Singlethreaded, low memory.
302
303 Name: align2.PrintTime
304 Shellscript: printtime.sh
305 Description: Prints time elapsed since last called on the same file.
306 Notes: Singlethreaded, low memory.
307
308
309 16S, Primers, and Amplicons
310
311 Name: jgi.FindPrimers
312 Shellscript: msa.sh
313 Description: Aligns a query sequence to reference sequences. Outputs the best matching position per reference sequence. If there are multiple queries, only the best-matching query will be used. Designed to find primer binding sites in a sequence that may contain indels, such as a PacBio read, using a MultiStateAligner.
314 Notes: Singlethreaded, high memory. TODO: Could easily be made multithreaded using A_SampleMT.
315
316 Name: jgi.CutPrimers
317 Shellscript: cutprimers.sh
318 Description: Cuts out sequences corresponding to primers identified in sam files. Used in conjunction with FindPrimers (msa.sh).
319 Notes: Singlethreaded, low memory.
320
321 Name: jgi.IdentityMatrix
322 Shellscript: idmatrix.sh
323 Description: Generates an identity matrix via all-to-all alignment of sequences in a file. Intended for 16S or other amplicon analysis. See also CorrelateIdentity.
324 Notes: Multithreaded, high memory. Time complexity is O(N^2) with the number of reads.
325
326 Name: driver.CorrelateIdentity
327 Shellscript: matrixtocolumns.sh
328 Description: Transforms two matched identity matrices into 2-column format, one row per entry, one column per matrix. Designed for comparing different 16S subregions. See also IdentityMatrix, FindPrimers.
329 Notes: Singlethreaded, high memory. The actual amount of memory just depends on the matrix sizes.
330
331
332 Barcodes
333
334 Name: jgi.CountBarcodes
335 Shellscript: countbarcodes.sh
336 Description: Counts the number of reads with each barcode. Assumes read names have the barcode at the end.
337 Notes: Singlethreaded, low memory.
338
339 Name: jgi.CorrelateBarcodes
340 Shellscript: filterbarcodes.sh
341 Description: Filters barcodes by quality, and generates quality histograms. See also MergeBarcodes.
342 Notes: Singlethreaded, low memory.
343
344 Name: jgi.MergeBarcodes
345 Shellscript: mergebarcodes.sh
346 Description: Concatenates barcodes and barcode quality onto read names. Designed to analyze the effects of barcode quality on library misassignment. See also CorrelateBarcodes.
347 Notes: Singlethreaded, low memory.
348
349 Name: jgi.RemoveBadBarcodes
350 Shellscript: removebadbarcodes.sh
351 Description: Removes reads with improper barcodes - either with no barcode, or a barcode containing a degenerate base.
352 Notes: Singlethreaded, low memory. Mostly a test case for extending BBTool_ST.
353
354
355 Filtering and Demultiplexing
356
357 Name: jgi.DemuxByName
358 Shellscript: demuxbyname.sh
359 Description: Demultiplexes reads into multiple files based on their name, by matching a suffix or prefix.
360 Notes: Singlethreaded, low memory.
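
Demultiplexing by name amounts to grouping records by a prefix or suffix of the read name. The sketch below works on in-memory (name, sequence) tuples rather than files; the function name and the "unmatched" bin are hypothetical.

```python
from collections import defaultdict


def demux_by_name(records, names, mode="suffix"):
    """Group (name, sequence) records by the first matching prefix or suffix."""
    bins = defaultdict(list)
    for name, seq in records:
        for key in names:
            matched = name.endswith(key) if mode == "suffix" else name.startswith(key)
            if matched:
                bins[key].append((name, seq))
                break
        else:
            # No key matched this read name.
            bins["unmatched"].append((name, seq))
    return dict(bins)
```

In a file-based version each bin would correspond to one output file.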
361
362 Name: jgi.FilterBySequence
363 Shellscript: filterbysequence.sh
364 Description: Filters reads by exact sequence match. Allows case-sensitive or insensitive matches, and reverse-complement matches or only forward matches.
365 Notes: Multithreaded, high memory.
366
367 Name: driver.FilterReadsByName
368 Shellscript: filterbyname.sh
369 Description: Filters reads by name. Allows substring matching, though that is much slower.
370 Notes: Singlethreaded, low memory.
371
372 Name: jgi.FilterReadsWithSubs
373 Shellscript: filtersubs.sh
374 Description: Filters a sam file to select only reads with substitution errors for bases with quality scores in a certain interval. Used for manually examining specific reads that may contain incorrectly calibrated quality scores.
375 Notes: Singlethreaded, low memory.
376
377 Name: jgi.GetReads
378 Shellscript: getreads.sh
379 Description: Fetches the reads with specified numeric IDs (unrelated to their names). The first read (or pair) in a file has ID 0, the second read (or pair) has ID 1, etc.
380 Notes: Singlethreaded, low memory.
381
382 Name: driver.EstherFilter
383 Shellscript: estherfilter.sh
384 Description: BLASTs queries against reference, and filters out hits with scores less than 'cutoff'.
385 Notes: All the work is done by blastall, which dictates the performance characteristics.
386
387
388 JGI-Exclusive Preprocessing Wrappers
389
390 Name: jgi.BBQC
391 Shellscript: bbqc.sh
392 Description: Wrapper for various read preprocessing operations.
393 Notes: Deprecated; superseded by RQCFilter. Designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
394
395 Name: jgi.RQCFilter
396 Shellscript: rqcfilter.sh
397 Description: Acts as a wrapper/pipeline for read preprocessing. Performs quality-trimming, artifact removal, linker-trimming, adapter trimming, spike-in removal, vertebrate contaminant removal, microbial contaminant removal, and generates various histogram and statistics files used by RQC.
398 Notes: Multithreaded, high memory. Currently requires 39500m RAM and thus can run on a 40G node, but it's recommended to submit it exclusive, as all stages are fully multithreaded. Designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
399
400
401 Shredding and Sorting
402
403 Name: jgi.Shred
404 Shellscript: shred.sh
405 Description: Shreds long sequences into shorter sequences, with overlap length and variable-length options. See also Fuse.
406 Notes: Singlethreaded, low memory.
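
Shredding with an overlap can be sketched as a sliding window whose step is the piece length minus the overlap. The function name and argument names are hypothetical and only the fixed-length case is shown.

```python
def shred(seq, length, overlap=0):
    """Cut seq into pieces of at most `length` bases; neighbours share `overlap` bases."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be non-negative and smaller than length")
    step = length - overlap
    pieces = []
    for start in range(0, len(seq), step):
        pieces.append(seq[start:start + length])
        if start + length >= len(seq):   # this piece reached the end
            break
    return pieces
```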
407
408 Name: jgi.FuseSequence
409 Shellscript: fuse.sh
410 Description: Fuses sequences together, padding junctions with Ns. Does not support total length greater than 2 billion. Designed for use with Seal or BBDuk to make kmer tracking for a given genome more efficient. See also Shred.
411 Notes: Singlethreaded, high memory.
412
413 Name: jgi.Shuffle
414 Shellscript: shuffle.sh
415 Description: Reorders reads randomly, keeping pairs together. Also supports some sorting operations, like alphabetically by name or by sequence.
416 Notes: Singlethreaded, high memory. All operations are in-memory.
417
418
419 Non-Sequence-Related
420
421 Name: Calcmem - Shellscript Only
422 Shellscript: calcmem.sh
423 Description: Calculates available memory for other shellscripts. Designed for Genepool but works fine on many Linux configurations.
424 Notes: If java is being killed for allocating too much memory, this is the script to fix.
425
426 Name: fileIO.TextFile
427 Shellscript: textfile.sh
428 Description: Displays contents of a text file, optionally between a start and stop line. Useful mainly in Windows where there are few command-line utilities.
429 Notes: Singlethreaded, low memory.
430
431 Name: driver.CountSharedLines
432 Shellscript: countsharedlines.sh
433 Description: Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared. This is not designed for sequence data, but more for things like sequence names or organism names. See filterlines.sh for actually filtering shared lines in a more normal fashion.
434 Notes: Singlethreaded, low memory.
435
436 Name: driver.FilterLines
437 Shellscript: filterlines.sh
438 Description: Filters lines by exact match or substring. This is not designed for sequence data, but for things like sequence names or organism names.
439 Notes: Singlethreaded, low memory.
440
441
442 Other Tools
443
444 Name: jgi.A_SampleMT
445 Shellscript: a_sample_mt.sh
446 Description: Does nothing. Serves as a template for easily making new BBTools by dropping in code.
447 Notes: Multithreaded, high memory. Be sure to modify the shellscript line " freeRam 4000m 84" as needed. The first is the amount of memory used if available memory cannot be calculated, the second is the percentage of free memory to use if it can be calculated.
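
The meaning of the two freeRam arguments described above can be paraphrased as a small function. This only restates the Notes line (a fallback size, and a percentage of free memory) with hypothetical names; it does not reproduce the actual logic in the shellscripts.

```python
def choose_heap_mb(free_mb=None, fallback_mb=4000, percent=84):
    """Pick a heap size in megabytes from the two freeRam-style arguments."""
    if free_mb is None:                  # free memory could not be calculated
        return fallback_mb
    return free_mb * percent // 100      # otherwise use a share of free memory
```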
448
449 Name: jgi.BBMask
450 Shellscript: bbmask.sh
451 Description: Masks sequences of low-complexity, or containing repeat kmers, or covered by mapped reads. Used to make masked versions of human, cat, dog, and mouse genomes; these are used for filtering vertebrate contamination from fungal/plant/microbial data without risk of false-positive removals.
452 Notes: Multithreaded, high memory. Uses around 2 bytes per reference base.
453
454 Name: jgi.CalcTrueQuality
455 Shellscript: calctruequality.sh
456 Description: Generates matrices used for quality-score recalibration. Requires one or more mapped sam files as input. The actual recalibration is done with another program such as BBDuk.
457 Notes: Multithreaded, low memory.
458
459 Name: jgi.MakeChimeras
460 Shellscript: makechimeras.sh
461 Description: Makes chimeric sequences by randomly fusing together nonchimeric sequences. Designed for analyzing chimera removal effectiveness.
462 Notes: Singlethreaded, low memory.
463
464 Name: jgi.PhylipToFasta
465 Shellscript: phylip2fasta.sh
466 Description: Transforms interleaved phylip to fasta.
467 Notes: Singlethreaded, high memory.
468
469 Name: jgi.MakeLengthHistogram
470 Shellscript: readlength.sh
471 Description: Makes a length histogram of sequences.
472 Notes: Singlethreaded, low memory. Can also be accomplished with Reformat or BBDuk, but with less flexibility.
473
474 Name: jgi.ReformatReads
475 Shellscript: reformat.sh
476 Description: Reformats sequence data into another format, such as interleaved ASCII-33 fastq to twin-file ASCII-64. Also supports a huge collection of simple optional operations, like trimming, filtering, reverse-complementing, modifying read names, and modifying read sequence.
477 Notes: Singlethreaded, low memory.
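
The example conversion mentioned above (interleaved ASCII-33 to twin-file ASCII-64) involves two independent steps, sketched here on in-memory data. The function names are hypothetical; a real converter also has to validate that the resulting characters stay in the printable range.

```python
def convert_quality(qualities, from_offset=33, to_offset=64):
    """Re-encode a FASTQ quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in qualities)


def deinterleave(records):
    """Split interleaved records (r1, r2, r1, r2, ...) into two lists."""
    return records[0::2], records[1::2]
```

For instance, the ASCII-33 character "I" encodes quality 40 and becomes "h" in ASCII-64.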
478
479 Name: pacbio.RemoveAdapters2
480 Shellscript: removesmartbell.sh
481 Description: Detects or removes SmartBell adapters from PacBio reads, by aligning the adapter using a customized version of the MultiStateAligner.
482 Notes: Multithreaded, low memory.
483
484 Name: jgi.RenameReads
485 Shellscript: rename.sh
486 Description: Renames reads according to some specified prefix. Can also rename by insert size or mapping location.
487 Notes: Singlethreaded, low memory.
488
489 Name: jgi.SplitPairsAndSingles
490 Shellscript: repair.sh, bbsplitpairs.sh
491 Description: Separates paired reads into files of pairs and singletons by removing reads that are shorter than a min length, or have no mate. Can also reorder arbitrarily-ordered reads in files where the pairing order was desynchronized. See also Reformat's vint flag.
492 Notes: Singlethreaded, low or high memory. All operations are low-memory except reordering arbitrarily disordered files, which is optional.
493
494 Name: jgi.SplitNexteraLMP
495 Shellscript: splitnextera.sh
496 Description: Trims and splits Nextera LMP libraries into subsets based on linker orientation: LMP, fragment, unknown, and singleton.
497 Notes: Singlethreaded, low memory. TODO: Should be reimplemented using A_SampleMT.
498
499 Name: jgi.SplitSamFile
500 Shellscript: splitsam.sh
501 Description: Splits a sam file into three files: Plus-mapped reads, Minus-mapped reads, and Unmapped.
502 Notes: Singlethreaded, low memory.
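
The three-way split can be expressed in terms of the SAM FLAG field, where bit 0x4 marks an unmapped read and bit 0x10 marks a read mapped to the reverse strand. The sketch below works on lines already read into memory and uses a hypothetical function name.

```python
def split_sam(lines):
    """Partition SAM alignment lines into plus, minus, and unmapped lists."""
    plus, minus, unmapped = [], [], []
    for line in lines:
        if line.startswith("@"):         # header lines carry no alignment
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 0x4:                   # 0x4: read is unmapped
            unmapped.append(line)
        elif flag & 0x10:                # 0x10: mapped to the reverse strand
            minus.append(line)
        else:
            plus.append(line)
    return plus, minus, unmapped
```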
503
504 Name: fileIO.FileFormat
505 Shellscript: testformat.sh
506 Description: Tests the format of a sequence-containing file. Determines format (fasta, fastq, etc), quality encoding, compression type, interleaving, and read length. All BBTools use this to determine how to process a file.
507 Notes: Singlethreaded, low memory.
508
509 Name: jgi.TranslateSixFrames
510 Shellscript: translate6frames.sh
511 Description: Translates nucleotide sequences to all 6 amino acid frames, or amino acids to a canonical nucleotide representation.
512 Notes: Singlethreaded, low memory.
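
Six-frame translation means translating the sequence at offsets 0, 1, and 2 on the forward strand and again on the reverse complement. The sketch below assumes the standard genetic code and marks codons with ambiguous bases as "X"; the names are hypothetical and the reverse direction (amino acids to nucleotides) is not shown.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frames(seq):
    """Return frames +1, +2, +3 followed by -1, -2, -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    return [translate(strand[offset:]) for strand in (seq, rc) for offset in range(3)]
```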
513
514
515 Template
516
517 Name:
518 Shellscript:
519 Description:
520 Notes: