jpayne@68
Concise descriptions of BBTools.
For complete documentation of a specific tool, please see its shellscript, and its guide if available.



Note on threads:

Virtually all BBTools are multithreaded. If a description indicates that a tool is singlethreaded, that generally means there is only 1 worker thread. File input and output are usually in separate threads, so a "singlethreaded" program like ReformatReads may be observed using over 250% of the resources of a single core (in other words, 2.5 threads on average, with 1 input file and 1 output file). Programs listed as multithreaded, on the other hand, will automatically use all available threads (meaning the number of logical processors) unless restricted. Most multithreaded tools scale near-linearly with the number of cores up to at least 32.

Note on memory:

The memory usage classification of "low" or "high" is based on assumptions; with the exception of AssemblyStats (which uses a fixed amount of memory), the actual amount of memory needed varies based on the parameters and input files. While all programs can be forced to use a specific amount of memory with the -Xmx flag, the tools classified as low memory will try to grab only a small amount of memory by default when run via the shellscript, while the ones listed as high memory will try to grab all available memory.


Alignment and Coverage-Related

Name: align2.BBMap
Shellscript: bbmap.sh, removehuman.sh, removehuman2.sh, mapnt.sh
Description: Fast and accurate splice-aware read aligner for DNA and RNA. Finds optimal global alignments. Maximum read length is 600bp.
Notes: Multithreaded, high memory. Memory usage depends on the size of the reference; roughly 6 bytes per base, or 3 bytes per base with the flag "usemodulo".
Additional Shellscripts: removehuman.sh calls BBMap with a prebuilt index and parameters designed to remove human contamination with zero false-positives; removehuman2.sh is designed to minimize false-negatives at the expense of allowing some false-positives. mapnt.sh calls BBMap with a prebuilt index and parameters designed to allow mapping to nt while running on a 120GB node. All of these are designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
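
The BBMap memory rule of thumb above (roughly 6 bytes per reference base, or 3 with "usemodulo") can be turned into a quick estimate. This is only a sketch of that arithmetic; the function names and the example reference size are illustrative and not part of BBTools, and real usage also depends on other settings.

```python
def estimate_bbmap_index_bytes(reference_bases: int, usemodulo: bool = False) -> int:
    """Rough index size using the per-base figures quoted in the BBMap notes."""
    if reference_bases < 0:
        raise ValueError("reference_bases must be non-negative")
    bytes_per_base = 3 if usemodulo else 6
    return reference_bases * bytes_per_base


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB for easier reading."""
    return n_bytes / 2**30


if __name__ == "__main__":
    reference = 3_100_000_000  # hypothetical reference size in bases
    print(f"default:   {to_gib(estimate_bbmap_index_bytes(reference)):.1f} GiB")
    print(f"usemodulo: {to_gib(estimate_bbmap_index_bytes(reference, True)):.1f} GiB")
```

An estimate like this is mainly useful for deciding whether a reference will fit on a given node before choosing a value for -Xmx.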

Name: align2.BBMapPacBio
Shellscript: mapPacBio.sh
Description: Version of BBMap for long reads up to 6kbp. Designed for PacBio and Nanopore reads; uses alignment penalties weighted for PacBio's error model.
Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.

Name: align2.BBMapPacBioSkimmer
Shellscript: bbmapskimmer.sh
Description: Version of BBMap for mapping reads to all sites above a certain score threshold, rather than finding the single best mapping location. Uses alignment penalties weighted for PacBio's error model, as it was originally created to map Illumina reads to PacBio reads for error-correction.
Notes: Multithreaded, high memory. Memory usage depends on the size of the reference and number of threads.

Name: align2.BBSplit
Shellscript: bbsplit.sh
Description: Uses BBMap to map to multiple references simultaneously, and outputs one file per reference, containing all the reads that match it better than the other references. Used for metagenomic binning, distinguishing between closely-related organisms, and contamination removal.
Notes: See BBMap.

Name: align2.BBWrap
Shellscript: bbwrap.sh
Description: Allows multiple runs of BBMap on different input files without reloading the reference. Useful when the reference is very large.
Notes: See BBMap.

Name: jgi.CoveragePileup
Shellscript: pileup.sh
Description: Calculates coverage information from an unsorted or sorted sam or bam file. Outputs per-scaffold coverage, per-base coverage, binned coverage, normalized coverage, per-ORF coverage (using PRODIGAL's format), coverage histograms, stranded coverage, physical coverage, FPKMs, and various others.
Notes: Singlethreaded, high memory. TODO: Would not be overly difficult to make a multithreaded version using A_SampleMT, but would require locks or queues.

Name: driver.SummarizeCoverage
Shellscript: summarizescafstats.sh
Description: Summarizes the scafstats output of BBMap for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. See also BBMap, Pileup, SummarizeSealStats.
Notes: Singlethreaded, low memory.

Name: jgi.FilterByCoverage
Shellscript: filterbycoverage.sh
Description: Filters an assembly by contig coverage, to remove contigs below a coverage cutoff, or with fewer than some percent of their bases covered. Uses coverage stats produced by BBMap or Pileup.
Notes: Singlethreaded, low memory.

Name: driver.MergeCoverageOTU
Shellscript: mergeOTUs.sh
Description: Merges coverage stats lines (from Pileup) for the same OTU, according to some custom naming scheme. See also CoveragePileup.
Notes: Singlethreaded, low memory.

Name: jgi.SamToEst
Shellscript: bbest.sh
Description: Calculates EST (expressed sequence tags) capture by an assembly from a sam file. Designed to use BBMap output generated with these flags: k=13 maxindel=100000 customtag ordered
Notes: Singlethreaded, low memory.

Name: assemble.Postfilter
Shellscript: postfilter.sh
Description: Maps reads, then filters an assembly by contig coverage. Intended to reduce misassembly rate of SPAdes by removing suspicious contigs. See also BBMap and FilterByCoverage.
Notes: Multithreaded, high memory.


Kmer Matching

Name: jgi.BBDukF
Shellscript: bbduk.sh
Description: Multipurpose tool for read preprocessing, which does adapter-trimming, quality-trimming, contaminant filtering, entropy filtering, sequence masking, quality score recalibration, format conversion, histogram generation, barcode filtering, gc filtering, kmer cardinality estimation, and many similar tasks.
Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 20 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor); if no reference is loaded, little memory is needed.

Name: jgi.BBDuk2
Shellscript: bbduk2.sh
Description: Version of BBDuk that can do multiple kmer-based operations at once - left-trim, right-trim, filter, and mask.
Notes: See BBDuk.

Name: jgi.Seal
Shellscript: seal.sh
Description: Performs high-speed alignment-free sequence quantification or binning, by counting the number of long kmers that match between a read and a set of reference sequences. Designed for RNA-seq versus a transcriptome, metagenomic binning and abundance analysis, quantifying contamination, and similar. Very similar to BBDuk except that Seal associates each kmer with multiple reference sequences instead of just one, so it is superior in situations where multiple reference sequences may share a kmer. Unlike BBSplit, this supports unlimited read length. Can generate per-scaffold coverage, FPKMs when mapping to a transcriptome, and so forth. Also supports taxonomic classification.
Notes: Multithreaded, high memory. Memory usage depends on the size of the reference (roughly 30 bytes per kmer) and whether hdist or edist are set (they multiply memory consumption by a large factor).
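
The idea behind Seal's kmer matching, where each kmer is associated with every reference that contains it and a read is scored by how many of its kmers hit each reference, can be sketched as follows. This is not Seal's implementation: it uses plain Python sets, a small k, canonical kmers, and no hdist/edist handling, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq: str, k: int):
    """Yield each kmer as the smaller of itself and its reverse complement."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip kmers containing N or other symbols
            rc = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, rc)


def build_index(references: dict, k: int) -> dict:
    """Map every kmer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in canonical_kmers(seq, k):
            index[kmer].add(name)
    return index


def count_hits(read: str, index: dict, k: int) -> Counter:
    """Count, per reference, how many kmers of the read match that reference."""
    hits = Counter()
    for kmer in canonical_kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits
```

Storing a set of names per kmer is what lets a kmer shared by two references count toward both, at the cost of more memory per kmer than a one-name-per-kmer table.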

Name: driver.SummarizeSealStats
Shellscript: summarizeseal.sh
Description: Summarizes the stats output of Seal for evaluation of cross-contamination. The intended use is to map multiple libraries or assemblies, of different multiplexed organisms, to a concatenated reference containing one fused scaffold per organism. This will convert all of the resulting stats files (one per library) to a single text file, with multiple columns, indicating how much of the input hit the primary versus nonprimary scaffolds. Also allows filtering of certain libraries to mask some classes of contamination. Because Seal supports arbitrarily-long sequences, this is a better choice than BBMap for evaluating assemblies. See also Seal, SummarizeCoverage.
Notes: Singlethreaded, low memory.


Kmer Counting

Name: jgi.LogLog
Shellscript: loglog.sh
Description: Estimates the number of unique kmers within a dataset to within ~10%.
Notes: Multithreaded, low memory. This can also be done with other programs such as BBDuk by adding the loglog flag.

Name: jgi.KmerCountExact
Shellscript: kmercountexact.sh
Description: Counts kmers in sequence data. Capable of outputting the kmers and their counts as fasta or 2-column tsv, as well as a frequency histogram. No kmer length limits.
Notes: Multithreaded, high memory.
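
To make the KmerCountExact outputs concrete, here is a small sketch of exact kmer counting that produces both a 2-column tsv and a frequency histogram. It is an illustration only: it counts forward-strand kmers, which may differ from the tool's defaults, and the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k: int) -> Counter:
    """Exact count of every kmer (forward strand only) across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # undefined bases do not form countable kmers
                counts[kmer] += 1
    return counts


def frequency_histogram(counts: Counter) -> Counter:
    """Map depth -> number of distinct kmers observed exactly that many times."""
    return Counter(counts.values())


def to_tsv(counts: Counter) -> str:
    """Render kmers and counts as 2-column, tab-separated text."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))
```

Because every distinct kmer is stored, memory grows with the number of unique kmers, which is consistent with the tool being listed as high memory.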

Name: jgi.KmerNormalize (generally referred to as BBNorm)
Shellscript: bbnorm.sh, ecc.sh, khist.sh
Description: Uses a lossy data structure (count-min sketch) to perform kmer-based normalization, error-correction, and/or depth-binning on reads.
Notes: Multithreaded, high memory. BBNorm will never run out of memory; rather, as the amount of data increases, the accuracy decreases. Therefore you should always use all available memory for best accuracy. The error correction by Tadpole is superior, but Tadpole can run out of memory with large datasets.
Additional Shellscripts: KmerNormalize is called by 3 different shellscripts, which differ only in their default parameters (which can be overridden). bbnorm.sh does 2-pass normalization only; ecc.sh does error-correction only; and khist.sh only makes a kmer histogram, without ignoring the low-quality kmers (as is done by ecc and bbnorm). However, if you add the flag "ecc" to bbnorm.sh, it will also do error-correction, and so forth; with the same parameters they are all identical.
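
The BBNorm note that memory never runs out but accuracy degrades follows from how a count-min sketch works: the table has a fixed size, and collisions can only inflate counts. The sketch below is a generic count-min sketch, not BBNorm's own data structure; the class name, the use of BLAKE2 hashing, and the parameters are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates are never below the true count."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One independent hash per row, obtained by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        for row, col in self._cells(item):
            self.rows[row][col] += 1

    def estimate(self, item: str) -> int:
        # The least-collided cell gives the tightest upper bound on the count.
        return min(self.rows[row][col] for row, col in self._cells(item))
```

Adding more kmers never enlarges the table; it only raises the chance that unrelated kmers share a cell, which is why giving the structure more memory improves accuracy.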

Name: jgi.CalcUniqueness
Shellscript: bbcountunique.sh
Description: Generates a kmer uniqueness histogram, binned by file position. Designed to analyze library complexity, and determine how much sequencing is needed before reaching saturation. Outputs both single-read uniqueness and pair uniqueness.
Notes: Singlethreaded, high memory (around 100 bytes per read pair).

Name: jgi.SmallKmerFrequency
Shellscript: commonkmers.sh
Description: Prints the most common kmers in a sequence, their counts, and the sequence header. K is limited to 15.
Notes: Singlethreaded, low memory. Memory is proportional to 4^k, and is trivial for short kmers under 10.
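
The 4^k memory note for SmallKmerFrequency can be illustrated with a dense counter array indexed by a 2-bit encoding of each kmer. This is a sketch of that general technique under the stated assumption, not the tool's actual code, and the helper names are hypothetical.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_table_size(k: int) -> int:
    """Number of counters needed for a dense table of all kmers of length k."""
    return 4 ** k


def count_small_kmers(seq: str, k: int) -> list:
    """Count kmers in a dense list of 4^k counters using a rolling 2-bit code."""
    counts = [0] * kmer_table_size(k)
    mask = kmer_table_size(k) - 1
    value = 0
    valid = 0  # consecutive defined bases seen so far
    for base in seq.upper():
        code = _CODE.get(base)
        if code is None:  # an undefined base resets the window
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[value] += 1
    return counts


def decode(index: int, k: int) -> str:
    """Turn a table index back into its kmer string."""
    return "".join("ACGT"[(index >> (2 * (k - 1 - i))) & 3] for i in range(k))


def most_common(seq: str, k: int, n: int) -> list:
    """Return up to n (kmer, count) pairs, most frequent first."""
    counts = count_small_kmers(seq, k)
    ranked = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return [(decode(i, k), counts[i]) for i in ranked[:n] if counts[i] > 0]
```

A dense table needs no hashing, but at k=15 it already holds 4^15 (about 1.07 billion) counters, which suggests why a cap on k is needed for this approach.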

Name: jgi.KmerCoverage
Shellscript: kmercoverage.sh
Description: Annotates reads with their kmer depth.
Notes: Deprecated. Multithreaded, high memory.

Name: jgi.CallPeaks
Shellscript: callpeaks.sh
Description: Calls peaks from a kmer frequency histogram, such as that from BBNorm or KmerCountExact. Also estimates genome size and other statistics.
Notes: Singlethreaded, low memory. Normally called automatically by programs that make the histogram. The peak-calling logic is not very sophisticated and could be improved.


Assembly

Name: assemble.Tadpole
Shellscript: tadpole.sh
Description: Very fast kmer-based assembler, designed for haploid organisms. Performs well on single cells, viruses, organelles, and in other situations with small genomes and potentially uneven or very high coverage. Also has modes for read error-correction and extension, instead of assembly; Tadpole's error-correction is superior to BBNorm's. No upper limit on kmer length. See also KmerCountExact, KmerCompressor, LogLog, BBMerge, KmerNormalize.
Notes: Multithreaded, high memory. Memory consumption is a strict function of the number of unique input kmers.

Name: assemble.TadpoleWrapper
Shellscript: tadwrapper.sh
Description: Generates multiple assemblies with Tadpole to estimate the optimal kmer length.
Notes: Multithreaded, high memory.

Name: assemble.KmerCompressor
Shellscript: kcompress.sh
Description: Generates a minimal fasta file containing each kmer from the input sequence exactly once. Optionally allows the inclusion only of kmers within a certain depth range. Arbitrary kmer set operations are possible via multiple passes. Very similar to an assembler.
Notes: Multithreaded, high memory. Contains a singlethreaded phase.

Name: jgi.AssemblyStats2
Shellscript: stats.sh
Description: Generates basic assembly statistics such as scaffold count, N50, L50, GC content, gap percent, etc. Also generates per-scaffold length and base content statistics, and can estimate BBMap's memory requirements for an assembly. See also StatsWrapper.
Notes: Singlethreaded, low memory.
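
For readers unfamiliar with the statistics named in the AssemblyStats2 description, the sketch below computes a few of them from a list of scaffold sequences. It assumes the common convention in which N50 is a length and L50 is a count; tools differ on this labeling, so check how stats.sh reports the two before comparing numbers. Computing GC over defined (non-N) bases is likewise an assumption, and the function name is hypothetical.

```python
def assembly_stats(scaffolds) -> dict:
    """Basic assembly statistics from an iterable of scaffold sequences."""
    seqs = [s.upper() for s in scaffolds]
    lengths = sorted((len(s) for s in seqs), reverse=True)
    total = sum(lengths)
    gaps = sum(s.count("N") for s in seqs)
    gc = sum(s.count("G") + s.count("C") for s in seqs)

    # Walk scaffolds from longest to shortest until half of all bases are covered.
    n50 = l50 = 0
    running = 0
    for rank, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, rank
            break

    defined = total - gaps
    return {
        "scaffold_count": len(lengths),
        "total_length": total,
        "n50": n50,  # length of the scaffold at the halfway point
        "l50": l50,  # number of scaffolds needed to reach the halfway point
        "gc_fraction": gc / defined if defined else 0.0,
        "gap_percent": 100.0 * gaps / total if total else 0.0,
    }
```

Everything here needs only a running total over scaffold lengths and base counts, which fits the tool's low-memory classification.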

Name: jgi.AssemblyStatsWrapper
Shellscript: statswrapper.sh
Description: Generates stats on multiple assemblies, allowing tab-delimited columns with one assembly per row, and only one header.
Notes: Singlethreaded, low memory.

Name: jgi.CountGC
Shellscript: countgc.sh
Description: Counts GC content of reads or scaffolds.
Notes: Deprecated; superseded by AssemblyStats.

Name: jgi.FungalRelease
Shellscript: fungalrelease.sh
Description: Reformats a fungal assembly for release. Also creates contig and agp files.
Notes: Singlethreaded, low memory.


Taxonomy

Name: tax.FilterByTaxa
Shellscript: filterbytaxa.sh
Description: Filters sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name. Relies on NCBI taxdump processed using taxtree.sh and gitable.sh.
Notes: Singlethreaded, low memory.

Name: tax.RenameGiToNcbi
Shellscript: gi2taxid.sh
Description: Renames sequences with gi numbers to NCBI taxIDs. This allows taxonomy processing without a gi number lookup.
Notes: Singlethreaded, high memory. TODO: Can be made low memory if slightly altered to accept gitable.int1d files.

Name: tax.GiToNcbi
Shellscript: gitable.sh
Description: Condenses gi_taxid_nucl.dmp from NCBI taxdump to gitable.int1d, a more efficient representation, used by other tools for translating gi numbers to taxIDs. See also TaxTree.
Notes: Singlethreaded, high memory.

Name: tax.SortByTaxa
Shellscript: sortbytaxa.sh
Description: Sorts sequences into taxonomic order by some depth-first traversal of the Tree of Life as defined by NCBI taxdump. Sequences must be labeled with taxonomic identifiers.
Notes: Singlethreaded, high memory.

Name: tax.SplitByTaxa
Shellscript: splitbytaxa.sh
Description: Splits sequences according to their taxonomy, as determined by the sequence name. Sequences should be labeled with a gi number, NCBI taxID, or species name.
Notes: Multithreaded, high memory. If the number of threads is restricted and the sequences are fairly short, regardless of the total number, this may be run using low memory.

Name: tax.PrintTaxonomy
Shellscript: taxonomy.sh
Description: Prints the full taxonomy of a given taxonomic identifier (such as homo_sapiens).
Notes: Singlethreaded, low memory.

Name: tax.TaxTree
Shellscript: taxtree.sh
Description: Creates tree.taxtree from names.dmp and nodes.dmp, which are in NCBI taxdump. The taxtree file is needed for programs that can deal with taxonomy, like Seal and SortByTaxa.
Notes: Singlethreaded, high memory.

Name: driver.ReduceSilva
Shellscript: reducesilva.sh
Description: Reduces Silva entries down to one entry per specified taxonomic level. Designed to increase the efficiency of operations like mapping, in which having thousands of substrains represented is not helpful.
Notes: Singlethreaded, low memory.


Cross-Contamination

Name: jgi.SynthMDA
Shellscript: synthmda.sh
Description: Generates synthetic reads following an MDA-amplified single cell's coverage distribution. Designed for single-cell assembly and analysis optimization. See also CrossContaminate, RandomReads.
Notes: Singlethreaded, medium memory (needs around 4GB).

Name: jgi.CrossContaminate
Shellscript: crosscontaminate.sh
Description: Generates synthetic cross-contaminated files from clean files. Intended for use with synthetic reads generated by SynthMDA or RandomReads. Designed to evaluate the effects of cross-contamination on assembly, and the efficacy of decontamination methods.
Notes: Singlethreaded, high memory.

Name: jgi.DecontaminateByNormalization
Shellscript: decontaminate.sh, crossblock.sh
Description: Removes contaminant contigs from assemblies of multiplexed libraries via normalization and mapping.
Notes: Multithreaded, high memory. Mostly a wrapper for other programs like BBMap, BBNorm, and FilterByCoverage.


Deduplication and Clustering

Name: jgi.Dedupe
Shellscript: dedupe.sh
Description: Accepts one or more files containing sets of sequences (reads or scaffolds). Removes duplicate sequences, which may be specified to be exact matches, fully contained subsequences, or subsequences within some edit distance. Can also find overlapping sequences and group them into clusters based on transitive reachability; for example, clustering full-length 16S PacBio reads by species.
Notes: Multithreaded, high memory. This program has a jni mode which increases speed dramatically if an edit distance is used.
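
The two simplest duplicate classes named in the Dedupe description, exact matches and fully contained subsequences, can be sketched as below. This is a quadratic toy version for illustration, not Dedupe's kmer-seeded algorithm; it treats reverse complements as duplicates, ignores edit distance and clustering, and uses hypothetical function names.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dedupe(sequences, remove_contained: bool = True) -> list:
    """Drop exact duplicates (either strand) and, optionally, contained sequences."""
    # Longest first, so a containing sequence is kept before anything inside it.
    ordered = sorted({s.upper() for s in sequences}, key=lambda s: (-len(s), s))
    kept = []
    for seq in ordered:
        rc = reverse_complement(seq)
        if remove_contained:
            duplicate = any(seq in k or rc in k for k in kept)
        else:
            duplicate = any(seq == k or rc == k for k in kept)
        if not duplicate:
            kept.append(seq)
    return kept
```

Processing sequences from longest to shortest means each candidate only has to be checked against sequences that could contain it.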

Name: jgi.Dedupe2
Shellscript: dedupe2.sh
Description: Allows more kmer seeds than Dedupe. This will be automatically called by Dedupe if needed.
Notes: See Dedupe.

Name: jgi.DedupeByMapping
Shellscript: dedupebymapping.sh
Description: Removes duplicate reads or read pairs from a sam/bam file based on mapping coordinates. The sam file does not need to be sorted.
Notes: Singlethreaded, high memory.

Name: clump.Clumpify
Shellscript: clumpify.sh
Description: Rearranges unsorted reads into small clumps of reads, such that each clump shares a kmer, and thus probably overlaps. Can also create consensus sequences from these clumps.
Notes: Multithreaded, low or high memory. Memory consumption may be made arbitrarily small by using a user-specified number of temp files for bucket-sorting. By default, it will try to grab all available memory.


Read Merging

Name: jgi.BBMerge
Shellscript: bbmerge.sh, bbmerge-auto.sh
Description: Merges paired reads into single reads by overlap detection. With sufficient coverage, can also merge nonoverlapping reads by kmer extension.
Notes: Multithreaded, low memory. If kmers are used (for extension or error-correction), it will need much more memory, and the shellscript bbmerge-auto.sh should be used, which tries to acquire all available RAM. This program has a jni mode which increases speed by around 20%.
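
Merging by overlap detection, as named in the BBMerge description, can be sketched in its most basic form: reverse-complement read 2 and look for a suffix of read 1 that matches a prefix of it. This is a deliberately simplified illustration, not BBMerge's algorithm, which is not described in this document; the sketch ignores quality scores, does not handle inserts shorter than the read length, and its names and thresholds are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 8, max_mismatches: int = 0):
    """Merge a read pair if read1's tail overlaps read2's reverse complement."""
    r2 = reverse_complement(read2)
    # Try the longest candidate overlap first and accept the first good one.
    for overlap in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = r2[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            return read1 + r2[overlap:]
    return None  # no acceptable overlap: leave the pair unmerged
```

Returning None for pairs with no convincing overlap matters in practice, because a wrong merge produces a chimeric read, while an unmerged pair loses nothing.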

Name: jgi.MateReadsMT
Shellscript: bbmergegapped.sh
Description: Uses gapped kmers to merge nonoverlapping reads.
Notes: Deprecated; superseded by BBMerge.


Synthetic Read Generation and Benchmarking

Name: align2.RandomReads3
Shellscript: randomreads.sh
Description: Generates random synthetic reads from a reference genome, annotated with their genomic origin. Allows precise customization of things like insert size and synthetic mutation type, sizes, and rates. Read names are parsed by various other BBTools to grade accuracy.
Notes: Singlethreaded, high memory.

Name: jgi.FakeReads
Shellscript: bbfakereads.sh
Description: Generates fake read pairs from ends of contigs or single reads. Intended for use in generating a fake LMP library for scaffolding, using additional information like another assembly, or very long reads (like PacBio). This can also be accomplished with RandomReads.
Notes: Singlethreaded, low memory.

Name: align2.GradeSamFile
Shellscript: gradesam.sh
Description: Grades the accuracy of an aligner (such as BBMap) by parsing the output. The reads must be single-ended and annotated as though generated by RandomReads.
Notes: Singlethreaded, low memory.

Name: align2.MakeRocCurve
Shellscript: samtoroc.sh
Description: Creates an ROC plot (technically, true-positive versus false-positive) from a sam or bam file of mapped reads. The reads should be single-ended with headers generated by RandomReads.
Notes: Singlethreaded, low memory.

Name: jgi.AddAdapters
Shellscript: addadapters.sh
Description: Randomly adds adapters to a file, or grades a trimmed file. The input is a set of reads, paired or unpaired. The output is those same reads with adapter sequence replacing some of the bases in some reads. For paired reads, adapters are located in the same position in read1 and read2. This is designed for benchmarking adapter-trimming software (such as BBDuk), and evaluating methodology. Adapters can alternately be added by RandomReads, in which case insert size is used to determine where the adapters go.
Notes: Singlethreaded, low memory.

Name: jgi.GradeMergedReads
Shellscript: grademerge.sh
Description: Grades the accuracy of a read-merging program (such as BBMerge) by parsing the output. The reads must be annotated by their insert size. This can be done by generating them with RandomReads and renaming with RenameReads.
Notes: Singlethreaded, low memory.

Name: align2.PrintTime
Shellscript: printtime.sh
Description: Prints time elapsed since last called on the same file.
Notes: Singlethreaded, low memory.


16S, Primers, and Amplicons

Name: jgi.FindPrimers
Shellscript: msa.sh
Description: Aligns a query sequence to reference sequences. Outputs the best matching position per reference sequence. If there are multiple queries, only the best-matching query will be used. Designed to find primer binding sites in a sequence that may contain indels, such as a PacBio read, using a MultiStateAligner.
Notes: Singlethreaded, high memory. TODO: Could easily be made multithreaded using A_SampleMT.

Name: jgi.CutPrimers
Shellscript: cutprimers.sh
Description: Cuts out sequences corresponding to primers identified in sam files. Used in conjunction with FindPrimers (msa.sh).
Notes: Singlethreaded, low memory.

Name: jgi.IdentityMatrix
Shellscript: idmatrix.sh
Description: Generates an identity matrix via all-to-all alignment of sequences in a file. Intended for 16S or other amplicon analysis. See also CorrelateIdentity.
Notes: Multithreaded, high memory. Time complexity is O(N^2) with the number of reads.

Name: driver.CorrelateIdentity
Shellscript: matrixtocolumns.sh
Description: Transforms two matched identity matrices into 2-column format, one row per entry, one column per matrix. Designed for comparing different 16S subregions. See also IdentityMatrix, FindPrimers.
Notes: Singlethreaded, high memory. The actual amount of memory just depends on the matrix sizes.


Barcodes

Name: jgi.CountBarcodes
Shellscript: countbarcodes.sh
Description: Counts the number of reads with each barcode. Assumes read names have the barcode at the end.
Notes: Singlethreaded, low memory.

Name: jgi.CorrelateBarcodes
Shellscript: filterbarcodes.sh
Description: Filters barcodes by quality, and generates quality histograms. See also MergeBarcodes.
Notes: Singlethreaded, low memory.

Name: jgi.MergeBarcodes
Shellscript: mergebarcodes.sh
Description: Concatenates barcodes and barcode quality onto read names. Designed to analyze the effects of barcode quality on library misassignment. See also CorrelateBarcodes.
Notes: Singlethreaded, low memory.

Name: jgi.RemoveBadBarcodes
Shellscript: removebadbarcodes.sh
Description: Removes reads with improper barcodes - either with no barcode, or a barcode containing a degenerate base.
Notes: Singlethreaded, low memory. Mostly a test case for extending BBTool_ST.


Filtering and Demultiplexing

Name: jgi.DemuxByName
Shellscript: demuxbyname.sh
Description: Demultiplexes reads into multiple files based on their name, by matching a suffix or prefix.
Notes: Singlethreaded, low memory.

Name: jgi.FilterBySequence
Shellscript: filterbysequence.sh
Description: Filters reads by exact sequence match. Allows case-sensitive or insensitive matches, and reverse-complement matches or only forward matches.
Notes: Multithreaded, high memory.

Name: driver.FilterReadsByName
Shellscript: filterbyname.sh
Description: Filters reads by name. Allows substring matching, though that is much slower.
Notes: Singlethreaded, low memory.

Name: jgi.FilterReadsWithSubs
Shellscript: filtersubs.sh
Description: Filters a sam file to select only reads with substitution errors for bases with quality scores in a certain interval. Used for manually examining specific reads that may contain incorrectly calibrated quality scores.
Notes: Singlethreaded, low memory.

Name: jgi.GetReads
Shellscript: getreads.sh
Description: Fetches the reads with specified numeric IDs (unrelated to their names). The first read (or pair) in a file has ID 0, the second read (or pair) has ID 1, etc.
Notes: Singlethreaded, low memory.
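
The numeric ID scheme used by GetReads (0-based, with both reads of a pair sharing one ID) can be illustrated with a short sketch. It assumes paired input is interleaved, with each pair stored as two consecutive records; that layout and the function name are assumptions for illustration, not a description of getreads.sh internals.

```python
def get_reads(records, wanted_ids, paired: bool = False) -> list:
    """Select records by 0-based position; a pair counts as a single ID."""
    wanted = set(wanted_ids)
    step = 2 if paired else 1  # interleaved pairs occupy two consecutive records
    return [rec for index, rec in enumerate(records) if index // step in wanted]
```

For example, with interleaved pairs, ID 2 selects the fifth and sixth records in the file, whatever their names are.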

Name: driver.EstherFilter
Shellscript: estherfilter.sh
Description: BLASTs queries against reference, and filters out hits with scores less than 'cutoff'.
Notes: All the work is done by blastall, which dictates the performance characteristics.
jpayne@68
|
386
|
jpayne@68
|
387
|
jpayne@68
|
388 JGI-Exclusive Preprocessing Wrappers
|
jpayne@68
|
389
|
jpayne@68
|
390 Name: jgi.BBQC
|
jpayne@68
|
391 Shellscript: bbqc.sh
|
jpayne@68
|
392 Description: Wrapper for various read preprocessing operations.
|
jpayne@68
|
393 Notes: Deprecated; superceded by RQCFilter. Designed exclusively for Genepool and will not function elsewhere, so should not be distributed outside LBL.
|
jpayne@68
|
394
|
jpayne@68
|
395 Name: jgi.RQCFilter
|
jpayne@68
|
396 Shellscript: rqcfilter.sh
|
jpayne@68
|
397 Description: Acts as a wrapper/pipeline for read preprocessing. Performs quality-trimming, artifact removal, linker-trimming, adapter trimming, spike-in removal, vertebrate contaminant removal, microbial contaminant removal, and generates various histogram and statistics files used by RQC.
|
jpayne@68
|
398 Notes: Multithreaded, high memory. Currently requires 39500m of RAM and thus can run on a 40G node, but submitting it as an exclusive job is recommended, as all stages are fully multithreaded. Designed exclusively for Genepool and will not function elsewhere, so it should not be distributed outside LBL.
|
jpayne@68
|
399
|
jpayne@68
|
400
|
jpayne@68
|
401 Shredding and Sorting
|
jpayne@68
|
402
|
jpayne@68
|
403 Name: jgi.Shred
|
jpayne@68
|
404 Shellscript: shred.sh
|
jpayne@68
|
405 Description: Shreds long sequences into shorter sequences, with overlap length and variable-length options. See also Fuse.
|
jpayne@68
|
406 Notes: Singlethreaded, low memory.
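
A minimal Python sketch of fixed-length shredding with overlap is given below. It is not the Shred implementation, and the parameter names are assumptions chosen for illustration. Each piece starts (length - overlap) bases after the previous one.

```python
def shred(sequence, length, overlap=0):
    """Cut `sequence` into pieces of at most `length` bases sharing `overlap` bases."""
    if overlap >= length:
        raise ValueError("overlap must be smaller than length")
    step = length - overlap
    pieces = []
    for start in range(0, len(sequence), step):
        pieces.append(sequence[start:start + length])
        if start + length >= len(sequence):
            break  # the last piece has reached the end of the sequence
    return pieces


print(shred("ACGTACGTAC", 4))
print(shred("ACGTACGTAC", 4, overlap=2))
```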
|
jpayne@68
|
407
|
jpayne@68
|
408 Name: jgi.FuseSequence
|
jpayne@68
|
409 Shellscript: fuse.sh
|
jpayne@68
|
410 Description: Fuses sequences together, padding junctions with Ns. Does not support total length greater than 2 billion. Designed for use with Seal or BBDuk to make kmer tracking for a given genome more efficient. See also Shred.
|
jpayne@68
|
411 Notes: Singlethreaded, high memory.
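
Fusing with N padding can be sketched in a few lines of Python. This is an illustration rather than the FuseSequence code, and both helper names are hypothetical. The second helper shows the presumed benefit for kmer tracking: every kmer window that spans a junction contains at least one N, so a kmer counter that skips N-containing kmers never sees an artificial junction kmer.

```python
def fuse(sequences, pad):
    """Join sequences into one, separating neighbors with `pad` N bases."""
    return ("N" * pad).join(sequences)


def clean_kmers(sequence, k):
    """List the kmers of `sequence` that contain no N."""
    kmers = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [kmer for kmer in kmers if "N" not in kmer]


fused = fuse(["ACGT", "TTGA"], pad=2)
print(fused)
print(clean_kmers(fused, 3))
```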
|
jpayne@68
|
412
|
jpayne@68
|
413 Name: jgi.Shuffle
|
jpayne@68
|
414 Shellscript: shuffle.sh
|
jpayne@68
|
415 Description: Reorders reads randomly, keeping pairs together. Also supports some sorting operations, like alphabetically by name or by sequence.
|
jpayne@68
|
416 Notes: Singlethreaded, high memory. All operations are in-memory.
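
Shuffling while keeping pairs together amounts to shuffling pairs rather than individual reads. The Python sketch below is not the Shuffle implementation; it assumes interleaved input and uses a hypothetical function name. Because the whole list is held at once, it also shows why the operation is memory-bound.

```python
import random


def shuffle_pairs(records, seed=None):
    """Randomly reorder interleaved records, moving each pair as a unit."""
    pairs = [records[i:i + 2] for i in range(0, len(records), 2)]
    random.Random(seed).shuffle(pairs)  # all pairs are in memory at this point
    return [read for pair in pairs for read in pair]


records = ["a/1", "a/2", "b/1", "b/2", "c/1", "c/2"]
print(shuffle_pairs(records, seed=1))
```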
|
jpayne@68
|
417
|
jpayne@68
|
418
|
jpayne@68
|
419 Non-Sequence-Related
|
jpayne@68
|
420
|
jpayne@68
|
421 Name: Calcmem - Shellscript Only
|
jpayne@68
|
422 Shellscript: calcmem.sh
|
jpayne@68
|
423 Description: Calculates available memory for other shellscripts. Designed for Genepool but works fine on many Linux configurations.
|
jpayne@68
|
424 Notes: If java is being killed for allocating too much memory, this is the script to fix.
|
jpayne@68
|
425
|
jpayne@68
|
426 Name: fileIO.TextFile
|
jpayne@68
|
427 Shellscript: textfile.sh
|
jpayne@68
|
428 Description: Displays contents of a text file, optionally between a start and stop line. Useful mainly in Windows where there are few command-line utilities.
|
jpayne@68
|
429 Notes: Singlethreaded, low memory.
|
jpayne@68
|
430
|
jpayne@68
|
431 Name: driver.CountSharedLines
|
jpayne@68
|
432 Shellscript: countsharedlines.sh
|
jpayne@68
|
433 Description: Counts the number of lines shared between sets of files. One output file will be printed for each input file. For example, an output file for a file in the 'in1' set will contain one line per file in the 'in2' set, indicating how many lines are shared. This is not designed for sequence data, but for things like sequence names or organism names. See filterlines.sh for conventional filtering of shared lines.
|
jpayne@68
|
434 Notes: Singlethreaded, low memory.
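
The report layout described above (one report per 'in1' file, one count per 'in2' file) can be sketched in Python. This is not the CountSharedLines code; the function name is hypothetical, and counting each distinct line once is an assumption of the sketch.

```python
def count_shared_lines(in1, in2):
    """Map each file in `in1` to {file in `in2`: number of distinct shared lines}."""
    sets2 = {name: set(lines) for name, lines in in2.items()}
    report = {}
    for name1, lines1 in in1.items():
        distinct1 = set(lines1)
        report[name1] = {name2: len(distinct1 & lines2)
                         for name2, lines2 in sets2.items()}
    return report


in1 = {"a.txt": ["E. coli", "B. subtilis", "E. coli"]}
in2 = {"x.txt": ["E. coli"],
       "y.txt": ["B. subtilis", "E. coli", "S. aureus"]}
print(count_shared_lines(in1, in2))
```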
|
jpayne@68
|
435
|
jpayne@68
|
436 Name: driver.FilterLines
|
jpayne@68
|
437 Shellscript: filterlines.sh
|
jpayne@68
|
438 Description: Filters lines by exact match or substring. This is not designed for sequence data, but for things like sequence names or organism names.
|
jpayne@68
|
439 Notes: Singlethreaded, low memory.
|
jpayne@68
|
440
|
jpayne@68
|
441
|
jpayne@68
|
442 Other Tools
|
jpayne@68
|
443
|
jpayne@68
|
444 Name: jgi.A_SampleMT
|
jpayne@68
|
445 Shellscript: a_sample_mt.sh
|
jpayne@68
|
446 Description: Does nothing. Serves as a template for easily making new BBTools by dropping in code.
|
jpayne@68
|
447 Notes: Multithreaded, high memory. Be sure to modify the shellscript line "freeRam 4000m 84" as needed. The first value is the amount of memory to use if available memory cannot be calculated; the second is the percentage of free memory to use if it can be calculated.
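
The meaning of the two freeRam arguments can be illustrated with a small Python calculation. This is only a sketch of the rule as stated in the note; the function name is hypothetical, and the actual shellscript may apply further adjustments.

```python
def choose_memory_mb(fallback_mb, percent, free_mb=None):
    """Pick a memory size from the two freeRam-style arguments."""
    if free_mb is None:
        # Available memory could not be calculated: use the fixed fallback.
        return fallback_mb
    # Otherwise take the requested percentage of free memory.
    return free_mb * percent // 100


print(choose_memory_mb(4000, 84))                 # free memory unknown
print(choose_memory_mb(4000, 84, free_mb=10000))  # free memory known
```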
|
jpayne@68
|
448
|
jpayne@68
|
449 Name: jgi.BBMask
|
jpayne@68
|
450 Shellscript: bbmask.sh
|
jpayne@68
|
451 Description: Masks sequence regions that are low-complexity, contain repeat kmers, or are covered by mapped reads. Used to make masked versions of human, cat, dog, and mouse genomes; these are used for filtering vertebrate contamination from fungal/plant/microbial data without risk of false-positive removals.
|
jpayne@68
|
452 Notes: Multithreaded, high memory. Uses around 2 bytes per reference base.
|
jpayne@68
|
453
|
jpayne@68
|
454 Name: jgi.CalcTrueQuality
|
jpayne@68
|
455 Shellscript: calctruequality.sh
|
jpayne@68
|
456 Description: Generates matrices used for quality-score recalibration. Requires one or more mapped sam files as input. The actual recalibration is done with another program such as BBDuk.
|
jpayne@68
|
457 Notes: Multithreaded, low memory.
|
jpayne@68
|
458
|
jpayne@68
|
459 Name: jgi.MakeChimeras
|
jpayne@68
|
460 Shellscript: makechimeras.sh
|
jpayne@68
|
461 Description: Makes chimeric sequences by randomly fusing together nonchimeric sequences. Designed for analyzing chimera removal effectiveness.
|
jpayne@68
|
462 Notes: Singlethreaded, low memory.
|
jpayne@68
|
463
|
jpayne@68
|
464 Name: jgi.PhylipToFasta
|
jpayne@68
|
465 Shellscript: phylip2fasta.sh
|
jpayne@68
|
466 Description: Transforms interleaved phylip to fasta.
|
jpayne@68
|
467 Notes: Singlethreaded, high memory.
|
jpayne@68
|
468
|
jpayne@68
|
469 Name: jgi.MakeLengthHistogram
|
jpayne@68
|
470 Shellscript: readlength.sh
|
jpayne@68
|
471 Description: Makes a length histogram of sequences.
|
jpayne@68
|
472 Notes: Singlethreaded, low memory. Can also be accomplished with Reformat or BBDuk, but with less flexibility.
|
jpayne@68
|
473
|
jpayne@68
|
474 Name: jgi.ReformatReads
|
jpayne@68
|
475 Shellscript: reformat.sh
|
jpayne@68
|
476 Description: Reformats sequence data into another format, such as interleaved ASCII-33 fastq to twin-file ASCII-64. Also supports a huge collection of simple optional operations, like trimming, filtering, reverse-complementing, modifying read names, and modifying read sequence.
|
jpayne@68
|
477 Notes: Singlethreaded, low memory.
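
The example conversion named in the description (interleaved ASCII-33 to twin-file ASCII-64) involves two independent steps, sketched here in Python. This is not the ReformatReads code, and both function names are hypothetical. The quality step shifts each character by the difference between the two offsets; the de-interleaving step sends alternating records to two outputs.

```python
def convert_quality(quality, from_offset=33, to_offset=64):
    """Re-encode a quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - from_offset + to_offset) for c in quality)


def deinterleave(records):
    """Split an interleaved list (r1, r2, r1, r2, ...) into two lists."""
    return records[0::2], records[1::2]


print(convert_quality("!5I"))
print(deinterleave(["a/1", "a/2", "b/1", "b/2"]))
```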
|
jpayne@68
|
478
|
jpayne@68
|
479 Name: pacbio.RemoveAdapters2
|
jpayne@68
|
480 Shellscript: removesmartbell.sh
|
jpayne@68
|
481 Description: Detects or removes SMRTbell adapters from PacBio reads by aligning the adapter using a customized version of the MultiStateAligner.
|
jpayne@68
|
482 Notes: Multithreaded, low memory.
|
jpayne@68
|
483
|
jpayne@68
|
484 Name: jgi.RenameReads
|
jpayne@68
|
485 Shellscript: rename.sh
|
jpayne@68
|
486 Description: Renames reads according to some specified prefix. Can also rename by insert size or mapping location.
|
jpayne@68
|
487 Notes: Singlethreaded, low memory.
|
jpayne@68
|
488
|
jpayne@68
|
489 Name: jgi.SplitPairsAndSingles
|
jpayne@68
|
490 Shellscript: repair.sh, bbsplitpairs.sh
|
jpayne@68
|
491 Description: Separates paired reads into files of pairs and singletons by removing reads that are shorter than a minimum length or that have no mate. Can also reorder arbitrarily-ordered reads in files where the pairing order was desynchronized. See also Reformat's vint flag.
|
jpayne@68
|
492 Notes: Singlethreaded, low or high memory. All operations are low-memory except reordering arbitrarily disordered files, which is optional.
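
Why reordering arbitrarily disordered files needs more memory can be seen in a minimal Python sketch. This is not the SplitPairsAndSingles code; the record layout (name, mate number, sequence) and the function name are assumptions. Every read must wait in memory until its mate appears, and in the worst case that is half the file.

```python
def repair(reads):
    """Pair up reads by name; return (pairs, singletons)."""
    waiting = {}  # reads whose mate has not been seen yet
    pairs = []
    for name, mate, seq in reads:
        if name in waiting:
            other = waiting.pop(name)
            # Order the two mates so that read 1 comes before read 2.
            first, second = sorted([other, (mate, seq)])
            pairs.append((name, first[1], second[1]))
        else:
            waiting[name] = (mate, seq)
    singletons = [(name, seq) for name, (_, seq) in waiting.items()]
    return pairs, singletons


reads = [("r1", 1, "AAAA"), ("r2", 2, "CCCC"), ("r3", 1, "GGGG"),
         ("r1", 2, "TTTT"), ("r2", 1, "ACGT")]
print(repair(reads))
```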
|
jpayne@68
|
493
|
jpayne@68
|
494 Name: jgi.SplitNexteraLMP
|
jpayne@68
|
495 Shellscript: splitnextera.sh
|
jpayne@68
|
496 Description: Trims and splits Nextera LMP libraries into subsets based on linker orientation: LMP, fragment, unknown, and singleton.
|
jpayne@68
|
497 Notes: Singlethreaded, low memory. TODO: Should be reimplemented using A_SampleMT.
|
jpayne@68
|
498
|
jpayne@68
|
499 Name: jgi.SplitSamFile
|
jpayne@68
|
500 Shellscript: splitsam.sh
|
jpayne@68
|
501 Description: Splits a sam file into three files: plus-mapped reads, minus-mapped reads, and unmapped reads.
|
jpayne@68
|
502 Notes: Singlethreaded, low memory.
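
The three-way split can be derived from the FLAG field of each sam record: bit 0x4 marks an unmapped read and bit 0x10 marks a read mapped to the reverse (minus) strand. The Python sketch below illustrates that classification. It is not the SplitSamFile code; the function name is hypothetical, and dropping header lines is a simplification of the sketch.

```python
def split_sam(lines):
    """Group sam records into plus-mapped, minus-mapped, and unmapped."""
    groups = {"plus": [], "minus": [], "unmapped": []}
    for line in lines:
        if line.startswith("@"):
            continue  # header line; ignored in this sketch
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & 0x4:
            groups["unmapped"].append(line)
        elif flag & 0x10:
            groups["minus"].append(line)
        else:
            groups["plus"].append(line)
    return groups


sam = ["@HD\tVN:1.6", "r1\t0\tchr1\t100", "r2\t16\tchr1\t200", "r3\t4\t*\t0"]
print(split_sam(sam))
```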
|
jpayne@68
|
503
|
jpayne@68
|
504 Name: fileIO.FileFormat
|
jpayne@68
|
505 Shellscript: testformat.sh
|
jpayne@68
|
506 Description: Tests the format of a sequence-containing file. Determines format (fasta, fastq, etc.), quality encoding, compression type, interleaving, and read length. All BBTools use this to determine how to process a file.
|
jpayne@68
|
507 Notes: Singlethreaded, low memory.
|
jpayne@68
|
508
|
jpayne@68
|
509 Name: jgi.TranslateSixFrames
|
jpayne@68
|
510 Shellscript: translate6frames.sh
|
jpayne@68
|
511 Description: Translates nucleotide sequences to all 6 amino acid frames, or amino acids to a canonical nucleotide representation.
|
jpayne@68
|
512 Notes: Singlethreaded, low memory.
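
Six-frame translation reads the sequence at offsets 0, 1, and 2 on the forward strand and again on the reverse complement. The Python sketch below illustrates this with the standard genetic code. It is not the TranslateSixFrames code; the function names are hypothetical, and reporting unknown codons as X and dropping trailing partial codons are choices of the sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons not in the table become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return frames +1, +2, +3 followed by -1, -2, -3."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


print(six_frames("ATGGCCTAA"))
```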
|
jpayne@68
|
513
|
jpayne@68
|
514
|
jpayne@68
|
515 Template
|
jpayne@68
|
516
|
jpayne@68
|
517 Name:
|
jpayne@68
|
518 Shellscript:
|
jpayne@68
|
519 Description:
|
jpayne@68
|
520 Notes:
|