comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/docs/UsageGuide.txt @ 68:5028fdace37b
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 16:23:26 -0400 |
BBMap/BBTools usage guide.
Last updated December 11, 2015

Table of Contents

System Requirements
Installation
Terminology Notes
Usage
    Standard Syntax
    Paired Reads
    Multiple Output and % Symbol
    File Formats and Extensions
    Piping and Screen Output
    Memory and Java Flags
    Threads
    Subprocesses
    Additional Help
Standard flags
    Flag Syntax
    Help Flags
    Config Files
    Input Flags
    Output Flags
    Sampling Flags
    Threading Flags
    Compression Flags
    Quality-Related Flags
    Length-Related Flags
    Histogram Flags
Advanced Flags
    Debugging and Benchmarking Flags
    Buffering and I/O Flags
    MPI and JNI Flags


System Requirements

BBTools is written in Java and requires Java 7 or higher for full functionality. It is tested on Oracle's JDK, not OpenJDK. Most tools will work with Java 6 (if not, they will throw a ClassNotFound exception), and most tools will work with OpenJDK, but if you experience a problem with Java 6 or OpenJDK it is recommended that you install Oracle's latest JDK. All operating systems that support Java are supported. Note that many of the tools require a substantial amount of memory, so the 64-bit JDK should be installed if you are using a 64-bit operating system. You can determine your current Java version like this:

java -Xmx90m -version
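
As a rough sketch of checking the "Java 7 or higher" requirement from a script, the helper below extracts the major version from a version string. The function name java_major is hypothetical, and the sketch assumes the two common numbering schemes ("1.N.x" up to Java 8, "N.x" afterwards).

```shell
# Hypothetical helper: print the major Java version for a version string.
# Assumes "1.N.x" numbering up to Java 8 and "N.x" numbering afterwards.
java_major() {
    case "$1" in
        1.*) v=${1#1.}; echo "${v%%[._]*}" ;;
        *)   echo "${1%%[._-]*}" ;;
    esac
}

java_major "1.7.0_80"
java_major "17.0.2"
```

The version string itself is the quoted value in the output of the command above; how to extract it depends on the Java distribution, so that step is left out here.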

Installation (for non-NERSC users; if you don't know what NERSC is, you are not a NERSC user)

BBTools can be installed by downloading the gzipped tar file from Sourceforge (http://sourceforge.net/projects/bbmap/files/latest/download) and decompressing it. In Linux, the command would be:

tar xvzf BBMap_35.74.tar.gz

Then, optionally, you can export the path of the shellscripts to your environment to make it easier to run. The Java code is already compiled and does not need recompilation. Source code is also available from bitbucket (https://bitbucket.org/berkeleylab/jgi-bbtools), but this is currently a private Berkeley account.
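
A minimal sketch of exporting the shellscript path, assuming the archive was unpacked in your home directory and created a directory named "bbmap" (both are assumptions; adjust the path to your system):

```shell
# Assumption: the tarball was unpacked in the home directory, creating "bbmap".
BBMAP_DIR="$HOME/bbmap"

# Prepend the shellscript directory so that e.g. "reformat.sh" resolves by name.
PATH="$BBMAP_DIR:$PATH"
export PATH
```

To make this permanent, the same lines can be placed in your shell's startup file.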

BBTools includes bash wrapper scripts to make the command lines shorter. The package also contains C code that can accelerate certain programs, and experimental MPI code that can make it difficult to compile BBTools on systems without MPI support. None of these is required, but if you have a C compiler you can accelerate BBMerge, Dedupe, and BBMap by following the instructions in /jni/README.txt.


Installation (for NERSC users)

BBTools is in Shifter, which can be used on Denovo and Cori with no installation. If you want to run the command "reformat.sh in=x.fq out=y.fq", the syntax would be:
shifter --image=bryce911/bbtools reformat.sh in=x.fq out=y.fq


Terminology Notes

"Read" in this file is used synonymously with "sequence", whether it is a contig in a fasta file or a short read produced by a sequencing platform. "Paired reads" or "pair" refer to 2 reads that are generated by sequencing both ends of a single fragment of DNA. These are typically delivered in two fastq files, named something like "read1.fastq.gz" and "read2.fastq.gz". The alternative is single-ended reads, in which only one end of the molecule is sequenced. When paired reads are available, it is important to always process them together, rather than, for example, mapping the read 1 file and the read 2 file in two separate processes.


Usage

Most BBTools use the same syntax and operate with a set of standard flags. Individual tools also have specific flags - for example, kmer-based tools support the flag "k" to specify the kmer length, and non-kmer-based tools don't. This guide describes the standard syntax and most common flags. Custom syntax and flags for a given tool are described in that tool's shellscript.


Standard Syntax

Most BBTools (such as Reformat or BBNorm) process genomic sequences in some fashion, and are executed like this:

reformat.sh in=reads.fq out=processed.fq

The shellscript allows autodetection of memory (in some cases) and classpath.
The above command is equivalent to this:

java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads in=reads.fq out=processed.fq

Note that “/path/to/bbmap/current/” needs to be replaced with an actual path. While the shellscript will only work in bash (or some other Linux/Unix/MacOS shells),
the full Java command will work on any environment with Java installed, such as Windows.

Tools that use a reference (such as BBMap, BBDuk, and Seal) will also need the additional flag "ref=":

bbmap.sh in=reads.fq out=mapped.sam ref=genome.fasta

In each of the above cases, the flags can be arranged in any order.


Paired Reads

Most BBTools support paired reads. These may be in two files, or interleaved in a single file, which BBTools will autodetect based on the read names. When the reads are in two files, you can use the in2 and out2 flags, like this:

reformat.sh in1=read1.fq in2=read2.fq out1=processed1.fq out2=processed2.fq

It is also possible to specify paired files like this:

reformat.sh in=read#.fq out=processed#.fq

...which is equivalent to the above command.

It is important to process paired files together in one command so that they are kept in the proper order. If you have dual input files and only 1 output file, the output will be written interleaved, and vice-versa. All tools that support paired reads will keep pairs together. For example, Reformat supports subsampling; if read 1 is discarded, read 2 will also be discarded. This prevents a loss of synchronization that corrupts the output.
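
As a quick sanity check before processing two paired files, their read counts can be compared. The sketch below is not part of BBTools; the helper name count_reads is hypothetical, and it assumes uncompressed fastq with exactly four lines per record. Equal counts are necessary for correct pairing but do not prove that the reads are in the same order.

```shell
# Hypothetical helper: count records in an uncompressed fastq file,
# assuming exactly four lines per record.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Build two tiny paired files in a temporary directory for demonstration.
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$tmp/read1.fq"
printf '@r1/2\nACGA\n+\nIIII\n@r2/2\nTCAA\n+\nIIII\n' > "$tmp/read2.fq"

if [ "$(count_reads "$tmp/read1.fq")" -eq "$(count_reads "$tmp/read2.fq")" ]; then
    echo "read counts match"
else
    echo "read counts differ: the files are not properly paired" >&2
fi
```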


Multiple Output and % Symbol

Some tools (such as Seal, BBSplit, BBMap, Dedupe) can use the % symbol as a wildcard, to be replaced by some other word when generating many files from a single input. It is recommended that the % symbol be avoided in filenames. As an example, assume you run Seal to bin some reads based on matching sequences in the fasta file "ref.fa", which contains the genomes of e.coli and salmonella:

seal.sh in=reads.fq pattern=out_%.fq ref=ref.fa

This would produce the output files "out_e.coli.fq" and "out_salmonella.fq".
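
The substitution can be pictured with the small sketch below. It only imitates the naming rule described above; the helper name fill_pattern is hypothetical, and the sketch assumes the pattern contains exactly one % symbol.

```shell
# Hypothetical helper: substitute a name for the "%" in a pattern.
# $1 = pattern containing one "%", $2 = name to insert.
fill_pattern() {
    pattern=$1
    name=$2
    echo "${pattern%%\%*}${name}${pattern#*\%}"
}

for name in e.coli salmonella; do
    fill_pattern "out_%.fq" "$name"
done
```

Run as written, this prints the two file names given in the Seal example.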


File Formats and Extensions

BBTools support most standard sequence formats, including fastq, fasta, fasta+qual, scarf, sam, and (if samtools is installed) bam. They also support gzip and (if bzip2 or pbzip2 is installed) bzip2. The tools are sensitive to file extensions. For example:

reformat.sh in=reads.fq.gz out=processed.fa

In this case, reformat will try to read a gzip-compressed fastq file and output an uncompressed fasta file. For BBMap, this means that it will output a sam file if you name the output ".sam", bam if you name it ".bam", fastq if you name it ".fastq", and so forth. BBTools are usually capable of autodetecting input format (for example, if you feed it a fasta file called "stuff.txt" it will be able to determine that it is in fasta format), but relying on this is not recommended. Also, it is possible to specify an extensionless name for an output file, in which case the default format is used; the default varies by tool.

List of supported file extensions:

Fastq: fastq, fq
Fasta: fasta, fa, fas, fna, ffn, frn, seq, fsa, faa
Bread: bread (BBMap internal format; deprecated)
Sam: sam
Bam: bam
Qual: qual (should be accompanied by a fasta file)
Scarf: scarf (an old Illumina format; input only)
Phylip: phylip (only supported by phylip2fasta; input only)
Text: txt (used for logs, stats, and histograms)
Header: header (use this extension to write read names only)
One Line: oneline (name <tab> sequence)
Sketch: sketch (for Sketch tools)

List of supported compression extensions:

Gzip: gzip, gz
Bzip2: bz2
Zip: zip
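
To illustrate how a file name maps to a format and a compression type under the lists above, here is a rough sketch. It is not BBTools code; the function name classify is hypothetical, and only a few of the listed formats are covered.

```shell
# Hypothetical helper: classify a file name using the extension lists above.
# Prints "<format> <compression>"; only a few formats are shown for brevity.
classify() {
    name=$1
    comp=none
    case "$name" in
        *.gz|*.gzip) comp=gzip;  name=${name%.*} ;;
        *.bz2)       comp=bzip2; name=${name%.*} ;;
        *.zip)       comp=zip;   name=${name%.*} ;;
    esac
    case "$name" in
        *.fastq|*.fq) fmt=fastq ;;
        *.fasta|*.fa|*.fas|*.fna|*.ffn|*.frn|*.seq|*.fsa|*.faa) fmt=fasta ;;
        *.sam) fmt=sam ;;
        *.bam) fmt=bam ;;
        *) fmt=unknown ;;
    esac
    echo "$fmt $comp"
}

classify reads.fq.gz
classify processed.fa
```

For the two names from the Reformat example this reports a gzip-compressed fastq input and an uncompressed fasta output.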


Piping and Screen Output

Most tools can accept input from stdin and write output to stdout, with notable exceptions being BBNorm and Tadpole in some processing modes, which require reading the input file multiple times. Piping works like this:

cat reads.fq.gz | reformat.sh in=stdin.fq.gz out=stdout.fa int=f > x.fa

Note that the extensions are added to stdin and stdout so that Reformat knows how to interpret the data; when piping, it cannot first autodetect the file format. Similarly, it cannot autodetect whether the reads are interleaved or not. So, "int=f" (equivalent to "interleaved=false") was added to force it to treat the data as single-ended.

By default, all tools write status information to stderr, not stdout. To capture a program’s screen output, do this:

reformat.sh in=a.fq out=b.fq 1>out.txt 2>err.txt

Or, to direct both to a single file:

reformat.sh in=a.fq out=b.fq 1>out.txt 2>&1
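
The effect of the two redirections can be tried without BBTools installed. In the sketch below, fake_tool is a hypothetical stand-in that writes sequence data to stdout (as a tool would with "out=stdout.fa") and a status line to stderr.

```shell
# Stand-in for a BBTools script (hypothetical): data on stdout, status on stderr.
fake_tool() {
    echo ">read1"
    echo "ACGT"
    echo "Reads processed: 1" >&2
}

tmp=$(mktemp -d)

# Separate files for each stream.
fake_tool 1>"$tmp/out.txt" 2>"$tmp/err.txt"

# Both streams in one file; the order matters, "2>&1" must come last.
fake_tool 1>"$tmp/both.txt" 2>&1

# Line counts: data only, status only, and both together.
wc -l < "$tmp/out.txt"
wc -l < "$tmp/err.txt"
wc -l < "$tmp/both.txt"
```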


Memory and Java Flags

There are two flags that are passed by the shellscripts directly to Java rather than to BBTools: "-Xmx" and "-da".
Unlike a C program, Java does not grow its virtual memory dynamically as needed. The amount of virtual memory must be specified up front, and it will immediately be grabbed; the physical memory used will only increase as needed. The shellscripts will try to autodetect memory and set it to an appropriate value, but sometimes this will need to be overridden (for example, if you are using a shared node and don't really need all the memory, or if not enough memory was allocated and the program crashed with a memory exception). To force a program to use 3 gigs of RAM, use the flag "-Xmx3g". For example:

reformat.sh in=reads.fq out=processed.fq -Xmx3g

That's the equivalent of:

java -ea -Xmx3g -cp /path/to/bbmap/current/ jgi.ReformatReads in=reads.fq out=processed.fq

The "-ea" flag means "enable assertions", which will make BBTools crash if they detect a problem. If you want to ignore the problem and force it to run anyway, you can use the "-da" flag. The -da flag may also increase speed slightly.
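
When overriding the autodetected value on a shared node, it can help to derive the flag from a memory budget. The sketch below does not reproduce the autodetection in the shellscripts; the helper name xmx_flag and the idea of using a percentage of a known total are assumptions made for illustration.

```shell
# Hypothetical helper: print an -Xmx flag for a share of a memory total.
# $1 = total memory in megabytes, $2 = percentage to give to Java.
# The percentage is an illustrative choice, not what the shellscripts use.
xmx_flag() {
    echo "-Xmx$(( $1 * $2 / 100 ))m"
}

xmx_flag 16000 50
```

The printed flag can then be appended to a command line in the same position as "-Xmx3g" in the example above.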


Threads

Most BBTools are multithreaded, and will automatically detect and use all available threads. This is usually desirable when you have exclusive use of a computer, but may not be on a shared node. The number of threads can be capped at X with the flag "t=X" (threads=X). The total CPU usage may still go higher, though, due to several factors:
1) Input and output are handled in separate threads; "t=X" only regulates the number of worker threads.
2) Java uses additional threads for garbage collection and other virtual machine tasks.
3) When subprocesses (such as pigz) are spawned, each one obeys the thread limit individually, so if you set "t=4" and the process spawns 3 pigz instances, you could still potentially use over 16 threads: 4 worker threads, 4 threads for each pigz process, plus other threads for the JVM and I/O.
If you have exclusive use of a computer, you don't need to worry about spawning too many threads; this is only an issue with regard to fairness on shared nodes.
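
The arithmetic in point 3 can be written out as a small sketch. The helper name thread_estimate is hypothetical; the result covers only worker and pigz threads, so JVM and I/O threads come on top of it.

```shell
# Rough estimate from point 3 above: worker threads plus one full
# set of threads per pigz process. JVM and I/O threads come on top.
# $1 = value of t, $2 = number of pigz processes spawned.
thread_estimate() {
    echo $(( $1 + $1 * $2 ))
}

thread_estimate 4 3   # prints 16
```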


Subprocesses

If they are installed, BBTools will automatically use samtools for sam<->bam conversion, and bzip2 or pbzip2 for processing bz2 files. They may use pigz to accelerate processing of gzipped files, depending on the number of threads available. This is generally fine on a standalone computer, but in some circumstances, depending on the cluster configuration, the scheduler may kill a process that spawns a subprocess for violating virtual memory limits (Amazon instances may do this). In that case, you can disable pigz support with the flags "pigz=f unpigz=f". Alternatively, you can force pigz to be used with "pigz=t unpigz=t". The default for those flags depends on the tool. Pigz processes will never be spawned unless the number of threads allowed is at least 3. It is also possible to spawn gzip instances instead of pigz instances, but this only gives a small speed increase over using Java for gzip processing.
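
To see which of these optional helper programs are available on a given machine, a check along the following lines can be used. This is a generic shell sketch, not something BBTools provides; the function name have is hypothetical.

```shell
# Hypothetical helper: succeed if a program with the given name is on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report the optional helper programs mentioned above.
for tool in samtools pigz pbzip2 bzip2; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found"
    fi
done
```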


Additional Help

There are many forum threads on SeqAnswers describing the usage of different BBTools, linked from this thread:
http://seqanswers.com/forums/showthread.php?t=41057


*Standard flags*

The flags below work with many or all BBTools that process reads, but the list is not complete because it does not include flags specific to only one or a few tools - those are listed in the shellscript. They are listed with their default setting, but some of the defaults differ between tools; the specific default is also listed in the shellscript. Where the description starts with something in parentheses, like "(in1)", that is an acceptable alternative version of the flag.


Flag Syntax

With the exception of certain special flags like help flags (--help, --version) and Java flags (-Xmx, -da), all flags are in the same format: “a=b” where “a” is the name of the flag, and is not case-sensitive, and “b” is the value, which is case-sensitive (for filenames). Flags may be in any order, and never need leading hyphens, except for those special flags mentioned above. If a flag is set twice, the later value will override the former; for example, “reformat.sh in=x.fq in=y.fq” will use y.fq as input. The special value “null” means “blank”. For example, “reformat.sh in=a.fq out=null” and “reformat.sh in=a.fq” are equivalent - neither will output anything.
For boolean variables, “null” is equivalent to “true”, and values may be abbreviated to “t” and “f”. So, for example, “overwrite=true” and “overwrite=t” are equivalent.
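
The rules above (case-insensitive names, case-sensitive values, later values overriding earlier ones) can be mimicked with the sketch below. It is an illustration only, not the BBTools parser; flag_name and flag_value are hypothetical helper names.

```shell
# Hypothetical helpers mirroring the "a=b" rules described above.
flag_name() {
    # Text before the first "=", lower-cased because names ignore case.
    printf '%s\n' "${1%%=*}" | tr '[:upper:]' '[:lower:]'
}
flag_value() {
    # Text after the first "=", left untouched because values keep their case.
    printf '%s\n' "${1#*=}"
}

# A later value overrides an earlier one for the same flag.
input=
for arg in in=x.fq IN=y.fq; do
    if [ "$(flag_name "$arg")" = "in" ]; then
        input=$(flag_value "$arg")
    fi
done
echo "$input"
```

As in the “reformat.sh in=x.fq in=y.fq” example, the loop ends with y.fq as the input.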

Help Flags

--help                  Print the usage information from the shellscript (when run from a shellscript). Alternately you can just look at the shellscript with a text editor.
--version               Print the version of BBTools.

Config Files

config=<file>           A file or a comma-delimited list of files. If this flag is present, the contents of the config file will be added to the command line. Config files must contain one argument per line. Config files are never required, but may be useful when a command line would be too long or when arguments contain whitespace. See readme_config.txt for more information.
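
A minimal sketch of a config file, assuming only what is stated above (one argument per line, added to the command line). The file name reformat.cfg and its contents are examples; readme_config.txt remains the reference for the exact rules.

```shell
# Write a config file with one argument per line (file names are examples).
tmp=$(mktemp -d)
cat > "$tmp/reformat.cfg" <<'EOF'
in=reads.fq
out=processed.fq
overwrite=t
EOF

# The command that would consume it; printed rather than run here.
echo "reformat.sh config=$tmp/reformat.cfg"
```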

Input Flags

in=<file>               (in1) Main input.
in2=<file>              Input for 2nd read of pairs in a different file.
interleaved=auto        (int) t/f overrides interleaved autodetection.
samplerate=1            Set lower to only process a fraction of input reads.
qfin=<.qual file>       Read qualities from this qual file, for the reads coming from 'in'.
qfin2=<.qual file>      Read qualities from this qual file, for the reads coming from 'in2'.
extin=                  Allows overriding of input file format. For example, "extin=.fq" would force the input to be read in fastq format regardless of the file name.
trimreaddescription=f   (trd) Trim the names of reads after the first whitespace.
touppercase=f           (tuc) Convert lowercase letters in reads to upper case.
lowercaseton=f          (lctn) Convert lowercase letters in reads to N.
utot=f                  Convert U bases to T.

Output Flags

out=<file>              (out1) Main output.
out2=<file>             Output for 2nd read of pairs in a different file.
qfout=<.qual file>      Write qualities to this qual file, for the reads going to 'out'.
qfout2=<.qual file>     Write qualities to this qual file, for the reads going to 'out2'.
extout=                 Allows overriding of output file format. For example, "extout=.fq" would force the output to be written in fastq format regardless of the file name.
fastawrap=70            Length of lines in fasta output.
overwrite=f             Allow overwriting of existing files.
append=f                Append to existing files.

Sampling Flags

reads=-1                Set to a positive number to only process this many input reads (or pairs), then quit.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling). A negative number will use a random seed.

Threading Flags

threads=auto            (t) Number of worker threads to spawn.

Compression Flags

ziplevel=2              (zl) Compression level for zip or gzip output; 1-9.
unpigz=                 Spawn a pigz process for faster decompression. Requires pigz to be installed. Valid values are t or f; the default varies by program.
pigz=                   Spawn a pigz process for faster compression. Requires pigz to be installed. Valid values are t, f, or a number; the default varies by program. "pigz=X" will enable pigz, and also force all pigz processes to use exactly X threads.

Quality-Related Flags

qin=auto                ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto               ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
qfake=30                Quality value used for fasta to fastq reformatting.
maxcalledquality=41     Cap quality values at this upper level.
mincalledquality=0      Raise quality values below this lower level to this level.
ignorebadquality        (ibq) Don't crash if quality values appear to be incorrect.
qtrim=f                 Enable or disable quality trimming. May be set to r, l, or rl to trim the right, left, or both sides.
trimq=                  Trim bases below this quality value.
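
The ASCII offsets used by qin and qout relate a quality character to its numeric value by simple subtraction: the value is the character's ASCII code minus the offset. A short sketch (the helper name phred is hypothetical):

```shell
# Hypothetical helper: numeric quality for one fastq quality character.
# $1 = character, $2 = ASCII offset (33 or 64).
phred() {
    code=$(printf '%d' "'$1")   # ASCII code of the character
    echo $(( code - $2 ))
}

phred I 33   # "I" is ASCII 73, so this prints 40
phred h 64   # "h" is ASCII 104, so this also prints 40
```

The same character therefore means very different qualities under the two offsets, which is why a wrong qin setting can make quality values appear incorrect.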

Length-Related Flags

fastareadlen=           Fasta sequences longer than this are broken into subsequences of at most this length, and given a suffix such as _part_1. Only works with fasta files; generally designed for mapping very long sequences with BBMap.
fastaminlen=            Discard fasta sequences shorter than this.
maxlen=                 Has different meanings depending on the program. For BBMap, reads longer than this will be broken to pieces this length. For most other programs, it acts as a filter.
minlen=                 Has different meanings depending on the program. Typically, reads shorter than this will be discarded.
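
To estimate how many pieces fastareadlen produces for a long sequence, a ceiling division can be used. This sketch assumes the pieces do not overlap and that every piece except possibly the last has the full length; the helper name num_parts is hypothetical.

```shell
# Hypothetical helper: number of pieces for a sequence of length $1
# split at a maximum piece length of $2 (ceiling division).
num_parts() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

num_parts 1200 500   # prints 3
```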

Histogram Flags

bhist=<file>            Base composition histogram by position.
qhist=<file>            Quality histogram by position.
qchist=<file>           Count of bases with each quality value.
aqhist=<file>           Histogram of average read quality.
bqhist=<file>           Quality histogram designed for box plots.
lhist=<file>            Read length histogram.
gchist=<file>           Read GC content histogram.
gcbins=100              Number of gchist bins. Set to 'auto' to use read length.


*Advanced Flags*

Debugging and Benchmarking Flags

verbose=f               Print status messages for debugging.
parsecustom=f           Parse synthetic read names for custom data stored by RandomReads.

Buffering and I/O Flags

readbufferlength=200    Number of reads to store per ListNum. A ListNum is the smallest unit of work sent to a worker thread.
readbuffers=            Number of ListNums to store in the queue waiting for worker threads. The default is 150% of the number of threads.
bf1=                    Set to true to force ByteFile1 to be used for reading files.
bf2=                    Set to true to force ByteFile2 to be used for reading files (faster).

MPI and JNI Flags

usejni=f                Set to true to enable JNI-accelerated versions of BBMerge, BBMap, and Dedupe. Requires the C code to be compiled.
mpi=0                   Inform the program of how many MPI processes will be used. Most programs are not currently MPI-capable.
crismpi=f               Use an MPI version of ConcurrentReadInputStreams.
mpikeepall=t            If using MPI, send all reads to all processes.