comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/bbcms.sh @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
parents
children
comparison
equal deleted inserted replaced
67:0e9998148a16 69:33d812a61356
1 #!/bin/bash
2
3 usage(){
4 echo "
5 Written by Brian Bushnell
6 Last modified September 20, 2022
7
8 Description: Error corrects reads and/or filters by depth, storing
9 kmer counts in a count-min sketch (a Bloom filter variant).
10 This uses a fixed amount of memory. The error-correction algorithm is taken
11 from Tadpole; with plenty of memory, the behavior is almost identical to
12 Tadpole. As the number of unique kmers in a dataset increases, the accuracy
13 decreases such that it will make fewer corrections. It is still capable
14 of making useful corrections far past the point where Tadpole would crash
15 by running out of memory, even with the prefilter flag. But if there is
16 sufficient memory to use Tadpole, then Tadpole is more desirable.
17
18 Because accuracy declines with an increasing number of unique kmers, it can
19 be useful with very large datasets to run this in 2 passes, with the first
20 pass for filtering only using a 2-bit filter with the flags tossjunk=t and
21 ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a
22 4-bit filter for the actual error correction.
23
24 Usage: bbcms.sh in=<input file> out=<output> outb=<reads failing filters>
25
26 Example of use in error correction:
27 bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge
28
29 Example of use in depth filtering:
30 bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4
31
32 Error correction and depth filtering can be done simultaneously.
33
34 File parameters:
35 in=<file> Primary input, or read 1 input.
36 in2=<file> Read 2 input if reads are in two files.
37 out=<file> Primary read output.
38 out2=<file> Read 2 output if reads are in two files.
39 outb=<file> (outbad/outlow) Output for reads failing mincount.
40 outb2=<file> (outbad2/outlow2) Read 2 output if reads are in two files.
41 extra=<file> Additional comma-delimited files for generating kmer counts.
42 ref=<file> If ref is set, then only files in the ref list will be used
43 for kmer counts, and the input files will NOT be used for
44 counts; they will just be filtered or corrected.
45 overwrite=t (ow) Set to false to force the program to abort rather than
46 overwrite an existing file.
47
48 Hashing parameters:
49 k=31 Kmer length, currently 1-31.
50 hashes=3 Number of hashes per kmer. Higher generally reduces
51 false positives at the expense of speed; rapidly
52 diminishing returns above 4.
53 ksmall= Optional sub-kmer length; setting to slightly lower than k
54 can improve memory efficiency by reducing the number of hashes
55 needed. e.g. 'k=31 ksmall=29 hashes=2' has better speed and
56 accuracy than 'k=31 hashes=3' when the filter is very full.
57 minprob=0.5 Ignore kmers with probability of being correct below this.
58 memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should
59 generally work; if the program crashes with an out of memory
60 error, set this lower. You may be able to increase accuracy
61 by setting it slightly higher.
62 cells= Option to set the number of cells manually. By default this
63 will be autoset to use all available memory. The only reason
64 to set this is to ensure deterministic output.
65 seed=0 This will change the hash function used. Useful if running
66 iteratively with a very full table. -1 uses a random seed.
67 symmetricwrite=t (sw) Increases counting accuracy for a slight speed penalty.
68 Could be slow on very low-complexity sequence.
69
70 Depth filtering parameters:
71 mincount=0 If positive, reads with kmer counts below mincount will
72 be discarded (sent to outb).
73 hcf=1.0 (highcountfraction) Fraction of kmers that must be at least
74 mincount to pass.
75 requireboth=t Require both reads in a pair to pass in order to go to out.
76 When true, if either read has a count below mincount, both
77 reads in the pair will go to outb. When false, reads will
78 only go to outb if both fail.
79 tossjunk=f Remove reads or pairs with outermost kmer depth below 2.
80 (Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)
81
82 Error correction parameters:
83 ecc=t Perform error correction.
84 bits= Bits used to store kmer counts; max count is 2^bits-1.
85 Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data;
86 2 or 4 are for huge, low-depth metagenomes that saturate the
87 bloom filter otherwise. Generally 4 bits is recommended for
88 error-correction and 2 bits is recommended for filtering only.
89 ecco=f Error-correct paired reads by overlap prior to kmer-counting.
90 merge=t Merge paired reads by overlap prior to kmer-counting, and
91 again prior to correction. Output will still be unmerged.
92 smooth=3 Remove spikes from kmer counts due to hash collisions.
93 The number is the max width of peaks to be smoothed; range is
94 0-3 (3 is most aggressive; 0 disables smoothing).
95 This also affects tossjunk.
96
97
98 Java Parameters:
99 -Xmx This will set Java's memory usage, overriding autodetection.
100 -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will
101 specify 200 megs. The max is typically 85% of physical memory.
102 -eoom This flag will cause the process to exit if an out-of-memory
103 exception occurs. Requires Java 8u92+.
104 -da Disable assertions.
105
106 Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
107 "
108 }
109
110 #This block allows symlinked shellscripts to correctly set classpath.
111 pushd . > /dev/null
112 DIR="${BASH_SOURCE[0]}"
113 while [ -h "$DIR" ]; do
114 cd "$(dirname "$DIR")"
115 DIR="$(readlink "$(basename "$DIR")")"
116 done
117 cd "$(dirname "$DIR")"
118 DIR="$(pwd)/"
119 popd > /dev/null
120
121 #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
122 CP="$DIR""current/"
123
124 z="-Xmx4g"
125 z2="-Xms4g"
126 set=0
127
128 if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
129 usage
130 exit
131 fi
132
133 calcXmx () {
134 source "$DIR""/calcmem.sh"
135 setEnvironment
136 parseXmx "$@"
137 if [[ $set == 1 ]]; then
138 return
139 fi
140 freeRam 4000m 84
141 z="-Xmx${RAM}m"
142 z2="-Xms${RAM}m"
143 }
144 calcXmx "$@"
145
146 bloomfilter() {
147 local CMD="java $EA $EOOM $z $z2 -cp $CP bloom.BloomFilterCorrectorWrapper $@"
148 echo $CMD >&2
149 eval $CMD
150 }
151
152 bloomfilter "$@"