comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/demuxbyname.sh @ 69:33d812a61356

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 17:55:14 -0400
67:0e9998148a16 69:33d812a61356
1 #!/bin/bash
2
3 usage(){
4 echo "
5 Written by Brian Bushnell
6 Last modified Jan 7, 2020
7
8 Description: Demultiplexes sequences into multiple files based on their names,
9 substrings of their names, or prefixes or suffixes of their names.
10 Allows unlimited output files while maintaining only a small number of open file handles.
11
12 Usage:
13 demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...>
14
15 Alternately:
16 demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f
17 This will demultiplex by the substring after the last whitespace.
18
19 demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t
20 This will demultiplex by the first 8 characters of read names.
21
22 demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f
23 This will split on colons, and use the last substring as the name; useful for
24 demuxing by barcode for Illumina headers in this format:
25 @A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT
26
27 in2 and out2 are for paired reads in twin files and are optional.
28 If input is paired and there is only one output file, it will be written interleaved.
29
30 File Parameters:
31 in=<file> Input file.
32 in2=<file> If input reads are paired in twin files, use in2 for the second file.
33 out=<file> Output files for reads with matched headers (must contain % symbol).
34 For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq.
35 If twin files for paired reads are desired, use the # symbol. For example,
36 out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, etc.
37 outu=<file> Output file for reads with unmatched headers.
38 stats=<file> Print statistics about how many reads went to each file.
39
40 Processing Modes (determine how to convert a read into a name):
41 prefixmode=t (pm) Match prefix of read header. If false, match suffix of read header.
42 prefixmode=f is equivalent to suffixmode=t.
43 barcode=f Parse barcodes from Illumina headers.
44 chrom=f For mapped sam files, make one file per chromosome (scaffold) using the rname.
45 header=f Use the entire sequence header.
46 delimiter= For prefix or suffix mode, specifying a delimiter will allow exact matches even if the length is variable.
47 This allows demultiplexing based on names that are found without specifying a list of names.
48 In suffix mode, for example, everything after the last delimiter will be used.
49 Normally the delimiter will be used as written and interpreted as a Java regular expression; for example, ':' or 'HISEQ'.
50 But there are some special delimiters which will be replaced by the symbol they name,
51 because they are reserved in some operating systems or cause other problems.
52 These are provided for convenience due to possible OS conflicts:
53 space, tab, whitespace, pound, greaterthan, lessthan, equals,
54 colon, semicolon, bang, and, quote, singlequote
55 These are provided because they interfere with Java regular expression syntax:
56 backslash, hat, dollar, dot, pipe, questionmark, star,
57 plus, openparen, closeparen, opensquare, opencurly
58 In other words, to match '.', you should set 'delimiter=dot'.
59 substring=f Names can be substrings of read headers. Substring mode is
60 slow if the list of names is large. Requires a list of names.
61
62 Other Processing Parameters:
63 column=-1 If positive, split the header on a delimiter and match that column (1-based).
64 For example, using this header:
65 NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC
66 You could demux by tile (11101) using 'delimiter=: column=5'
67 Column is 1-based (first column is 1).
68 If column is omitted when a delimiter is present, prefixmode
69 will use the first substring, and suffixmode will use the last substring.
70 names= List of strings (or files containing strings) to parse from read names.
71 If the names are in text files, there should be one name per line.
72 This is optional. If a list of names is provided, files will only be created for those names.
73 For example, 'prefixmode=t length=5' would create a file for every unique first 5 characters in read names,
74 and every read would be written to one of those files. But if there was additionally 'names=ABCDE,FGHIJ'
75 then at most 2 files would be created, and anything not matching those names would go to outu.
76 length=0 If positive, use a suffix or prefix of this length from the read name instead of or in addition to the list of names.
77 For example, you could create files based on the first 8 characters of read names.
78 hdist=0 Allow a hamming distance for demultiplexing barcodes. This requires a list of names (barcodes).
79 replace= Replace some characters in the output filenames. For example, replace=+-
80 would replace the + symbol in headers with the - symbol in filenames. So you could
81 match the name ACTGAGC+ATTAGAC in the header, but write to a file named ACTGAGC-ATTAGAC.
82
83 Buffering Parameters:
84 streams=4 Allow at most this many active streams. The actual number of open files
85 will be 1 greater than this if outu is set, and doubled if output
86 is paired and written in twin files instead of interleaved.
87 minreads=0 Don't create a file for fewer than this many reads; instead, send them to unknown.
88 This option will incur additional memory usage.
89
90 Common Parameters:
91 ow=t (overwrite) Overwrites files that already exist.
92 zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max).
93 int=auto (interleaved) Determines whether INPUT file is considered interleaved.
94 qin=auto ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
95 qout=auto ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
96
97
98 Java Parameters:
99 -Xmx This will set Java's memory usage, overriding autodetection.
100 -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
101 The max is typically 85% of physical memory.
102 -eoom This flag will cause the process to exit if an out-of-memory
103 exception occurs. Requires Java 8u92+.
104 -da Disable assertions.
105
106 Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
107 "
108 }
109
110 #This block allows symlinked shellscripts to correctly set classpath.
111 pushd . > /dev/null
112 DIR="${BASH_SOURCE[0]}"
113 while [ -h "$DIR" ]; do
114 cd "$(dirname "$DIR")"
115 DIR="$(readlink "$(basename "$DIR")")"
116 done
117 cd "$(dirname "$DIR")"
118 DIR="$(pwd)/"
119 popd > /dev/null
120
121 #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
122 CP="$DIR""current/"
123
124 z="-Xmx2g"
125 z2="-Xms2g"
126 set=0
127
128 if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
129 usage
130 exit
131 fi
132
133 calcXmx () {
134 source "$DIR""/calcmem.sh"
135 setEnvironment
136 parseXmx "$@"
137 if [[ $set == 1 ]]; then
138 return
139 fi
140 freeRam 3200m 84
141 z="-Xmx${RAM}m"
142 z2="-Xms${RAM}m"
143 }
144 calcXmx "$@"
145
146 function demuxbyname() {
147 local CMD="java $EA $EOOM $z $z2 -cp $CP jgi.DemuxByName2 $@"
148 echo $CMD >&2
149 eval $CMD
150 }
151
152 demuxbyname "$@"
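
The usage text above describes how a read header is turned into a demultiplexing name: by delimiter and column, by the first or last delimited substring, or by a fixed-length prefix or suffix. The sketch below illustrates those rules in plain Bash. It is not the jgi.DemuxByName2 implementation; the function names demux_key and demux_fixed are hypothetical, and the delimiter is assumed to be a single literal character rather than a Java regular expression.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the documented name-selection rules.
# They only mimic the behavior described in the usage text.

# demux_key <header> <delimiter> <column> <prefixmode t|f>
# Splits the header on a single literal delimiter character.
# column > 0 : return that column (1-based).
# otherwise  : prefixmode=t returns the first substring,
#              prefixmode=f returns the last substring.
demux_key() {
    local header="$1" delim="$2" column="$3" prefixmode="$4"
    local -a fields
    IFS="$delim" read -r -a fields <<< "$header"
    if (( column > 0 )); then
        printf '%s\n' "${fields[column-1]}"
    elif [[ $prefixmode == t ]]; then
        printf '%s\n' "${fields[0]}"
    else
        printf '%s\n' "${fields[${#fields[@]}-1]}"
    fi
}

# demux_fixed <header> <length> <prefixmode t|f>
# Returns the first (prefixmode=t) or last (prefixmode=f) <length> characters.
demux_fixed() {
    local header="$1" length="$2" prefixmode="$3"
    if [[ $prefixmode == t ]]; then
        printf '%s\n' "${header:0:length}"
    else
        printf '%s\n' "${header:${#header}-length}"
    fi
}

h='NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC'
demux_key   "$h" ':' 5  f   # like 'delimiter=: column=5'     -> 11101
demux_key   "$h" ':' -1 f   # like 'delimiter=: prefixmode=f' -> ACTGAGC+ATTAGAC
demux_fixed "$h" 8 t        # like 'length=8 prefixmode=t'    -> NB501886
```

With the example header from the usage text, the three calls select the tile, the barcode pair, and the first 8 characters of the read name respectively, which correspond to the values substituted for % in the out= pattern.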