comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/opt/bbmap-39.01-1/demuxbyname.sh @ 69:33d812a61356
planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author | jpayne |
---|---|
date | Tue, 18 Mar 2025 17:55:14 -0400 |
parents | |
children |
67:0e9998148a16 | 69:33d812a61356 |
---|---|
1 #!/bin/bash | |
2 | |
3 usage(){ | |
4 echo " | |
5 Written by Brian Bushnell | |
6 Last modified Jan 7, 2020 | |
7 | |
8 Description: Demultiplexes sequences into multiple files based on their names, | |
9 substrings of their names, or prefixes or suffixes of their names. | |
10 Allows unlimited output files while maintaining only a small number of open file handles. | |
11 | |
12 Usage: | |
13 demuxbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string...> | |
14 | |
15 Alternately: | |
16 demuxbyname.sh in=<file> out=<outfile> delimiter=whitespace prefixmode=f | |
17 This will demultiplex by the substring after the last whitespace. | |
18 | |
19 demuxbyname.sh in=<file> out=<outfile> length=8 prefixmode=t | |
20 This will demultiplex by the first 8 characters of read names. | |
21 | |
22 demuxbyname.sh in=<file> out=<outfile> delimiter=: prefixmode=f | |
23 This will split on colons, and use the last substring as the name; useful for | |
24 demuxing by barcode for Illumina headers in this format: | |
25 @A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT | |
26 | |
27 in2 and out2 are for paired reads in twin files and are optional. | |
28 If input is paired and there is only one output file, it will be written interleaved. | |
29 | |
30 File Parameters: | |
31 in=<file> Input file. | |
32 in2=<file> If input reads are paired in twin files, use in2 for the second file. | |
33 out=<file> Output files for reads with matched headers (must contain % symbol). | |
34 For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq. | |
35 If twin files for paired reads are desired, use the # symbol. For example, | |
36 out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, etc. | |
37 outu=<file> Output file for reads with unmatched headers. | |
38 stats=<file> Print statistics about how many reads went to each file. | |
39 | |
40 Processing Modes (determines how to convert a read into a name): | |
41 prefixmode=t (pm) Match prefix of read header. If false, match suffix of read header. | |
42 prefixmode=f is equivalent to suffixmode=t. | |
43 barcode=f Parse barcodes from Illumina headers. | |
44 chrom=f For mapped sam files, make one file per chromosome (scaffold) using the rname. | |
45 header=f Use the entire sequence header. | |
46 delimiter= For prefix or suffix mode, specifying a delimiter will allow exact matches even if the length is variable. | |
47 This allows demultiplexing based on names that are found without specifying a list of names. | |
48 In suffix mode, for example, everything after the last delimiter will be used. | |
49 Normally the delimiter will be used as a literal string (a Java regular expression); for example, ':' or 'HISEQ'. | |
50 But there are some special delimiters which will be replaced by the symbol they name, | |
51 because they are reserved in some operating systems or cause other problems. | |
52 These are provided for convenience due to possible OS conflicts: | |
53 space, tab, whitespace, pound, greaterthan, lessthan, equals, | |
54 colon, semicolon, bang, and, quote, singlequote | |
55 These are provided because they interfere with Java regular expression syntax: | |
56 backslash, hat, dollar, dot, pipe, questionmark, star, | |
57 plus, openparen, closeparen, opensquare, opencurly | |
58 In other words, to match '.', you should set 'delimiter=dot'. | |
59 substring=f Names can be substrings of read headers. Substring mode is | |
60 slow if the list of names is large. Requires a list of names. | |
61 | |
62 Other Processing Parameters: | |
63 column=-1 If positive, split the header on a delimiter and match that column (1-based). | |
64 For example, using this header: | |
65 NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC | |
66 You could demux by tile (11101) using 'delimiter=: column=5' | |
67 Column is 1-based (first column is 1). | |
68 If column is omitted when a delimiter is present, prefixmode | |
69 will use the first substring, and suffixmode will use the last substring. | |
70 names= List of strings (or files containing strings) to parse from read names. | |
71 If the names are in text files, there should be one name per line. | |
72 This is optional. If a list of names is provided, files will only be created for those names. | |
73 For example, 'prefixmode=t length=5' would create a file for every unique first 5 characters in read names, | |
74 and every read would be written to one of those files. But if there was additionally 'names=ABCDE,FGHIJ' | |
75 then at most 2 files would be created, and anything not matching those names would go to outu. | |
76 length=0 If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names. | |
77 For example, you could create files based on the first 8 characters of read names. | |
78 hdist=0 Allow a hamming distance for demultiplexing barcodes. This requires a list of names (barcodes). | |
79 replace= Replace some characters in the output filenames. For example, replace=+- | |
80 would replace the + symbol in headers with the - symbol in filenames. So you could | |
81 match the name ACTGAGC+ATTAGAC in the header, but write to a file named ACTGAGC-ATTAGAC. | |
82 | |
83 Buffering Parameters: | |
84 streams=4 Allow at most this many active streams. The actual number of open files | |
85 will be 1 greater than this if outu is set, and doubled if output | |
86 is paired and written in twin files instead of interleaved. | |
87 minreads=0 Don't create a file for fewer than this many reads; instead, send them to unknown. | |
88 This option will incur additional memory usage. | |
89 | |
90 Common parameters: | |
91 ow=t (overwrite) Overwrites files that already exist. | |
92 zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max). | |
93 int=auto (interleaved) Determines whether INPUT file is considered interleaved. | |
94 qin=auto ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto. | |
95 qout=auto ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input). | |
96 | |
97 | |
98 Java Parameters: | |
99 -Xmx This will set Java's memory usage, overriding autodetection. | |
100 -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. | |
101 The max is typically 85% of physical memory. | |
102 -eoom This flag will cause the process to exit if an out-of-memory | |
103 exception occurs. Requires Java 8u92+. | |
104 -da Disable assertions. | |
105 | |
106 Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems. | |
107 " | |
108 } | |
109 | |
110 #This block allows symlinked shellscripts to correctly set classpath. | |
111 pushd . > /dev/null | |
112 DIR="${BASH_SOURCE[0]}" | |
113 while [ -h "$DIR" ]; do | |
114 cd "$(dirname "$DIR")" | |
115 DIR="$(readlink "$(basename "$DIR")")" | |
116 done | |
117 cd "$(dirname "$DIR")" | |
118 DIR="$(pwd)/" | |
119 popd > /dev/null | |
120 | |
121 #DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/" | |
122 CP="$DIR""current/" | |
123 | |
124 z="-Xmx2g" | |
125 z2="-Xms2g" | |
126 set=0 | |
127 | |
128 if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then | |
129 usage | |
130 exit | |
131 fi | |
132 | |
133 calcXmx () { | |
134 source "$DIR""/calcmem.sh" | |
135 setEnvironment | |
136 parseXmx "$@" | |
137 if [[ $set == 1 ]]; then | |
138 return | |
139 fi | |
140 freeRam 3200m 84 | |
141 z="-Xmx${RAM}m" | |
142 z2="-Xms${RAM}m" | |
143 } | |
144 calcXmx "$@" | |
145 | |
146 function demuxbyname() { | |
147 local CMD="java $EA $EOOM $z $z2 -cp $CP jgi.DemuxByName2 $@" | |
148 echo $CMD >&2 | |
149 eval $CMD | |
150 } | |
151 | |
152 demuxbyname "$@" |
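
The name-extraction modes described in the usage text above can be approximated in plain bash. This is a minimal sketch inferred from the help text alone, not the logic of jgi.DemuxByName2. All function names (suffix_name, prefix_len, column_name, replace_chars, out_path, hamming) are hypothetical helpers introduced for illustration.

```bash
#!/bin/bash
# Illustrative sketch only: mimics how the help text describes turning a
# read header into a demultiplexing name. All helpers are hypothetical.

# delimiter=<d> prefixmode=f: everything after the last delimiter.
suffix_name() { printf '%s\n' "${1##*"$2"}"; }

# length=<n> prefixmode=t: the first n characters of the header.
prefix_len() { printf '%s\n' "${1:0:$2}"; }

# delimiter=<d> column=<n>: split on a single-character delimiter and
# return the n-th field (1-based, as the help text states).
column_name() {
  local IFS="$2" fields
  read -r -a fields <<< "$1"
  printf '%s\n' "${fields[$(( $3 - 1 ))]}"
}

# replace=<from><to>: swap one character for another in the file name.
replace_chars() { printf '%s\n' "${1//"$2"/$3}"; }

# out=<template>: substitute % with the name and # with the mate number.
out_path() {
  local out
  out=${1//"%"/$2}
  out=${out//"#"/$3}
  printf '%s\n' "$out"
}

# hdist=<n>: Hamming distance between two equal-length barcodes.
# Returns status 1 (and prints nothing) if the lengths differ.
hamming() {
  local a=$1 b=$2 i d=0
  [ "${#a}" -eq "${#b}" ] || return 1
  for (( i = 0; i < ${#a}; i++ )); do
    [ "${a:i:1}" = "${b:i:1}" ] || d=$(( d + 1 ))
  done
  printf '%s\n' "$d"
}

# Header taken from the usage text above.
header='NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC'

name=$(suffix_name "$header" ':')   # ACTGAGC+ATTAGAC
column_name "$header" ':' 5         # prints 11101 (the tile)
prefix_len "$header" 8              # prints NB501886
safe=$(replace_chars "$name" '+' '-')
out_path 'out_%_#.fq' "$safe" 1     # prints out_ACTGAGC-ATTAGAC_1.fq
hamming ACTGAGC ACTGAGG             # prints 1
```

The patterns inside the substitutions are quoted so that characters such as %, # and + are matched literally rather than treated as pattern or anchor syntax.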