#!/bin/bash

usage(){
echo "
Written by Brian Bushnell
Last modified Jan 7, 2020

Description:  Demultiplexes sequences into multiple files based on their names,
substrings of their names, or prefixes or suffixes of their names.
Allows unlimited output files while maintaining only a small number of open file handles.

Usage:
demuxbyname.sh in= in2= out= out2= names=

Alternately:
demuxbyname.sh in= out= delimiter=whitespace prefixmode=f
This will demultiplex by the substring after the last whitespace.

demuxbyname.sh in= out= length=8 prefixmode=t
This will demultiplex by the first 8 characters of read names.

demuxbyname.sh in= out= delimiter=: prefixmode=f
This will split on colons, and use the last substring as the name; useful for
demuxing by barcode for Illumina headers in this format:
@A00178:73:HH7H3DSXX:4:1101:13666:1047 1:N:0:ACGTTGGT+TGACGCAT

in2 and out2 are for paired reads in twin files and are optional.
If input is paired and there is only one output file, it will be written interleaved.

File Parameters:
in=             Input file.
in2=            If input reads are paired in twin files, use in2 for the second file.
out=            Output files for reads with matched headers (must contain % symbol).
                For example, out=out_%.fq with names XX and YY would create out_XX.fq and out_YY.fq.
                If twin files for paired reads are desired, use the # symbol.  For example,
                out=out_%_#.fq in this case would create out_XX_1.fq, out_XX_2.fq, out_YY_1.fq, etc.
outu=           Output file for reads with unmatched headers.
stats=          Print statistics about how many reads went to each file.

Processing Modes (determines how to convert a read into a name):
prefixmode=t    (pm) Match prefix of read header.  If false, match suffix of read header.
                prefixmode=f is equivalent to suffixmode=t.
barcode=f       Parse barcodes from Illumina headers.
chrom=f         For mapped sam files, make one file per chromosome (scaffold) using the rname.
header=f        Use the entire sequence header.
delimiter=      For prefix or suffix mode, specifying a delimiter will allow exact matches even if the length is variable.
                This allows demultiplexing based on names that are found without specifying a list of names.
                In suffix mode, for example, everything after the last delimiter will be used.
                Normally the delimiter will be used as a literal string (a Java regular expression); for example, ':' or 'HISEQ'.
                But there are some special delimiters which will be replaced by the symbol they name,
                because they are reserved in some operating systems or cause other problems.
                These are provided for convenience due to possible OS conflicts:
                   space, tab, whitespace, pound, greaterthan, lessthan, equals,
                   colon, semicolon, bang, and, quote, singlequote
                These are provided because they interfere with Java regular expression syntax:
                   backslash, hat, dollar, dot, pipe, questionmark, star,
                   plus, openparen, closeparen, opensquare, opencurly
                In other words, to match '.', you should set 'delimiter=dot'.
substring=f     Names can be substrings of read headers.  Substring mode is
                slow if the list of names is large.  Requires a list of names.

Other Processing Parameters:
column=-1       If positive, split the header on a delimiter and match that column (1-based).
                For example, using this header:
                NB501886:61:HL3GMAFXX:1:11101:10717:1140 1:N:0:ACTGAGC+ATTAGAC
                You could demux by tile (11101) using 'delimiter=: column=5'
                Column is 1-based (first column is 1).
                If column is omitted when a delimiter is present, prefixmode
                will use the first substring, and suffixmode will use the last substring.
names=          List of strings (or files containing strings) to parse from read names.
                If the names are in text files, there should be one name per line.
                This is optional.  If a list of names is provided, files will only be created for those names.
                For example, 'prefixmode=t length=5' would create a file for every unique first 5 characters in read names,
                and every read would be written to one of those files.  But if there were additionally 'names=ABCDE,FGHIJ'
                then at most 2 files would be created, and anything not matching those names would go to outu.
length=0        If positive, use a suffix or prefix of this length from read name instead of or in addition to the list of names.
                For example, you could create files based on the first 8 characters of read names.
hdist=0         Allow a hamming distance for demultiplexing barcodes.  This requires a list of names (barcodes).
replace=        Replace some characters in the output filenames.  For example, replace=+-
                would replace the + symbol in headers with the - symbol in filenames.  So you could
                match the name ACTGAGC+ATTAGAC in the header, but write to a file named ACTGAGC-ATTAGAC.

Buffering Parameters:
streams=4       Allow at most this many active streams.  The actual number of open files
                will be 1 greater than this if outu is set, and doubled if output
                is paired and written in twin files instead of interleaved.
minreads=0      Don't create a file for fewer than this many reads; instead, send them to unknown.
                This option will incur additional memory usage.

Common parameters:
ow=t            (overwrite) Overwrites files that already exist.
zl=4            (ziplevel) Set compression level, 1 (low) to 9 (max).
int=auto        (interleaved) Determines whether INPUT file is considered interleaved.
qin=auto        ASCII offset for input quality.  May be 33 (Sanger), 64 (Illumina), or auto.
qout=auto       ASCII offset for output quality.  May be 33 (Sanger), 64 (Illumina), or auto (same as input).


Java Parameters:
-Xmx            This will set Java's memory usage, overriding autodetection.
                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
                The max is typically 85% of physical memory.
-eoom           This flag will cause the process to exit if an out-of-memory
                exception occurs.  Requires Java 8u92+.
-da             Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
"
}

#This block allows symlinked shellscripts to correctly set classpath.
pushd . > /dev/null
DIR="${BASH_SOURCE[0]}"
while [ -h "$DIR" ]; do
  cd "$(dirname "$DIR")"
  DIR="$(readlink "$(basename "$DIR")")"
done
cd "$(dirname "$DIR")"
DIR="$(pwd)/"
popd > /dev/null

#DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )/"
CP="$DIR""current/"

z="-Xmx2g"
z2="-Xms2g"
set=0

if [ -z "$1" ] || [[ $1 == -h ]] || [[ $1 == --help ]]; then
	usage
	exit
fi

calcXmx () {
	source "$DIR""/calcmem.sh"
	setEnvironment
	parseXmx "$@"
	if [[ $set == 1 ]]; then
		return
	fi
	freeRam 3200m 84
	z="-Xmx${RAM}m"
	z2="-Xms${RAM}m"
}
calcXmx "$@"

function demuxbyname() {
	local CMD="java $EA $EOOM $z $z2 -cp $CP jgi.DemuxByName2 $@"
	echo $CMD >&2
	eval $CMD
}

demuxbyname "$@"
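
# Example invocations (illustrative, untested sketches; nothing below is executed).
# They combine only flags documented in usage() above. The file names reads.fq,
# r1.fq, r2.fq, and barcodes.txt are placeholders, not files shipped with this script.
#
# Demultiplex by Illumina barcode against a known list (assumed here to be one
# barcode per line in barcodes.txt), allowing one mismatch; unmatched reads go to outu:
#   demuxbyname.sh in=reads.fq out=out_%.fq outu=unmatched.fq barcode=t names=barcodes.txt hdist=1
#
# Demultiplex by tile, taking the 5th colon-delimited column of the header:
#   demuxbyname.sh in=reads.fq out=tile_%.fq delimiter=colon column=5
#
# Paired twin files, keyed on the text after the last colon, writing the + in
# dual-index barcodes as - in the output file names:
#   demuxbyname.sh in=r1.fq in2=r2.fq out=out_%_#.fq delimiter=colon prefixmode=f replace=+-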