comparison CSP2/CSP2_env/env-d9b9114564458d9d-741b3de822f2aaca6c6caa4325c4afce/share/man/man1/samtools.1 @ 68:5028fdace37b

planemo upload commit 2e9511a184a1ca667c7be0c6321a36dc4e3d116d
author jpayne
date Tue, 18 Mar 2025 16:23:26 -0400
67:0e9998148a16 68:5028fdace37b
1 '\" t
2 .TH samtools 1 "21 June 2017" "samtools-1.5" "Bioinformatics tools"
3 .SH NAME
4 samtools \- Utilities for the Sequence Alignment/Map (SAM) format
5 .\"
6 .\" Copyright (C) 2008-2011, 2013-2017 Genome Research Ltd.
7 .\" Portions copyright (C) 2010, 2011 Broad Institute.
8 .\"
9 .\" Author: Heng Li <lh3@sanger.ac.uk>
10 .\" Author: Joshua C. Randall <jcrandall@alum.mit.edu>
11 .\"
12 .\" Permission is hereby granted, free of charge, to any person obtaining a
13 .\" copy of this software and associated documentation files (the "Software"),
14 .\" to deal in the Software without restriction, including without limitation
15 .\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
16 .\" and/or sell copies of the Software, and to permit persons to whom the
17 .\" Software is furnished to do so, subject to the following conditions:
18 .\"
19 .\" The above copyright notice and this permission notice shall be included in
20 .\" all copies or substantial portions of the Software.
21 .\"
22 .\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
23 .\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
24 .\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
25 .\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
26 .\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
27 .\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
28 .\" DEALINGS IN THE SOFTWARE.
29 .
30 .\" For code blocks and examples (cf groff's Ultrix-specific man macros)
31 .de EX
32
33 . in +\\$1
34 . nf
35 . ft CR
36 ..
37 .de EE
38 . ft
39 . fi
40 . in
41
42 ..
43 .
44 .SH SYNOPSIS
45 .PP
46 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
47 .PP
48 samtools sort -T /tmp/aln.sorted -o aln.sorted.bam aln.bam
49 .PP
50 samtools index aln.sorted.bam
51 .PP
52 samtools idxstats aln.sorted.bam
53 .PP
54 samtools flagstat aln.sorted.bam
55 .PP
56 samtools stats aln.sorted.bam
57 .PP
58 samtools bedcov aln.sorted.bam
59 .PP
60 samtools depth aln.sorted.bam
61 .PP
62 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
63 .PP
64 samtools merge out.bam in1.bam in2.bam in3.bam
65 .PP
66 samtools faidx ref.fasta
67 .PP
68 samtools tview aln.sorted.bam ref.fasta
69 .PP
70 samtools split merged.bam
71 .PP
72 samtools quickcheck in1.bam in2.cram
73 .PP
74 samtools dict -a GRCh38 -s "Homo sapiens" ref.fasta
75 .PP
76 samtools fixmate in.namesorted.sam out.bam
77 .PP
78 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
79 .PP
80 samtools flags PAIRED,UNMAP,MUNMAP
81 .PP
82 samtools fastq input.bam > output.fastq
83 .PP
84 samtools fasta input.bam > output.fasta
85 .PP
86 samtools addreplacerg -r 'ID:fish' -r 'LB:1334' -r 'SM:alpha' -o output.bam input.bam
87 .PP
88 samtools collate aln.sorted.bam aln.name_collated.bam
89 .PP
90 samtools depad input.bam
91
92 .SH DESCRIPTION
93 .PP
94 Samtools is a set of utilities that manipulate alignments in the BAM
95 format. It imports from and exports to the SAM (Sequence Alignment/Map)
96 format, does sorting, merging and indexing, and allows reads in any
97 region to be retrieved swiftly.
98
99 Samtools is designed to work on a stream. It regards an input file `-'
100 as the standard input (stdin) and an output file `-' as the standard
101 output (stdout). Several commands can thus be combined with Unix
102 pipes. Samtools always outputs warning and error messages to the standard
103 error output (stderr).
104
105 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
106 HTTP server if the BAM file name starts with `ftp://' or `http://'.
107 Samtools checks the current working directory for the index file and
108 will download the index if it is absent. Samtools does not retrieve the
109 entire alignment file unless it is asked to do so.
110
111 .SH COMMANDS AND OPTIONS
112
113 .TP 10 \"-------- view
114 .B view
115 samtools view
116 .RI [ options ]
117 .IR in.sam | in.bam | in.cram
118 .RI [ region ...]
119
120 With no options or regions specified, prints all alignments in the specified
121 input alignment file (in SAM, BAM, or CRAM format) to standard output
122 in SAM format (with no header).
123
124 You may specify one or more space-separated region specifications after the
125 input filename to restrict output to only those alignments which overlap the
126 specified region(s). Use of region specifications requires a coordinate-sorted
127 and indexed input file (in BAM or CRAM format).
128
129 The
130 .BR -b ,
131 .BR -C ,
132 .BR -1 ,
133 .BR -u ,
134 .BR -h ,
135 .BR -H ,
136 and
137 .B -c
138 options change the output format from the default of headerless SAM, and the
139 .B -o
140 and
141 .B -U
142 options set the output file name(s).
143
144 The
145 .B -t
146 and
147 .B -T
148 options provide additional reference data. One of these two options is required
149 when SAM input does not contain @SQ headers, and the
150 .B -T
151 option is required whenever writing CRAM output.
152
153 The
154 .BR -L ,
155 .BR -r ,
156 .BR -R ,
157 .BR -s ,
158 .BR -q ,
159 .BR -l ,
160 .BR -m ,
161 .BR -f ,
162 .BR -F ,
163 and
164 .B -G
165 options filter the alignments that will be included in the output to only those
166 alignments that match certain criteria.
167
168 The
169 .B -x
170 and
171 .B -B
172 options modify the data which is contained in each alignment.
173
174 Finally, the
175 .B -@
176 option can be used to allocate additional threads to be used for compression, and the
177 .B -?
178 option requests a long help message.
179
180 .TP
181 .B REGIONS:
182 .RS
183 Regions can be specified as: RNAME[:STARTPOS[-ENDPOS]] and all position
184 coordinates are 1-based.
185
186 Important note: when multiple regions are given, some alignments may be output
187 multiple times if they overlap more than one of the specified regions.
188
189 Examples of region specifications:
190 .TP 10
191 .B chr1
192 Output all alignments mapped to the reference sequence named `chr1' (i.e. @SQ SN:chr1).
193 .TP
194 .B chr2:1000000
195 The region on chr2 beginning at base position 1,000,000 and ending at the
196 end of the chromosome.
197 .TP
198 .B chr3:1000-2000
199 The 1001bp region on chr3 beginning at base position 1,000 and ending at base
200 position 2,000 (including both end positions).
201 .TP
202 .B '*'
203 Output the unmapped reads at the end of the file.
204 (This does not include any unmapped reads placed on a reference sequence
205 alongside their mapped mates.)
206 .TP
207 .B .
208 Output all alignments.
209 (Mostly unnecessary as not specifying a region at all has the same effect.)
210 .RE
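The region syntax above can be sketched in a few lines of Python. This is an illustration only, not samtools' own parser: the names `Region` and `parse_region` are hypothetical, the comma handling is inferred from the comma-separated coordinates in the SYNOPSIS examples, and reference names that themselves contain a colon are not handled.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    rname: str            # reference sequence name, e.g. "chr3"
    start: Optional[int]  # 1-based start, or None when omitted
    end: Optional[int]    # 1-based inclusive end, or None when omitted


def parse_region(spec: str) -> Region:
    """Parse RNAME[:STARTPOS[-ENDPOS]] into its three parts."""
    rname, colon, coords = spec.partition(":")
    if not colon:
        # Bare name such as "chr1" (or the special "*" and "."): no coordinates.
        return Region(rname, None, None)
    coords = coords.replace(",", "")  # allow 20,100,000 style separators
    start_text, dash, end_text = coords.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid region: {spec!r}")
    return Region(rname, start, end)


region = parse_region("chr3:1000-2000")
# Both end positions are included, so the span is end - start + 1 bases.
span = region.end - region.start + 1
```

With the `chr3:1000-2000` example, `span` is 1001, matching the "1001bp region" described above.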
211
212 .B OPTIONS:
213 .RS
214 .TP 10
215 .B -b
216 Output in the BAM format.
217 .TP
218 .B -C
219 Output in the CRAM format (requires -T).
220 .TP
221 .B -1
222 Enable fast BAM compression (implies -b).
223 .TP
224 .B -u
225 Output uncompressed BAM. This option saves time spent on
226 compression/decompression and is thus preferred when the output is piped
227 to another samtools command.
228 .TP
229 .B -h
230 Include the header in the output.
231 .TP
232 .B -H
233 Output the header only.
234 .TP
235 .B -c
236 Instead of printing the alignments, only count them and print the
237 total number. All filter options, such as
238 .BR -f ,
239 .BR -F ,
240 and
241 .BR -q ,
242 are taken into account.
243 .TP
244 .B -?
245 Output long help and exit immediately.
246 .TP
247 .BI "-o " FILE
248 Output to
249 .I FILE [stdout].
250 .TP
251 .BI "-U " FILE
252 Write alignments that are
253 .I not
254 selected by the various filter options to
255 .IR FILE .
256 When this option is used, all alignments (or all alignments intersecting the
257 .I regions
258 specified) are written to either the output file or this file, but never both.
259 .TP
260 .BI "-t " FILE
261 A tab-delimited
262 .IR FILE .
263 Each line must contain the reference name in the first column and the length of
264 the reference in the second column, with one line for each distinct reference.
265 Any additional fields beyond the second column are ignored. This file also
266 defines the order of the reference sequences in sorting. If you run:
267 `samtools faidx <ref.fa>', the resulting index file
268 .I <ref.fa>.fai
269 can be used as this
270 .IR FILE .
271 .TP
272 .BI "-T " FILE
273 A FASTA format reference
274 .IR FILE ,
275 optionally compressed by
276 .B bgzip
277 and ideally indexed by
278 .B samtools
279 .BR faidx .
280 If an index is not present, one will be generated for you.
281 .TP
282 .BI "-L " FILE
283 Only output alignments overlapping the input BED
284 .I FILE
285 [null].
286 .TP
287 .BI "-r " STR
288 Only output alignments in read group
289 .I STR
290 [null].
291 .TP
292 .BI "-R " FILE
293 Output alignments in read groups listed in
294 .I FILE
295 [null].
296 .TP
297 .BI "-q " INT
298 Skip alignments with MAPQ smaller than
299 .I INT
300 [0].
301 .TP
302 .BI "-l " STR
303 Only output alignments in library
304 .I STR
305 [null].
306 .TP
307 .BI "-m " INT
308 Only output alignments with number of CIGAR bases consuming query
309 sequence \(>=
310 .I INT
311 [0]
312 .TP
313 .BI "-f " INT
314 Only output alignments with all bits set in
315 .I INT
316 present in the FLAG field.
317 .I INT
318 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
319 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
320 .TP
321 .BI "-F " INT
322 Do not output alignments with any bits set in
323 .I INT
324 present in the FLAG field.
325 .I INT
326 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
327 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
328 .TP
329 .BI "-G " INT
330 Do not output alignments with all bits set in
331 .I INT
332 present in the FLAG field. This is the opposite of \fI-f\fR such
333 that \fI-f12 -G12\fR is the same as no filtering at all.
334 .I INT
335 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
336 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
337 .TP
338 .BI "-x " STR
339 Read tag to exclude from output (repeatable) [null]
340 .TP
341 .B -B
342 Collapse the backward CIGAR operation.
343 .TP
344 .BI "-s " FLOAT
345 Output only a proportion of the input alignments.
346 This subsampling acts in the same way on all of the alignment records in
347 the same template or read pair, so it never keeps a read but not its mate.
348 .IP
349 The integer and fractional parts of the
350 .BI "-s " INT . FRAC
351 option are used separately: the part after the
352 decimal point sets the fraction of templates/pairs to be kept,
353 while the integer part is used as a seed that influences
354 .I which
355 subset of reads is kept.
356 .IP
357 .\" Reads are retained based on a score computed by hashing their QNAME
358 .\" field and the seed value.
359 When subsampling data that has previously been subsampled, be sure to use
360 a different seed value from those used previously; otherwise more reads
361 will be retained than expected.
362 .TP
363 .BI "-@ " INT
364 Number of BAM compression threads to use in addition to main thread [0].
365 .TP
366 .B -S
367 Ignored for compatibility with previous samtools versions.
368 Previously this option was required if input was in SAM format, but now the
369 correct format is automatically detected by examining the first few characters
370 of input.
371 .RE
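The FLAG filters described above (-f, -F and -G) reduce to three bitwise tests. The following Python sketch restates them under the assumption that each filter is applied independently to every record; `parse_flag_int` and `passes_flag_filters` are hypothetical names, not part of samtools.

```python
def parse_flag_int(text: str) -> int:
    """Accept decimal, hex (leading 0x) or octal (leading 0) flag values."""
    if text[:2].lower() == "0x":
        return int(text, 16)
    if len(text) > 1 and text[0] == "0":
        return int(text, 8)
    return int(text)


def passes_flag_filters(flag: int, f: int = 0, F: int = 0, G: int = 0) -> bool:
    """Return True if a record with this FLAG value would be output."""
    if (flag & f) != f:
        return False  # -f: every bit in f must be set in FLAG
    if flag & F:
        return False  # -F: any bit of F set in FLAG excludes the record
    if G and (flag & G) == G:
        return False  # -G: excluded only when all bits of G are set
    return True


# 0x4 is "read unmapped" and 0x8 is "mate unmapped", so 12 means both.
keep_mapped = passes_flag_filters(99, F=parse_flag_int("0x4"))
drop_both_unmapped = passes_flag_filters(77, G=12)
```

A value of 0 for any of the three arguments disables that filter, which mirrors the documented default of [0].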
372
373 .TP \"-------- sort
374 .B sort
375 .na
376 samtools sort
377 .RB [ -l
378 .IR level ]
379 .RB [ -m
380 .IR maxMem ]
381 .RB [ -o
382 .IR out.bam ]
383 .RB [ -O
384 .IR format ]
385 .RB [ -n ]
386 .RB [ -t
387 .IR tag ]
388 .RB [ -T
389 .IR tmpprefix ]
390 .RB [ -@
391 .IR threads "] [" in.sam | in.bam | in.cram ]
392 .ad
393
394 Sort alignments by leftmost coordinates, or by read name when
395 .B -n
396 is used.
397 An appropriate
398 .B @HD-SO
399 sort order header tag will be added or an existing one updated if necessary.
400
401 The sorted output is written to standard output by default, or to the
402 specified file
403 .RI ( out.bam )
404 when
405 .B -o
406 is used.
407 This command will also create temporary files
408 .IB tmpprefix . %d .bam
409 as needed when the entire alignment data cannot fit into memory
410 (as controlled via the
411 .B -m
412 option).
413
414 .B Options:
415 .RS
416 .TP 11
417 .BI "-l " INT
418 Set the desired compression level for the final output file, ranging from 0
419 (uncompressed) or 1 (fastest but minimal compression) to 9 (best compression
420 but slowest to write), similarly to
421 .BR gzip (1)'s
422 compression level setting.
423 .IP
424 If
425 .B -l
426 is not used, the default compression level will apply.
427 .TP
428 .BI "-m " INT
429 Approximately the maximum required memory per thread, specified either in bytes
430 or with a
431 .BR K ", " M ", or " G
432 suffix.
433 [768 MiB]
434 .IP
435 To prevent sort from creating a huge number of temporary files, it enforces a
436 minimum value of 1M for this setting.
437 .TP
438 .B -n
439 Sort by read names (i.e., the
440 .B QNAME
441 field) rather than by chromosomal coordinates.
442 .TP
443 .BI "-t " TAG
444 Sort first by the value in the alignment tag TAG, then by position or name (if
445 also using \fB-n\fP).
.TP
446 .BI "-o " FILE
447 Write the final sorted output to
448 .IR FILE ,
449 rather than to standard output.
450 .TP
451 .BI "-O " FORMAT
452 Write the final output as
453 .BR sam ", " bam ", or " cram .
454
455 By default, samtools tries to select a format based on the
456 .B -o
457 filename extension; if output is to standard output or no format can be
458 deduced,
459 .B bam
460 is selected.
461 .TP
462 .BI "-T " PREFIX
463 Write temporary files to
464 .IB PREFIX . nnnn .bam,
465 or if the specified
466 .I PREFIX
467 is an existing directory, to
468 .IB PREFIX /samtools. mmm . mmm .tmp. nnnn .bam,
469 where
470 .I mmm
471 is unique to this invocation of the
472 .B sort
473 command.
474 .IP
475 By default, any temporary files are written alongside the output file, as
476 .IB out.bam .tmp. nnnn .bam,
477 or if output is to standard output, in the current directory as
478 .BI samtools. mmm . mmm .tmp. nnnn .bam.
479 .TP
480 .BI "-@ " INT
481 Set number of sorting and compression threads.
482 By default, operation is single-threaded.
483 .PP
484 .B Ordering Rules
485
486 The following rules are used for ordering records.
487
488 If option \fB-t\fP is in use, records are first sorted by the value of
489 the given alignment tag, and then by position or name (if using \fB-n\fP).
490 For example, \*(lq-t RG\*(rq will make read group the primary sort key. The
491 rules for ordering by tag are:
492
493 .IP \(bu 4
494 Records that do not have the tag are sorted before ones that do.
495 .IP \(bu 4
496 If the types of the tags are different, they will be sorted so
497 that single character tags (type A) come before array tags (type B), then
498 string tags (types H and Z), then numeric tags (types f and i).
499 .IP \(bu 4
500 Numeric tags (types f and i) are compared by value. Note that comparisons
501 of floating-point values are subject to issues of rounding and precision.
502 .IP \(bu 4
503 String tags (types H and Z) are compared based on the binary
504 contents of the tag using the C
505 .BR strcmp (3)
506 function.
507 .IP \(bu 4
508 Character tags (type A) are compared by binary character value.
509 .IP \(bu 4
510 No attempt is made to compare tags of other types \(em notably type B
511 array values will not be compared.
512 .PP
513 When the \fB-n\fP option is present, records are sorted by name. Names are
514 compared so as to give a \*(lqnatural\*(rq ordering \(em i.e. sections
515 consisting of digits are compared numerically while all other sections are
516 compared based on their binary representation. This means \*(lqa1\*(rq will
517 come before \*(lqb1\*(rq and \*(lqa9\*(rq will come before \*(lqa10\*(rq.
518 Records with the same name will be ordered according to the values of
519 the READ1 and READ2 flags (see
520 .BR flags ).
521
522 When the \fB-n\fP option is
523 .B not
524 present, reads are sorted by reference (according to the order of the @SQ
525 header records), then by position in the reference, and then by the REVERSE
526 flag.
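The "natural" name ordering used with -n can be approximated as follows. This is a minimal Python sketch of the rule as stated above (digit runs compared numerically, everything else compared character by character); it is not the samtools implementation, and details such as the treatment of leading zeros are assumptions. The name `natural_cmp` is hypothetical.

```python
from functools import cmp_to_key


def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def natural_cmp(a: str, b: str) -> int:
    """Compare two read names; negative, zero or positive like strcmp."""
    i = j = 0
    while i < len(a) and j < len(b):
        if _is_digit(a[i]) and _is_digit(b[j]):
            # Consume the whole run of digits on each side.
            i_end = i
            while i_end < len(a) and _is_digit(a[i_end]):
                i_end += 1
            j_end = j
            while j_end < len(b) and _is_digit(b[j_end]):
                j_end += 1
            num_a, num_b = int(a[i:i_end]), int(b[j:j_end])
            if num_a != num_b:
                return -1 if num_a < num_b else 1
            i, j = i_end, j_end
        else:
            if a[i] != b[j]:
                return -1 if a[i] < b[j] else 1
            i += 1
            j += 1
    # Whichever name still has characters left sorts later.
    return (i < len(a)) - (j < len(b))


names = ["a10", "b1", "a9", "a1"]
ordered = sorted(names, key=cmp_to_key(natural_cmp))
```

Here `ordered` places "a1" before "b1" and "a9" before "a10", as in the examples given above.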
527
528 .B Note
529
530 .PP
531 Historically
532 .B samtools sort
533 also accepted a less flexible way of specifying the final and
534 temporary output filenames:
535 .IP
536 samtools sort
537 .RB [ -f "] [" -o ]
538 .I in.bam out.prefix
539 .PP
540 This has now been removed.
541 The previous \fIout.prefix\fP argument (and \fB-f\fP option, if any)
542 should be changed to an appropriate combination of \fB-T\fP \fIPREFIX\fP
543 and \fB-o\fP \fIFILE\fP. The previous \fB-o\fP option should be removed,
544 as output defaults to standard output.
545 .RE
546
547 .TP \"-------- index
548 .B index
549 samtools index
550 .RB [ -bc ]
551 .RB [ -m
552 .IR INT ]
553 .IR aln.bam | aln.cram
554 .RI [ out.index ]
555
556 Index a coordinate-sorted BAM or CRAM file for fast random access.
557 (Note that this does not work with SAM files even if they are bgzip
558 compressed \(em to index such files, use tabix(1) instead.)
559
560 This index is needed when
561 .I region
562 arguments are used to limit
563 .B samtools view
564 and similar commands to particular regions of interest.
565
566 If an output filename is given, the index file will be written to
567 .IR out.index .
568 Otherwise, for a CRAM file
569 .IR aln.cram ,
570 index file
571 .IB aln.cram .crai
572 will be created; for a BAM file
573 .IR aln.bam ,
574 either
575 .IB aln.bam .bai
576 or
577 .IB aln.bam .csi
578 will be created, depending on the index format selected.
579
580 .B Options:
581 .RS
582 .TP 8
583 .B -b
584 Create a BAI index.
585 This is currently the default when no format options are used.
586 .TP
587 .B -c
588 Create a CSI index.
589 By default, the minimum interval size for the index is 2^14, which is the same
590 as the fixed value used by the BAI format.
591 .TP
592 .BI "-m " INT
593 Create a CSI index, with a minimum interval size of 2^INT.
594 .RE
595
596 .TP \"-------- idxstats
597 .B idxstats
598 samtools idxstats
599 .IR in.sam | in.bam | in.cram
600
601 Retrieve and print stats in the index file corresponding to the input file.
602 Before calling idxstats, the input BAM file must be indexed by samtools index.
603
604 The output is TAB-delimited with each line consisting of reference sequence
605 name, sequence length, # mapped reads and # unmapped reads. It is written to
606 stdout.
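Because the output is plain TAB-delimited text with four columns, it is easy to post-process. The Python sketch below parses such output into tuples; `parse_idxstats` is a hypothetical helper and the sample text is made up for illustration, not real samtools output.

```python
def parse_idxstats(text: str):
    """Return (name, length, mapped, unmapped) tuples, one per output line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


# Made-up example in the documented column order.
sample = "chr1\t1000\t42\t1\nchr2\t500\t7\t0\n"
rows = parse_idxstats(sample)
total_mapped = sum(mapped for _, _, mapped, _ in rows)
```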
607
608 .TP \"-------- flagstat
609 .B flagstat
610 samtools flagstat
611 .IR in.sam | in.bam | in.cram
612
613 Does a full pass through the input file to calculate and print statistics
614 to stdout.
615
616 Provides counts for each of 13 categories based primarily on bit flags in
617 the FLAG field. Each category in the output is broken down into QC pass and
618 QC fail, which is presented as "#PASS + #FAIL" followed by a description of
619 the category.
620
621 The first row of output gives the total number of reads that are QC pass and
622 fail (according to flag bit 0x200). For example:
623
624 122 + 28 in total (QC-passed reads + QC-failed reads)
625
626 This indicates that there are a total of 150 reads in the input file,
627 122 of which are marked as QC pass and 28 of which are marked as "not passing
628 quality controls".
629
630 Following this, additional categories are given for reads which are:
631
632 .RS 18
633 .TP
634 secondary
635 0x100 bit set
636 .TP
637 supplementary
638 0x800 bit set
639 .TP
640 duplicates
641 0x400 bit set
642 .TP
643 mapped
644 0x4 bit not set
645 .TP
646 paired in sequencing
647 0x1 bit set
648 .TP
649 read1
650 both 0x1 and 0x40 bits set
651 .TP
652 read2
653 both 0x1 and 0x80 bits set
654 .TP
655 properly paired
656 both 0x1 and 0x2 bits set and 0x4 bit not set
657 .TP
658 with itself and mate mapped
659 0x1 bit set and neither 0x4 nor 0x8 bits set
660 .TP
661 singletons
662 both 0x1 and 0x8 bits set and bit 0x4 not set
663 .RE
664
665 .RS 10
666 And finally, two rows are given that additionally filter on the reference
667 name (RNAME), mate reference name (MRNM), and mapping quality (MAPQ) fields:
668 .RE
669
670 .RS 18
671 .TP
672 with mate mapped to a different chr
673 0x1 bit set and neither 0x4 nor 0x8 bits set and MRNM not equal to RNAME
674 .TP
675 with mate mapped to a different chr (mapQ>=5)
676 0x1 bit set and neither 0x4 nor 0x8 bits set
677 and MRNM not equal to RNAME and MAPQ >= 5
678 .RE
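The flag-based categories listed above can be restated as a small Python function. This sketch covers only the categories defined purely by FLAG bits (not the two rows that also depend on RNAME, MRNM and MAPQ); `flagstat_categories` is a hypothetical name, and the code illustrates the table rather than reproducing samtools' counting code.

```python
def flagstat_categories(flag: int) -> set:
    """Return the flagstat categories that a single FLAG value falls into."""
    paired = bool(flag & 0x1)
    mapped = not (flag & 0x4)
    mate_mapped = not (flag & 0x8)

    categories = set()
    if flag & 0x100:
        categories.add("secondary")
    if flag & 0x800:
        categories.add("supplementary")
    if flag & 0x400:
        categories.add("duplicates")
    if mapped:
        categories.add("mapped")
    if paired:
        categories.add("paired in sequencing")
        if flag & 0x40:
            categories.add("read1")
        if flag & 0x80:
            categories.add("read2")
        if (flag & 0x2) and mapped:
            categories.add("properly paired")
        if mapped and mate_mapped:
            categories.add("with itself and mate mapped")
        if mapped and not mate_mapped:
            categories.add("singletons")
    return categories


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: a properly paired first read.
example = flagstat_categories(99)
```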
679
680 .TP \"-------- stats
681 .B stats
682 samtools stats
683 .RI [ options ]
684 .IR in.sam | in.bam | in.cram
685 .RI [ region ...]
686
687 samtools stats collects statistics from BAM files and outputs them in a text format.
688 The output can be visualized graphically using plot-bamstats.
689
690 .B Options:
691 .RS
692 .TP 8
693 .BI "-c, --coverage " MIN , MAX , STEP
694 Set coverage distribution to the specified range (MIN, MAX, STEP all given as integers)
695 [1,1000,1]
696 .TP
697 .B -d, --remove-dups
698 Exclude from statistics reads marked as duplicates
699 .TP
700 .BI "-f, --required-flag " STR "|" INT
701 Required flag, 0 for unset. See also `samtools flags`
702 [0]
703 .TP
704 .BI "-F, --filtering-flag " STR "|" INT
705 Filtering flag, 0 for unset. See also `samtools flags`
706 [0]
707 .TP
708 .BI "--GC-depth " FLOAT
709 The size of GC-depth bins (decreasing bin size increases memory requirement)
710 [2e4]
711 .TP
712 .B -h, --help
713 This help message
714 .TP
715 .BI "-i, --insert-size " INT
716 Maximum insert size
717 [8000]
718 .TP
719 .BI "-I, --id " STR
720 Include only listed read group or sample name
721 []
722 .TP
723 .BI "-l, --read-length " INT
724 Include in the statistics only reads with the given read length
725 []
726 .TP
727 .BI "-m, --most-inserts " FLOAT
728 Report only the main part of inserts
729 [0.99]
730 .TP
731 .BI "-P, --split-prefix " STR
732 A path or string prefix to prepend to filenames output when creating
733 categorised statistics files with
734 .BR -S / --split .
735 [input filename]
736 .TP
737 .BI "-q, --trim-quality " INT
738 The BWA trimming parameter
739 [0]
740 .TP
741 .BI "-r, --ref-seq " FILE
742 Reference sequence (required for GC-depth and mismatches-per-cycle calculation).
743 []
744 .TP
745 .BI "-S, --split " TAG
746 In addition to the complete statistics, also output categorised statistics
747 based on the tagged field
748 .I TAG
749 (e.g., use
750 .B --split RG
751 to split into read groups).
752
753 Categorised statistics are written to files named
754 .RI < prefix >_< value >.bamstat,
755 where
756 .I prefix
757 is as given by
758 .B --split-prefix
759 (or the input filename by default) and
760 .I value
761 has been encountered as the specified tagged field's value in one or more
762 alignment records.
763 .TP
764 .BI "-t, --target-regions " FILE
765 Do stats in these regions only. Tab-delimited file chr,from,to, 1-based, inclusive.
766 []
767 .TP
768 .B "-x, --sparse"
769 Suppress outputting IS rows where there are no insertions.
770 .RE
771
772 .TP \"-------- bedcov
773 .B bedcov
774 samtools bedcov
775 .RI [ options ]
776 .IR region.bed " " in1.sam | in1.bam | in1.cram "[...]"
777
778 Reports the total read base count (i.e. the sum of per base read depths)
779 for each genomic region specified in the supplied BED file.
780 Counts for each alignment file supplied are reported in separate columns.
781
782 .B Options:
783 .RS
784 .TP
785 .BI "-Q " INT
786 .RI "Only count reads with mapping quality greater than " INT
787 .RE
788
789 .TP \"-------- depth
790 .B depth
791 samtools depth
792 .RI [ options ]
793 .RI "[" in1.sam | in1.bam | in1.cram " [" in2.sam | in2.bam | in2.cram "] [...]]"
794
795 Computes the depth at each position or region.
796
797 .B Options:
798 .RS
799 .TP 8
800 .B -a
801 Output all positions (including those with zero depth)
802 .TP
803 .B -a -a, -aa
804 Output absolutely all positions, including unused reference sequences.
805 Note that when used in conjunction with a BED file the -a option may
806 sometimes operate as if -aa was specified if the reference sequence
807 has coverage outside of the region specified in the BED file.
808 .TP
809 .BI "-b " FILE
810 .RI "Compute depth at list of positions or regions in specified BED " FILE.
811 []
812 .TP
813 .BI "-f " FILE
814 .RI "Use the BAM files specified in the " FILE
815 (a file of filenames, one file per line)
816 []
817 .TP
818 .BI "-l " INT
819 .RI "Ignore reads shorter than " INT
820 .TP
821 .BI "-m, -d " INT
822 .RI "Truncate reported depth at a maximum of " INT " reads."
823 [8000]
824 .TP
825 .BI "-q " INT
826 .RI "Only count reads with base quality greater than " INT
827 .TP
828 .BI "-Q " INT
829 .RI "Only count reads with mapping quality greater than " INT
830 .TP
831 .BI "-r " CHR ":" FROM "-" TO
832 Only report depth in specified region.
833 .RE
834
835 .TP \"-------- merge
836 .B merge
837 samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... <inN.bam>]
838
839 Merge multiple sorted alignment files, producing a single sorted output file
840 that contains all the input records and maintains the existing sort order.
841
842 If
843 .BR -h
844 is specified, the @SQ headers of input files will be merged into the specified header; otherwise they will be merged
845 into a composite header created from the input headers. If, in the process of merging @SQ lines for coordinate-sorted
846 input files, a conflict arises as to the order (for example input1.bam has @SQ for a,b,c and input2.bam has b,a,c),
847 then the resulting output file will need to be re-sorted back into coordinate order.
848
849 Unless the
850 .BR -c
851 or
852 .BR -p
853 flags are specified, when @RG and @PG records are merged into the output header, any IDs found to be duplicates
854 of existing IDs in the output header will have a suffix appended to differentiate them from similar header
855 records from other files, and the read records will be updated to reflect this.
856
857 The ordering of the records in the input files must match the usage of the
858 \fB-n\fP and \fB-t\fP command-line options. If they do not, the output
859 order will be undefined. See
860 .B sort
861 for information about record ordering.
862
863 .B OPTIONS:
864 .RS
865 .TP 8
866 .B -1
867 Use zlib compression level 1 to compress the output.
868 .TP
869 .BI -b \ FILE
870 List of input BAM files, one file per line.
871 .TP
872 .B -f
873 Overwrite the output file if it already exists.
874 .TP 8
875 .BI -h \ FILE
876 Use the lines of
877 .I FILE
878 as `@' headers to be copied to
879 .IR out.bam ,
880 replacing any header lines that would otherwise be copied from
881 .IR in1.bam .
882 .RI ( FILE
883 is actually in SAM format, though any alignment records it may contain
884 are ignored.)
885 .TP
886 .B -n
887 The input alignments are sorted by read names rather than by chromosomal
888 coordinates
889 .TP
890 .B -t TAG
891 The input alignments have been sorted by the value of TAG, then by either
892 position or name (if \fB-n\fP is given).
893 .TP
894 .BI -R \ STR
895 Merge files in the specified region indicated by
896 .I STR
897 [null]
898 .TP
899 .B -r
900 Attach an RG tag to each alignment. The tag value is inferred from file names.
901 .TP
902 .B -u
903 Uncompressed BAM output
904 .TP
905 .B -c
906 When several input files contain @RG headers with the same ID, emit only one
907 of them (namely, the header line from the first file we find that ID in) to
908 the merged output file.
909 Combining these similar headers is usually the right thing to do when the
910 files being merged originated from the same file.
911
912 Without \fB-c\fP, all @RG headers appear in the output file, with random
913 suffixes added to their IDs where necessary to differentiate them.
914 .TP
915 .B -p
916 Similarly, for each @PG ID in the set of files to merge, use the @PG line
917 of the first file we find that ID in rather than adding a suffix to
918 differentiate similar IDs.
919 .RE
920
921 .TP \"-------- faidx
922 .B faidx
923 samtools faidx <ref.fasta> [region1 [...]]
924
925 Index reference sequence in the FASTA format or extract subsequence from
926 indexed reference sequence. If no region is specified,
927 .B faidx
928 will index the file and create
929 .I <ref.fasta>.fai
930 on the disk. If regions are specified, the subsequences will be
931 retrieved and printed to stdout in the FASTA format.
932
933 The input file can be compressed in the
934 .B BGZF
935 format.
936
937 The sequences in the input file should all have different names.
938 If they do not, indexing will emit a warning about duplicate sequences and
939 retrieval will only produce subsequences from the first sequence with the
940 duplicated name.
941
942 .TP \"-------- tview
943 .B tview
944 samtools tview
945 .RB [ -p
946 .IR chr:pos ]
947 .RB [ -s
948 .IR STR ]
949 .RB [ -d
950 .IR display ]
951 .RI <in.sorted.bam>
952 .RI [ref.fasta]
953
954 Text alignment viewer (based on the ncurses library). In the viewer,
955 press `?' for help and press `g' to jump to the alignments starting at a
956 region given in a format like `chr10:10,000,000', or `=10,000,000' when
957 viewing the same reference sequence.
958
959 .B Options:
960 .RS
961 .TP 14
962 .BI -d \ display
963 Output as (H)tml or (C)urses or (T)ext
964 .TP
965 .BI -p \ chr:pos
966 Go directly to this position
967 .TP
968 .BI -s \ STR
969 Display only alignments from this sample or read group
970 .RE
971
972 .TP \"-------- split
973 .B split
974 samtools split
975 .RI [ options ]
976 .IR merged.sam | merged.bam | merged.cram
977
978 Splits a file by read group.
979
980 .B Options:
981 .RS
982 .TP 14
983 .BI "-u " FILE1
984 .RI "Put reads with no RG tag or an unrecognised RG tag into " FILE1
985 .TP
986 .BI "-u " FILE1 ":" FILE2
987 .RI "As above, but assigns an RG tag as given in the header of " FILE2
988 .TP
989 .BI "-f " STRING
990 Output filename format string (see below)
991 ["%*_%#.%."]
992 .TP
993 .B -v
994 Verbose output
995 .PP
996 Format string expansions:
997 .TS
998 center;
999 lb l .
1000 %% %
1001 %* basename
1002 %# @RG index
1003 %! @RG ID
1004 %. output format filename extension
1005 .TE
1006 .RE
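The format string expansions in the table above can be sketched in Python as a simple substitution. The function `expand_split_format` and its parameters are hypothetical, and the treatment of unknown `%` sequences (left unchanged here) is an assumption rather than documented behaviour.

```python
import re


def expand_split_format(fmt: str, basename: str, rg_index: int,
                        rg_id: str, extension: str) -> str:
    """Expand %%, %*, %#, %! and %. as listed in the table above."""
    table = {
        "%": "%",            # %% -> literal percent sign
        "*": basename,       # %* -> basename
        "#": str(rg_index),  # %# -> @RG index
        "!": rg_id,          # %! -> @RG ID
        ".": extension,      # %. -> output format filename extension
    }

    def substitute(match):
        # Unknown sequences are kept as written (an assumption).
        return table.get(match.group(1), match.group(0))

    return re.sub(r"%(.)", substitute, fmt)


# Default format string with illustrative values.
default_name = expand_split_format("%*_%#.%.", "merged", 0, "grp1", "bam")
```

With these illustrative values, the default format string yields `merged_0.bam`.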
1007
1008 .TP \"-------- quickcheck
1009 .B quickcheck
1010 samtools quickcheck
1011 .RI [ options ]
1012 .IR in.sam | in.bam | in.cram
1013 [ ... ]
1014
1015 Quickly check that input files appear to be intact. Checks that the beginning of the
1016 file contains a valid header (all formats) containing at least one target
1017 sequence and then seeks to the end of the file and checks that an end-of-file
1018 (EOF) is present and intact (BAM only).
1019
1020 Data in the middle of the file is not read since that would be much more time
1021 consuming, so please note that this command will not detect internal corruption,
1022 but is useful for testing that files are not truncated before performing more
1023 intensive tasks on them.
1024
1025 This command will exit with a non-zero exit code if any input files don't have a
1026 valid header or are missing an EOF block. Otherwise it will exit successfully
1027 (with a zero exit code).
1028
1029 .B Options:
1030 .RS
1031 .TP 8
1032 .B -v
1033 Verbose output: will additionally print the names of all input files that don't
1034 pass the check to stdout. Multiple -v options will cause additional messages
1035 regarding check results to be printed to stderr.
1036 .RE
1037
1038 .TP \"-------- dict
1039 .B dict
1040 samtools dict <ref.fasta|ref.fasta.gz>
1041
1042 Create a sequence dictionary file from a FASTA file.
1043
1044 .B OPTIONS:
1045 .RS
1046 .TP 11
1047 .BI -a,\ --assembly \ STR
1048 Specify the assembly for the AS tag.
1049 .TP
1050 .B -H,\ --no-header
1051 Do not print the @HD header line.
1052 .TP
1053 .BI -o,\ --output \ FILE
1054 Output to
1055 .I FILE
1056 [stdout].
1057 .TP
1058 .BI -s,\ --species \ STR
1059 Specify the species for the SP tag.
1060 .TP
1061 .BI -u,\ --uri \ STR
1062 Specify the URI for the UR tag. Defaults to
1063 the absolute path of
1064 .I ref.fasta
1065 unless reading from stdin.
1066 .RE
1067
1068 .TP \"-------- fixmate
1069 .B fixmate
1070 .na
1071 samtools fixmate
1072 .RB [ -rpc ]
1073 .RB [ -O
1074 .IR format ]
1075 .I in.nameSrt.bam out.bam
1076 .ad
1077
1078 Fill in mate coordinates, ISIZE and mate related flags from a
1079 name-sorted alignment.
1080
1081 .B OPTIONS:
1082 .RS
1083 .TP 11
1084 .B -r
1085 Remove secondary and unmapped reads.
1086 .TP
1087 .B -p
1088 Disable FR proper pair check.
1089 .TP
1090 .B -c
1091 Add template cigar ct tag.
1092 .TP
1093 .BI "-O " FORMAT
1094 Write the final output as
1095 .BR sam ", " bam ", or " cram .
1096
1097 By default, samtools tries to select a format based on the output
1098 filename extension; if output is to standard output or no format can be
1099 deduced,
1100 .B bam
1101 is selected.
1102 .RE
1103
1104 .TP \"-------- mpileup
1105 .B mpileup
1106 samtools mpileup
1107 .RB [ -EBugp ]
1108 .RB [ -C
1109 .IR capQcoef ]
1110 .RB [ -r
1111 .IR reg ]
1112 .RB [ -f
1113 .IR in.fa ]
1114 .RB [ -l
1115 .IR list ]
1116 .RB [ -Q
1117 .IR minBaseQ ]
1118 .RB [ -q
1119 .IR minMapQ ]
1120 .I in.bam
1121 .RI [ in2.bam
1122 .RI [ ... ]]
1123
1124 Generate VCF, BCF or pileup for one or multiple BAM files. Alignment records
1125 are grouped by sample (SM) identifiers in @RG header lines. If sample
1126 identifiers are absent, each input file is regarded as one sample.
1127
1128 In the pileup format (without
1129 .BR -u \ or \ -g ),
1130 each
1131 line represents a genomic position, consisting of chromosome name,
1132 1-based coordinate, reference base, the number of reads covering the site,
1133 read bases, base qualities and alignment
1134 mapping qualities. Information on match, mismatch, indel, strand,
1135 mapping quality and start and end of a read are all encoded at the read
1136 base column. At this column, a dot stands for a match to the reference
1137 base on the forward strand, a comma for a match on the reverse strand,
1138 a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
1139 strand and `acgtn' for a mismatch on the reverse strand. A pattern
1140 `\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
1141 reference position and the next reference position. The length of the
1142 insertion is given by the integer in the pattern, followed by the
1143 inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
1144 represents a deletion from the reference. The deleted bases will be
1145 presented as `*' in the following lines. Also at the read base column, a
1146 symbol `^' marks the start of a read. The ASCII of the character
1147 following `^' minus 33 gives the mapping quality. A symbol `$' marks the
1148 end of a read segment.
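.IP
The encoding of the read base column described above can be sketched
as a small tokenizer. The Python below is an illustration written for
this page, not part of samtools; the function name parse_read_bases
and the event labels are assumptions, and the example column is made
up.
.EX 4
```python
import re

# An indel starts with + or - followed by its length in decimal.
INDEL = re.compile("([+-])([0-9]+)")

SINGLE = {
    ".": "match_forward",
    ",": "match_reverse",
    ">": "ref_skip",
    "<": "ref_skip",
    "*": "deleted_base",
    "$": "read_end",
}


def parse_read_bases(column):
    """Split a pileup read-base column into (kind, detail) tuples."""
    events = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # The next character encodes mapping quality as ASCII minus 33.
            events.append(("read_start", ord(column[i + 1]) - 33))
            i += 2
        elif ch in "+-":
            match = INDEL.match(column, i)
            if match is None:
                raise ValueError("indel without a length at offset %d" % i)
            length = int(match.group(2))
            start = match.end()
            kind = "insertion" if ch == "+" else "deletion"
            events.append((kind, column[start:start + length]))
            i = start + length
        elif ch in SINGLE:
            events.append((SINGLE[ch], None))
            i += 1
        elif ch in "ACGTN":
            events.append(("mismatch_forward", ch))
            i += 1
        elif ch in "acgtn":
            events.append(("mismatch_reverse", ch))
            i += 1
        else:
            raise ValueError("unexpected character %r" % ch)
    return events


for event in parse_read_bases("^F.+2AG,$t*"):
    print(event)
```
.EE
The character after `^' is consumed together with it, so a mapping
quality character that happens to look like a base or a sign is not
misread as one.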
1149
1150 Note that there are two orthogonal ways to specify locations in the
1151 input file: via \fB-r\fR \fIregion\fR and \fB-l\fR \fIfile\fR. The
1152 former uses (and requires) an index to do random access while the
1153 latter streams through the file contents filtering out the specified
1154 regions, requiring no index. The two may be used in conjunction. For
1155 example a BED file containing locations of genes in chromosome 20
1156 could be specified using \fB-r 20 -l chr20.bed\fR, meaning that the
1157 index is used to find chromosome 20 and then it is filtered for the
1158 regions listed in the bed file.
1159
1160 .B Input Options:
1161 .RS
1162 .TP 10
1163 .B -6, --illumina1.3+
1164 Assume the quality is in the Illumina 1.3+ encoding.
1165 .TP
1166 .B -A, --count-orphans
1167 Do not skip anomalous read pairs in variant calling.
1168 .TP
1169 .BI -b,\ --bam-list \ FILE
1170 List of input BAM files, one file per line [null]
1171 .TP
1172 .B -B, --no-BAQ
1173 Disable probabilistic realignment for the computation of base alignment
1174 quality (BAQ). BAQ is the Phred-scaled probability of a read base being
1175 misaligned. BAQ greatly helps to reduce false SNPs caused by misalignments,
1176 so disabling it with this option may increase such false calls.
1177 .TP
1178 .BI -C,\ --adjust-MQ \ INT
1179 Coefficient for downgrading mapping quality for reads containing
1180 excessive mismatches. Given a read with a phred-scaled probability q of
1181 being generated from the mapped position, the new mapping quality is
1182 about sqrt((INT-q)/INT)*INT. A zero value disables this
1183 functionality; if enabled, the recommended value for BWA is 50. [0]
1184 .TP
1185 .BI -d,\ --max-depth \ INT
1186 At a position, read maximally
1187 .I INT
1188 reads per input file. Note that samtools has a minimum value of
1189 .I 8000/n
1190 where
1191 .I n
1192 is the number of input files given to mpileup. This means the default
1193 is highly likely to be increased. Once above the cross-sample minimum of
1194 8000 the -d parameter will have an effect. [250]
1195 .TP
1196 .B -E, --redo-BAQ
1197 Recalculate BAQ on the fly, ignore existing BQ tags
1198 .TP
1199 .BI -f,\ --fasta-ref \ FILE
1200 The
1201 .BR faidx -indexed
1202 reference file in the FASTA format. The file can be optionally compressed by
1203 .BR bgzip .
1204 [null]
1205 .TP
1206 .BI -G,\ --exclude-RG \ FILE
1207 Exclude reads from readgroups listed in FILE (one @RG-ID per line)
1208 .TP
1209 .BI -l,\ --positions \ FILE
1210 BED or position list file containing a list of regions or sites where
1211 pileup or BCF should be generated. Position list files contain two
1212 columns (chromosome and position) and start counting from 1. BED
1213 files contain at least 3 columns (chromosome, start and end position)
1214 and are 0-based half-open.
1215 .br
1216 While it is possible to mix both position-list and BED coordinates in
1217 the same file, this is strongly discouraged due to the differing
1218 coordinate systems. [null]
1219 .TP
1220 .BI -q,\ --min-MQ \ INT
1221 Minimum mapping quality for an alignment to be used [0]
1222 .TP
1223 .BI -Q,\ --min-BQ \ INT
1224 Minimum base quality for a base to be considered [13]
1225 .TP
1226 .BI -r,\ --region \ STR
1227 Only generate pileup in region
1228 .IR STR .
1229 Requires the BAM files to be indexed.
1230 If used in conjunction with -l then considers the intersection of the
1231 two requests. [all sites]
1232 .TP
1233 .B -R,\ --ignore-RG
1234 Ignore RG tags. Treat all reads in one BAM as one sample.
1235 .TP
1236 .BI --rf,\ --incl-flags \ STR|INT
1237 Required flags: skip reads with mask bits unset [null]
1238 .TP
1239 .BI --ff,\ --excl-flags \ STR|INT
1240 Filter flags: skip reads with mask bits set
1241 [UNMAP,SECONDARY,QCFAIL,DUP]
1242 .TP
1243 .B -x,\ --ignore-overlaps
1244 Disable read-pair overlap detection.
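.PP
Two of the input options above involve simple arithmetic, sketched
here in Python. This is a hedged illustration only: the function names
are hypothetical, the first function applies the approximate formula
quoted for
.B -C
and nothing more, and the second assumes that the 8000/n floor
described for
.B -d
uses integer division.
.EX 4
```python
import math


def adjusted_mapq(q, coef):
    """Approximate new mapping quality: sqrt((coef - q) / coef) * coef."""
    if coef <= 0:
        raise ValueError("a zero coefficient disables the adjustment")
    if q > coef:
        raise ValueError("this sketch only covers q <= coef")
    return math.sqrt((coef - q) / coef) * coef


def effective_max_depth(requested, n_files):
    """Per-file depth after the 8000/n floor (integer division assumed)."""
    return max(requested, 8000 // n_files)


print(round(adjusted_mapq(10, 50), 2))
print(effective_max_depth(250, 1), effective_max_depth(250, 40))
```
.EE
Under these assumptions the default depth of 250 is raised for a
single input file and only takes effect unchanged once enough files
are given that 8000/n drops below it.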
1245 .PP
1246 .B Output Options:
1247 .TP 10
1248 .BI "-o, --output " FILE
1249 Write pileup or VCF/BCF output to
1250 .IR FILE ,
1251 rather than the default of standard output.
1252
1253 (The same short option is used for both
1254 .B --open-prob
1255 and
1256 .BR --output .
1257 If
1258 .BR -o 's
1259 argument contains any non-digit characters other than a leading + or - sign,
1260 it is interpreted as
1261 .BR --output .
1262 Usually the filename extension will take care of this, but to write to an
1263 entirely numeric filename use
1264 .B -o ./123
1265 or
1266 .BR "--output 123" .)
1267 .TP
1268 .B -g,\ --BCF
1269 Compute genotype likelihoods and output them in the binary call format (BCF).
1270 As of v1.0, this is BCF2 which is incompatible with the BCF1 format produced
1271 by previous (0.1.x) versions of samtools.
1272 .TP
1273 .B -v,\ --VCF
1274 Compute genotype likelihoods and output them in the variant call format (VCF).
1275 Output is bgzip-compressed VCF unless the
1276 .B -u
1277 option is set.
1278 .PP
1279 .B Output Options for mpileup format (without -g or -v):
1280 .TP 10
1281 .B -O, --output-BP
1282 Output base positions on reads.
1283 .TP
1284 .B -s, --output-MQ
1285 Output mapping quality.
1286 .TP
1287 .B -a
1288 Output all positions, including those with zero depth.
1289 .TP
1290 .B -a -a, -aa
1291 Output absolutely all positions, including unused reference sequences.
1292 Note that when used in conjunction with a BED file the -a option may
1293 sometimes operate as if -aa was specified if the reference sequence
1294 has coverage outside of the region specified in the BED file.
1295 .PP
1296 .B Output Options for VCF/BCF format (with -g or -v):
1297 .TP 10
1298 .B -D
1299 Output per-sample read depth [DEPRECATED - use
1300 .B -t DP
1301 instead]
1302 .TP
1303 .B -S
1304 Output per-sample Phred-scaled strand bias P-value [DEPRECATED - use
1305 .B -t SP
1306 instead]
1307 .TP
1308 .BI -t,\ --output-tags \ LIST
1309 Comma-separated list of FORMAT and INFO tags to output (case-insensitive):
1310 .B AD
1311 (Allelic depth, FORMAT),
1312 .B INFO/AD
1313 (Total allelic depth, INFO),
1314 .B ADF
1315 (Allelic depths on the forward strand, FORMAT),
1316 .B INFO/ADF
1317 (Total allelic depths on the forward strand, INFO),
1318 .B ADR
1319 (Allelic depths on the reverse strand, FORMAT),
1320 .B INFO/ADR
1321 (Total allelic depths on the reverse strand, INFO),
1322 .B DP
1323 (Number of high-quality bases, FORMAT),
1324 .B DV
1325 (Deprecated in favor of AD; Number of high-quality non-reference bases, FORMAT),
1326 .B DPR
1327 (Deprecated in favor of AD; Number of high-quality bases for each observed allele, FORMAT),
1328 .B INFO/DPR
1329 (Number of high-quality bases for each observed allele, INFO),
1330 .B DP4
1331 (Deprecated in favor of ADF and ADR; Number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases, FORMAT),
1332 .B SP
1333 (Phred-scaled strand bias P-value, FORMAT)
1334 [null]
1335 .TP
1336 .B -u,\ --uncompressed
1337 Generate uncompressed VCF/BCF output, which is preferred for piping.
1338 .TP
1339 .B -V
1340 Output per-sample number of non-reference reads [DEPRECATED - use
1341 .B -t DV
1342 instead]
1343 .PP
1344 .B Options for SNP/INDEL Genotype Likelihood Computation (for -g or -v):
1345 .TP 10
1346 .BI -e,\ --ext-prob \ INT
1347 Phred-scaled gap extension sequencing error probability. Reducing
1348 .I INT
1349 leads to longer indels. [20]
1350 .TP
1351 .BI -F,\ --gap-frac \ FLOAT
1352 Minimum fraction of gapped reads [0.002]
1353 .TP
1354 .BI -h,\ --tandem-qual \ INT
1355 Coefficient for modeling homopolymer errors. Given an
1356 .IR l -long
1357 homopolymer
1358 run, the sequencing error of an indel of size
1359 .I s
1360 is modeled as
1361 .IR INT * s / l .
1362 [100]
1363 .TP
1364 .B -I, --skip-indels
1365 Do not perform INDEL calling
1366 .TP
1367 .BI -L,\ --max-idepth \ INT
1368 Skip INDEL calling if the average per-input-file depth is above
1369 .IR INT .
1370 [250]
1371 .TP
1372 .BI -m,\ --min-ireads \ INT
1373 Minimum number of gapped reads for indel candidates.
1375 [1]
1376 .TP
1377 .BI -o,\ --open-prob \ INT
1378 Phred-scaled gap open sequencing error probability. Reducing
1379 .I INT
1380 leads to more indel calls. [40]
1381
1382 (The same short option is used for both
1383 .B --open-prob
1384 and
1385 .BR --output .
1386 When
1387 .BR -o 's
1388 argument contains only an optional + or - sign followed by the digits 0 to 9,
1389 it is interpreted as
1390 .BR --open-prob .)
1391 .TP
1392 .B -p, --per-sample-mF
1393 Apply
1394 .B -m
1395 and
1396 .B -F
1397 thresholds per sample to increase sensitivity of calling.
1398 By default both options are applied to reads pooled from all samples.
1399 .TP
1400 .BI -P,\ --platforms \ STR
1401 Comma-delimited list of platforms (determined by
1402 .BR @RG-PL )
1403 from which indel candidates are obtained. It is recommended to collect
1404 indel candidates from sequencing technologies that have low indel error
1405 rate such as ILLUMINA. [all]
1406 .RE
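.IP
Because
.B -o
serves both
.B --output
and
.BR --open-prob ,
the rule given above for telling them apart can be written down as a
short Python sketch. It restates the documented rule only; the
function name classify_o_argument is hypothetical and the code is not
taken from samtools.
.EX 4
```python
import re

# An optional + or - sign followed only by the digits 0 to 9.
NUMERIC = re.compile("[+-]?[0-9]+")


def classify_o_argument(arg):
    """Return which long option a -o argument would be read as."""
    if NUMERIC.fullmatch(arg):
        return "open-prob"
    return "output"


for arg in ["40", "+40", "out.vcf", "./123"]:
    print(arg, classify_o_argument(arg))
```
.EE
The "./123" case shows why prefixing a path component is enough to
force an all-digit filename to be treated as an output file.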
1407
1408 .TP \"-------- flags
1409 .B flags
1410 samtools flags INT|STR[,...]
1411
1412 Convert between textual and numeric flag representation.
1413
1414 .B FLAGS:
1415 .TS
1416 rb l l .
1417 0x1 PAIRED paired-end (or multiple-segment) sequencing technology
1418 0x2 PROPER_PAIR each segment properly aligned according to the aligner
1419 0x4 UNMAP segment unmapped
1420 0x8 MUNMAP next segment in the template unmapped
1421 0x10 REVERSE SEQ is reverse complemented
1422 0x20 MREVERSE SEQ of the next segment in the template is reverse complemented
1423 0x40 READ1 the first segment in the template
1424 0x80 READ2 the last segment in the template
1425 0x100 SECONDARY secondary alignment
1426 0x200 QCFAIL not passing quality controls
1427 0x400 DUP PCR or optical duplicate
1428 0x800 SUPPLEMENTARY supplementary alignment
1429 .TE
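.IP
The mapping in the table above is a plain bit field, which the
following Python sketch makes explicit. It is an illustration only and
does not reproduce the output layout of the flags command; the
function names are hypothetical.
.EX 4
```python
FLAGS = [
    (0x1, "PAIRED"),
    (0x2, "PROPER_PAIR"),
    (0x4, "UNMAP"),
    (0x8, "MUNMAP"),
    (0x10, "REVERSE"),
    (0x20, "MREVERSE"),
    (0x40, "READ1"),
    (0x80, "READ2"),
    (0x100, "SECONDARY"),
    (0x200, "QCFAIL"),
    (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]


def flags_to_names(value):
    """Return the comma-separated names of every bit set in value."""
    return ",".join(name for bit, name in FLAGS if value & bit)


def names_to_flags(text):
    """Return the numeric flag for a comma-separated list of names."""
    lookup = {name: bit for bit, name in FLAGS}
    value = 0
    for name in text.split(","):
        value |= lookup[name.strip().upper()]
    return value


print(flags_to_names(99))
print(hex(names_to_flags("PAIRED,PROPER_PAIR,REVERSE,READ2")))
```
.EE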
1430
1431 .TP \"-------- fastq fasta
1432 .B fastq/a
1433 samtools fastq
1434 .RI [ options ]
1435 .I in.bam
1436 .br
1437 samtools fasta
1438 .RI [ options ]
1439 .I in.bam
1440
1441 Converts a BAM or CRAM into either FASTQ or FASTA format depending on the
1442 command invoked. The FASTQ files will be automatically compressed if the
1443 filenames have a .gz or .bgzf extension.
1444
1445 .B OPTIONS:
1446 .RS
1447 .TP 8
1448 .B -n
1449 By default, either '/1' or '/2' is added to the end of read names
1450 where the corresponding BAM_READ1 or BAM_READ2 flag is set.
1451 Using
1452 .B -n
1453 causes read names to be left as they are.
1454 .TP 8
1455 .B -N
1456 Always add either '/1' or '/2' to the end of read names
1457 even when put into different files.
1458 .TP 8
1459 .B -O
1460 Use quality values from OQ tags in preference to standard quality string
1461 if available.
1462 .TP 8
1463 .B -s FILE
1464 Write singleton reads in FASTQ format to FILE instead of outputting them.
1465 .TP 8
1466 .B -t
1467 Copy RG, BC and QT tags to the FASTQ header line, if they exist.
1468 .TP 8
1469 .B -T TAGLIST
1470 Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist.
1471 .TP 8
1472 .B -1 FILE
1473 Write reads with the BAM_READ1 flag set to FILE instead of outputting them.
1474 .TP 8
1475 .B -2 FILE
1476 Write reads with the BAM_READ2 flag set to FILE instead of outputting them.
1477 .TP 8
1478 .B -0 FILE
1479 Write reads with both or neither of the BAM_READ1 and BAM_READ2 flags set
1480 to FILE instead of outputting them.
1481 .TP 8
1482 .BI "-f " INT
1483 Only output alignments with all bits set in
1484 .I INT
1485 present in the FLAG field.
1486 .I INT
1487 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
1488 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1489 .TP 8
1490 .BI "-F " INT
1491 Do not output alignments with any bits set in
1492 .I INT
1493 present in the FLAG field.
1494 .I INT
1495 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
1496 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1497 .TP 8
1498 .BI "-G " INT
1499 Only EXCLUDE reads with all of the bits set in
1500 .I INT
1501 present in the FLAG field.
1502 .I INT
1503 can be specified in hex by beginning with `0x' (i.e. /^0x[0-9A-F]+/)
1504 or in octal by beginning with `0' (i.e. /^0[0-7]+/) [0].
1505 .TP 8
1506 .B -i
1507 Add Illumina Casava 1.8 format entry to header (e.g. 1:N:0:ATCACG).
1508 .TP 8
1509 .B -c [0..9]
1510 Set compression level when writing gz or bgzf fastq files.
1511 .TP 8
1512 .B --i1 FILE
1513 Write first index reads to FILE.
1514 .TP 8
1515 .B --i2 FILE
1516 Write second index reads to FILE.
1517 .TP 8
1518 .B --barcode-tag TAG
1519 Aux tag to find index reads in [default: BC].
1520 .TP 8
1521 .B --quality-tag TAG
1522 Aux tag to find index quality in [default: QT].
1523 .TP 8
1524 .B --index-format STR
1525 String to describe how to parse the barcode and quality tags. For example:
1526
1527 .RS
1528 .TP 8
1529 .B i14i8
1530 the first 14 characters are index 1, the next 8 characters are index 2
1531 .TP 8
1532 .B n8i14
1533 ignore the first 8 characters, and use the next 14 characters for index 1
1534
1535 If the tag contains a separator, then the numeric part can be replaced with '*' to
1536 mean 'read until the separator or end of tag', for example:
1537 .TP 8
1538 .B n*i*
1539 ignore the left part of the tag until the separator, then use the second part
1540 .RE
1541 .RE
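.IP
The three flag filters
.BR -f ,
.B -F
and
.B -G
differ in whether all or any of the given bits must match. The Python
sketch below restates the descriptions above, together with the hex
and octal prefixes accepted for
.IR INT .
It is a hedged illustration, not samtools code, and the names
parse_flag_int and keep_read are hypothetical.
.EX 4
```python
def parse_flag_int(text):
    """Parse INT as hex (0x prefix), octal (leading 0) or decimal."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    if text.startswith("0") and len(text) > 1:
        return int(text, 8)
    return int(text)


def keep_read(flag, require=0, exclude_any=0, exclude_all=0):
    """Apply -f (require), -F (exclude_any) and -G (exclude_all)."""
    if (flag & require) != require:
        return False  # -f: every requested bit must be set
    if flag & exclude_any:
        return False  # -F: no listed bit may be set
    if exclude_all and (flag & exclude_all) == exclude_all:
        return False  # -G: dropped only when all listed bits are set
    return True


print(parse_flag_int("0x10"), parse_flag_int("010"), parse_flag_int("16"))
print(keep_read(147, exclude_any=0x10), keep_read(0x1, exclude_all=0x3))
```
.EE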
1542
1543 .TP \"-------- collate
1544 .B collate
1545 samtools collate
1546 .RI [ options ]
1547 .IR in.sam | in.bam | in.cram " [" out.prefix "]"
1548
1549 Shuffles and groups reads together by their names.
1550 A faster alternative to a full query name sort,
1551 .B collate
1552 ensures that reads of the same name are grouped together in contiguous groups,
1553 but doesn't make any guarantees about the order of read names between groups.
1554
1555 The output from this command should be suitable for any operation that
1556 requires all reads from the same template to be grouped together.
1557
1558 .B Options:
1559 .RS
1560 .TP 8
1561 .B -O
1562 Output to stdout rather than to files starting with out.prefix
1563 .TP
1564 .B -u
1565 Write uncompressed BAM output
1566 .TP
1567 .BI "-l " INT
1568 Compression level.
1569 [1]
1570 .TP
1571 .BI "-n " INT
1572 Number of temporary files to use.
1573 [64]
1574 .RE
1575
1576 .TP \"-------- reheader
1577 .B reheader
1578 samtools reheader
1579 .RB [ -iP ]
1580 .I in.header.sam in.bam
1581
1582 Replace the header in
1583 .I in.bam
1584 with the header in
1585 .IR in.header.sam .
1586 This command is much faster than replacing the header with a
1587 BAM\(->SAM\(->BAM conversion.
1588
1589 By default this command outputs the BAM or CRAM file to standard
1590 output (stdout), but for CRAM format files it has the option to
1591 perform an in-place edit, both reading and writing to the same file.
1592 No validity checking is performed on the header, nor that it is suitable
1593 to use with the sequence data itself.
1594
1595 .B OPTIONS:
1596 .RS
1597 .TP 8
1598 .B -P, --no-PG
1599 Do not generate an @PG header line.
1600 .TP 8
1601 .B -i, --in-place
1602 Perform the header edit in-place, if possible. This only works on CRAM
1603 files and only if there is sufficient room to store the new header.
1604 The amount of space available will differ for each CRAM file.
1605 .RE
1606
1607 .TP \"-------- cat
1608 .B cat
1609 samtools cat [-b list] [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]
1610
1611 Concatenate BAMs or CRAMs. Although this works on either BAM or CRAM,
1612 all input files must be the same format as each other. The sequence
1613 dictionary of each input file must be identical, although this command
1614 does not check this. This command uses a similar trick to
1615 .B reheader
1616 which enables fast BAM concatenation.
1617
1618 .B OPTIONS:
1619 .RS
1620 .TP 8
1621 .BI "-b " FOFN
1622 Read the list of input BAM or CRAM files from \fIFOFN\fR. These are
1623 concatenated prior to any files specified on the command line.
1624 Multiple \fB-b\fR \fIFOFN\fR options may be specified to concatenate
1625 multiple lists of BAM/CRAM files.
1626 .TP 8
1627 .BI "-h " FILE
1628 Uses the SAM header from \fIFILE\fR. By default the header is taken
1629 from the first file to be concatenated.
1630 .TP 8
1631 .BI "-o " FILE
1632 Write the concatenated output to \fIFILE\fR. By default this is sent
1633 to stdout.
1634 .RE
1635
1636 .TP \"-------- rmdup
1637 .B rmdup
1638 samtools rmdup [-sS] <input.srt.bam> <out.bam>
1639
1640 Remove potential PCR duplicates: if multiple read pairs have identical
1641 external coordinates, only retain the pair with highest mapping quality.
1642 In the paired-end mode, this command
1643 .B ONLY
1644 works with FR orientation and requires that ISIZE is correctly set. It does
1645 not work for unpaired reads (e.g. two ends mapped to different
1646 chromosomes or orphan reads).
1647
1648 .B OPTIONS:
1649 .RS
1650 .TP 8
1651 .B -s
1652 Remove duplicates for single-end reads. By default, the command works for
1653 paired-end reads only.
1654 .TP 8
1655 .B -S
1656 Treat paired-end reads as single-end reads.
1657 .RE
1658
1659 .TP \"-------- addreplacerg
1660 .B addreplacerg
1661 samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l level] [-o out.bam]
1662 <input.bam>
1663
1664 Adds or replaces read group tags in a file.
1665
1666 .B OPTIONS:
1667 .RS
1668 .TP 8
1669 .BI "-r " STRING
1670 Allows you to specify a read group line to append to the header and applies it
1671 to the reads specified by the -m option. If repeated it automatically adds in
1672 tabs between invocations.
1673 .TP 8
1674 .BI "-R " STRING
1675 Allows you to specify the read group ID of an existing @RG line and applies it
1676 to the reads specified.
1677 .TP 8
1678 .BI "-m " MODE
1679 If you choose orphan_only then existing RG tags are not overwritten; if you choose
1680 overwrite_all, existing RG tags are overwritten. The default is overwrite_all.
1681 .TP 8
1682 .BI "-o " STRING
1683 Write the final output to STRING. The default is to write to stdout.
1684
1685 By default, samtools tries to select a format based on the output
1686 filename extension; if output is to standard output or no format can be
1687 deduced,
1688 .B bam
1689 is selected.
1690 .RE
1691
1692 .TP \"-------- calmd
1693 .B calmd
1694 samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>
1695
1696 Generate the MD tag. If the MD tag is already present, this command will
1697 give a warning if the MD tag generated is different from the existing
1698 tag. Output SAM by default.
1699
1700 Calmd can also read and write CRAM files although in most cases it is
1701 pointless as CRAM recalculates MD and NM tags on the fly. The one
1702 exception to this case is where both input and output CRAM files
1703 have been / are being created with the \fIno_ref\fR option.
1704
1705 .B OPTIONS:
1706 .RS
1707 .TP 8
1708 .B -A
1709 When used jointly with
1710 .B -r
1711 this option overwrites the original base quality.
1712 .TP 8
1713 .B -e
1714 Convert the read base to = if it is identical to the aligned reference
1715 base. Indel caller does not support the = bases at the moment.
1716 .TP
1717 .B -u
1718 Output uncompressed BAM
1719 .TP
1720 .B -b
1721 Output compressed BAM
1722 .TP
1723 .BI -C \ INT
1724 Coefficient to cap mapping quality of poorly mapped reads. See the
1725 .B mpileup
1726 command for details. [0]
1727 .TP
1728 .B -r
1729 Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
1730 .TP
1731 .B -E
1732 Extended BAQ calculation. This option trades specificity for sensitivity, though the
1733 effect is minor.
1734 .RE
1735
1736 .TP \"-------- targetcut
1737 .B targetcut
1738 samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>
1739
1740 This command identifies target regions by examining the continuity of read depth, computes
1741 haploid consensus sequences of targets and outputs a SAM with each sequence corresponding
1742 to a target. When option
1743 .B -f
1744 is in use, BAQ will be applied. This command is
1745 .B only
1746 designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
1747
1748 .TP \"-------- phase
1749 .B phase
1750 samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>
1751
1752 Call and phase heterozygous SNPs.
1753
1754 .B OPTIONS:
1755 .RS
1756 .TP 8
1757 .B -A
1758 Drop reads with ambiguous phase.
1759 .TP 8
1760 .BI -b \ STR
1761 Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file
1762 .BR STR .0.bam
1763 and phase-1 reads in
1764 .BR STR .1.bam.
1765 Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads
1766 with switch errors will be saved in
1767 .BR STR .chimeric.bam.
1768 [null]
1769 .TP
1770 .B -F
1771 Do not attempt to fix chimeric reads.
1772 .TP
1773 .BI -k \ INT
1774 Maximum length for local phasing. [13]
1775 .TP
1776 .BI -q \ INT
1777 Minimum Phred-scaled LOD to call a heterozygote. [40]
1778 .TP
1779 .BI -Q \ INT
1780 Minimum base quality to be used in het calling. [13]
1781 .RE
1782
1783 .TP \"-------- depad
1784 .B depad
1785 samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>
1786
1787 Converts a BAM aligned against a padded reference to a BAM aligned
1788 against the depadded reference. The padded reference may contain
1789 verbatim "*" bases in it, but "*" bases are also counted in the
1790 reference numbering. This means that a sequence base-call aligned
1791 against a reference "*" is considered to be a cigar match ("M" or "X")
1792 operator (if the base-call is "A", "C", "G" or "T"). After depadding
1793 the reference "*" bases are deleted and such aligned sequence
1794 base-calls become insertions. Similar transformations apply for
1795 deletions and padding cigar operations.
1796
1797 .B OPTIONS:
1798 .RS
1799 .TP
1800 .B -S
1801 Ignored for compatibility with previous samtools versions.
1802 Previously this option was required if input was in SAM format, but now the
1803 correct format is automatically detected by examining the first few characters
1804 of input.
1805 .TP
1806 .B -s
1807 Output in SAM format. The default is BAM.
1808 .TP
1809 .B -C
1810 Output in CRAM format. The default is BAM.
1811 .TP
1812 .B -u
1813 Do not compress the output. Applies to either BAM or CRAM output
1814 format.
1815 .TP
1816 .B -1
1817 Enable fastest compression level. Only works for BAM or CRAM output.
1818 .TP
1819 .BI "-T " FILE
1820 Provides the padded reference file. Note that without this the @SQ
1821 line lengths will be incorrect, so for most use cases this option will
1822 be considered as mandatory.
1823 .TP
1824 .BI "-o " FILE
1825 Specifies the output filename. By default output is sent to stdout.
1826 .RE
1827
1828 .TP \"-------- help etc
1829 .BR help ,\ --help
1830 Display a brief usage message listing the samtools commands available.
1831 If the name of a command is also given, e.g.,
1832 .BR samtools\ help\ view ,
1833 the detailed usage message for that particular command is displayed.
1834
1835 .TP
1836 .B --version
1837 Display the version numbers and copyright information for samtools and
1838 the important libraries used by samtools.
1839
1840 .TP
1841 .B --version-only
1842 Display the full samtools version number in a machine-readable format.
1843 .PP
1844 .SH GLOBAL OPTIONS
1845 .PP
1846 Several long-options are shared between multiple samtools subcommands:
1847 \fB--input-fmt\fR, \fB--input-fmt-options\fR, \fB--output-fmt\fR,
1848 \fB--output-fmt-options\fR, and \fB--reference\fR.
1849 The input format is typically auto-detected so specifying the format
1850 is usually unnecessary and the option is included for completeness.
1851 Note that not all subcommands have all options. Consult the subcommand
1852 help for more details.
1853 .PP
1854 Format strings recognised are "sam", "bam" and "cram". They may be
1855 followed by a comma separated list of options as \fIkey\fR or
1856 \fIkey\fR=\fIvalue\fR. See below for examples.
1857 .PP
1858 The \fBfmt-options\fR arguments accept either a single \fIoption\fR or
1859 \fIoption\fR=\fIvalue\fR. Note that some options only work on some
1860 file formats and only on read or write streams. If value is
1861 unspecified for a boolean option, the value is assumed to be 1. The
1862 valid options are as follows.
1863 .RS 0
1864 .\" General purpose
1865 .TP 4
1866 .BI nthreads= INT
1867 Specifies the number of threads to use during encoding and/or
1868 decoding. For BAM this will be encoding only. In CRAM the threads
1869 are dynamically shared between encoder and decoder.
1870 .\" CRAM specific
1871 .TP
1872 .BI reference= fasta_file
1873 Specifies a FASTA reference file for use in CRAM encoding or decoding.
1874 It usually is not required for decoding except in the situation of the
1875 MD5 not being obtainable via the REF_PATH or REF_CACHE environment variables.
1876 .TP
1877 .BI decode_md= 0|1
1878 CRAM input only; defaults to 1 (on). CRAM does not typically store
1879 MD and NM tags, preferring to generate them on the fly. This option
1880 controls this behaviour.
1881 .TP
1882 .BI ignore_md5= 0|1
1883 CRAM input only; defaults to 0 (off). When enabled, md5 checksum
1884 errors on the reference sequence and block checksum errors within CRAM
1885 are ignored. Use of this option is strongly discouraged.
1886 .TP
1887 .BI required_fields= bit-field
1888 CRAM input only; specifies which SAM columns need to be populated.
1889 By default all fields are used. Limiting the decode to specific
1890 columns can have significant performance gains. The bit-field is a
1891 numerical value constructed from the following table.
1892 .TS
1893 center;
1894 rb l .
1895 0x1 SAM_QNAME
1896 0x2 SAM_FLAG
1897 0x4 SAM_RNAME
1898 0x8 SAM_POS
1899 0x10 SAM_MAPQ
1900 0x20 SAM_CIGAR
1901 0x40 SAM_RNEXT
1902 0x80 SAM_PNEXT
1903 0x100 SAM_TLEN
1904 0x200 SAM_SEQ
1905 0x400 SAM_QUAL
1906 0x800 SAM_AUX
1907 0x1000 SAM_RGAUX
1908 .TE
1909 .TP
1910 .BI name_prefix= string
1911 CRAM input only; defaults to output filename. Any sequences with
1912 auto-generated read names will use \fIstring\fR as the name prefix.
1913 .TP
1914 .BI multi_seq_per_slice= 0|1
1915 CRAM output only; defaults to 0 (off). By default CRAM generates one
1916 container per reference sequence, except in the case of many small
1917 references (such as a fragmented assembly).
1918 .TP
1919 .BI version= major.minor
1920 CRAM output only. Specifies the CRAM version number. Acceptable
1921 values are "2.1" and "3.0".
1922 .TP
1923 .BI seqs_per_slice= INT
1924 CRAM output only; defaults to 10000.
1925 .TP
1926 .BI slices_per_container= INT
1927 CRAM output only; defaults to 1. The effect of having multiple slices
1928 per container is to share the compression header block between
1929 multiple slices. This is unlikely to have any significant impact
1930 unless the number of sequences per slice is reduced. (Together these
1931 two options control the granularity of random access.)
1932 .TP
1933 .BI embed_ref= 0|1
1934 CRAM output only; defaults to 0 (off). If 1, this will store portions
1935 of the reference sequence in each slice, permitting decode without
1936 requiring an external copy of the reference sequence.
1937 .TP
1938 .BI no_ref= 0|1
1939 CRAM output only; defaults to 0 (off). If 1, sequences will be stored
1940 verbatim with no reference encoding. This can be useful if no
1941 reference is available for the file.
1942 .TP
1943 .BI use_bzip2= 0|1
1944 CRAM output only; defaults to 0 (off). Permits use of bzip2 in CRAM
1945 block compression.
1946 .TP
1947 .BI use_lzma= 0|1
1948 CRAM output only; defaults to 0 (off). Permits use of lzma in CRAM
1949 block compression.
1950 .TP
1951 .BI lossy_names= 0|1
1952 CRAM output only; defaults to 0 (off). If 1, templates with all
1953 members within the same CRAM slice will have their read names
1954 removed. New names will be automatically generated during decoding.
1955 Also see the \fBname_prefix\fR option.
1956 .RE
1957 .PP
1958 For example:
1959 .EX 4
1960 samtools view --input-fmt-option decode_md=0 \e
1961 --output-fmt cram,version=3.0 --output-fmt-option embed_ref \e
1962 --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
1963 .EE
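.PP
The numerical value for
.B required_fields
is the bitwise OR of the entries in the table given for that option.
The Python sketch below shows one way to compute it; the helper name
required_fields_value is hypothetical and the choice of columns is
only an example.
.EX 4
```python
SAM_FIELDS = {
    "QNAME": 0x1,
    "FLAG": 0x2,
    "RNAME": 0x4,
    "POS": 0x8,
    "MAPQ": 0x10,
    "CIGAR": 0x20,
    "RNEXT": 0x40,
    "PNEXT": 0x80,
    "TLEN": 0x100,
    "SEQ": 0x200,
    "QUAL": 0x400,
    "AUX": 0x800,
    "RGAUX": 0x1000,
}


def required_fields_value(*names):
    """OR together the bits for the named SAM columns."""
    value = 0
    for name in names:
        value |= SAM_FIELDS[name]
    return value


wanted = required_fields_value("FLAG", "RNAME", "POS")
print(wanted, hex(wanted))
```
.EE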
1964 .PP
1965 .SH REFERENCE SEQUENCES
1966 .PP
1967 The CRAM format requires use of a reference sequence for both reading
1968 and writing.
1969 .PP
1970 When reading a CRAM the \fB@SQ\fR headers are interrogated to identify
1971 the reference sequence MD5sum (\fBM5:\fR tag) and the local reference
1972 sequence filename (\fBUR:\fR tag). Note that \fIhttp://\fR and
1973 \fIftp://\fR based URLs in the UR: field are not used, but local fasta
1974 filenames (with or without \fIfile://\fR) can be used.
1975 .PP
1976 To create a CRAM the \fB@SQ\fR headers will also be read to identify
1977 the reference sequences, but M5: and UR: tags may not be present. In
1978 this case the \fB-T\fR and \fB-t\fR options of samtools view may be
1979 used to specify the fasta or fasta.fai filenames respectively
1980 (provided the .fasta.fai file is also backed up by a .fasta file).
1981 .PP
1982 The search order to obtain a reference is:
1983 .IP 1. 3
1984 Use any local file specified by the command line options (e.g. -T).
1985 .IP 2. 3
1986 Look for the MD5 via the REF_CACHE environment variable.
1987 .IP 3. 3
1988 Look for MD5 in each element of the REF_PATH environment variable.
1989 .IP 4. 3
1990 Look for a local file listed in the UR: header tag.
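The search order above can be exercised with a small shell setup. This is a minimal sketch, not a prescribed configuration: the cache directory and the file names aln.cram, aln.bam and ref.fa are hypothetical, and the conversion is only attempted if samtools and both input files are actually present.

```shell
# Hypothetical cache location; any writable directory would do.
cache_dir="${HOME:-/tmp}/hts-ref-cache"

# Step 2 of the search order: a local cache keyed by MD5sum.
export REF_CACHE="$cache_dir/%2s/%2s/%s"

# Step 3: try the local cache layout first, then the EBI server.
export REF_PATH="$cache_dir/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"

# Step 1 takes priority: a reference given with -T is used directly.
# Guarded so that the sketch is harmless when the inputs are missing.
if command -v samtools >/dev/null 2>&1 && [ -f aln.cram ] && [ -f ref.fa ]; then
    samtools view -b -T ref.fa -o aln.bam aln.cram
fi
```

Omitting -T would leave the lookup to REF_CACHE, REF_PATH and finally the UR: header tag, in that order.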
1991 .PP
1992 .SH ENVIRONMENT VARIABLES
1993 .PP
1994 .TP
1995 .B HTS_PATH
1996 A colon-separated list of directories in which to search for HTSlib plugins.
1997 If $HTS_PATH starts or ends with a colon or contains a double colon (\fB::\fP),
1998 the built-in list of directories is searched at that point in the search.
1999
2000 If no HTS_PATH variable is defined, the built-in list of directories
2001 specified when HTSlib was built is used, which typically includes
2002 \fB/usr/local/libexec/htslib\fP and similar directories.
2003
2004 .TP
2005 .B REF_PATH
2006 A colon-separated (semicolon-separated on Windows) list of locations in which
2007 to look for sequences identified by their MD5sums. This can be either
2008 a list of directories or URLs. Note that if a URL is included then the
2009 colon in http:// and ftp:// and the optional port number will be
2010 treated as part of the URL and not a PATH field separator.
2011 For URLs, the text \fB%s\fR will be replaced by the MD5sum being
2012 read.
2013
2014 If no REF_PATH has been specified it will default to
2015 \fBhttp://www.ebi.ac.uk/ena/cram/md5/%s\fR and, if REF_CACHE is also unset,
2016 REF_CACHE will be set to \fB$XDG_CACHE_HOME/hts-ref/%2s/%2s/%s\fR.
2017 If \fB$XDG_CACHE_HOME\fR is unset, \fB$HOME/.cache\fR (or a local system
2018 temporary directory if no home directory is found) will be used similarly.
2019
2020 .TP
2021 .B REF_CACHE
2022 This can be defined to a single directory housing a local cache of
2023 references. Upon downloading a reference it will be stored in the
2024 location pointed to by REF_CACHE. When reading a reference it will be
2025 looked for in this directory before searching REF_PATH. To avoid many
2026 files being stored in the same directory, a pathname may be
2027 constructed using %\fInum\fRs and %s notation, consuming \fInum\fR
2028 characters of the MD5sum. For example
2029 \fB/local/ref_cache/%2s/%2s/%s\fR will create 2 nested subdirectories
2030 with the filenames in the deepest directory being the last 28
2031 characters of the md5sum.
2032
2033 The REF_CACHE directory will be searched before attempting to load
2034 via the REF_PATH search list. If no REF_PATH is defined, both
2035 REF_PATH and REF_CACHE will be automatically set (see above), but if
2036 REF_PATH is defined and REF_CACHE is not, then no local cache is used.
2037
2038 To aid population of the REF_CACHE directory a script
2039 \fBmisc/seq_cache_populate.pl\fR is provided in the Samtools
2040 distribution. This takes a fasta file or a directory of fasta files
2041 and generates the MD5sum named files.
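The %\fInum\fRs and %s notation described above can be illustrated with a small POSIX shell function. This is only a sketch of the notation as documented here, not HTSlib's own code: the function name expand_ref_cache is hypothetical, the MD5sum in the usage line is a made-up value, and \fInum\fR is assumed to be a positive integer.

```shell
# expand_ref_cache TEMPLATE MD5SUM
# Hypothetical helper: expand %Ns (consume N characters of the MD5sum)
# and %s (consume whatever is left) in a REF_CACHE-style template.
expand_ref_cache() {
    template=$1
    rest=$2
    out=
    while [ -n "$template" ]; do
        case $template in
            %s*)
                # %s: append the remainder of the MD5sum.
                out=$out$rest
                rest=
                template=${template#%s}
                ;;
            %[0-9]*)
                spec=${template#%}
                n=${spec%%[!0-9]*}      # leading digits, e.g. "2"
                spec=${spec#"$n"}
                case $spec in
                    s*)
                        # %Ns: append the next N characters of the MD5sum.
                        take=$(printf '%s' "$rest" | cut -c "1-$n")
                        out=$out$take
                        rest=${rest#"$take"}
                        template=${spec#s}
                        ;;
                    *)
                        # Not a %Ns sequence: keep the '%' literally.
                        out=$out%
                        template=${template#%}
                        ;;
                esac
                ;;
            *)
                # Ordinary character: copy it through unchanged.
                first=${template%"${template#?}"}
                out=$out$first
                template=${template#?}
                ;;
        esac
    done
    printf '%s\n' "$out"
}

expand_ref_cache '/local/ref_cache/%2s/%2s/%s' 0123456789abcdef0123456789abcdef
```

With this template the first two pairs of characters become nested directory names and the remaining 28 characters become the file name, matching the layout described for REF_CACHE.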
2042 .PP
2043 .SH EXAMPLES
2044 .IP o 2
2045 Import SAM to BAM when
2046 .B @SQ
2047 lines are present in the header:
2048 .EX 2
2049 samtools view -bS aln.sam > aln.bam
2050 .EE
2051 If
2052 .B @SQ
2053 lines are absent:
2054 .EX 2
2055 samtools faidx ref.fa
2056 samtools view -bt ref.fa.fai aln.sam > aln.bam
2057 .EE
2058 where
2059 .I ref.fa.fai
2060 is generated automatically by the
2061 .B faidx
2062 command.
2063
2064 .IP o 2
2065 Convert a BAM file to a CRAM file using a local reference sequence.
2066 .EX 2
2067 samtools view -C -T ref.fa aln.bam > aln.cram
2068 .EE
2069 .IP o 2
2070 Attach the
2071 .B RG
2072 tag while merging sorted alignments:
2073 .EX 2
2074 perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt
2075 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
2076 .EE
2077 The value in a
2078 .B RG
2079 tag is determined by the name of the file the read comes from. In this
2080 example, in
2081 .IR merged.bam ,
2082 reads from
2083 .I ga.bam
2084 will be tagged with
2085 .IR RG:Z:ga ,
2086 while reads from
2087 .I 454.bam
2088 will be tagged with
2089 .IR RG:Z:454 .
2090
2091 .IP o 2
2092 Call SNPs and short INDELs:
2093 .EX 2
2094 samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
2095 bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf
2096 .EE
2097 The
2098 .B bcftools filter
2099 command marks low quality sites and sites with the read depth exceeding
2100 a limit, which should be adjusted to about twice the average read depth
2101 (bigger read depths usually indicate problematic regions which are
2102 often enriched for artefacts). Consider adding
2103 .B -C50
2104 to
2105 .B mpileup
2106 if mapping quality is overestimated for reads containing excessive
2107 mismatches. Applying this option usually helps
2108 .B BWA-short
2109 but may not help other mappers.
2110
2111 Individuals are identified from the
2112 .B SM
2113 tags in the
2114 .B @RG
2115 header lines. Individuals can be pooled in one alignment file; one
2116 individual can also be separated into multiple files. The
2117 .B -P
2118 option specifies that indel candidates should be collected only from
2119 read groups with the
2120 .B @RG-PL
2121 tag set to
2122 .IR ILLUMINA .
2123 Collecting indel candidates from reads sequenced by an indel-prone
2124 technology may affect the performance of indel calling.
2125
2126 .IP o 2
2127 Generate the consensus sequence for one diploid individual:
2128 .EX 2
2129 samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
2130 .EE
2131 .IP o 2
2132 Phase one individual:
2133 .EX 2
2134 samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
2135 .EE
2136 The
2137 .B calmd
2138 command is used to reduce false heterozygotes around INDELs.
2139
2140
2141 .IP o 2
2142 Dump BAQ-applied alignments for other SNP callers:
2143 .EX 2
2144 samtools calmd -bAr aln.bam ref.fa > aln.baq.bam
2145 .EE
2146 It adds and corrects the
2147 .B NM
2148 and
2149 .B MD
2150 tags at the same time. The
2151 .B calmd
2152 command also comes with the
2153 .B -C
2154 option, the same as the one in
2155 .B pileup
2156 and
2157 .BR mpileup .
2158 Apply if it helps.
2159
2160 .SH LIMITATIONS
2161 .PP
2162 .IP o 2
2163 Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
2164 .IP o 2
2165 Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan
2166 reads or ends mapped to different chromosomes). If this is a concern,
2167 please use Picard's MarkDuplicates, which handles these cases correctly,
2168 although it is a little slower.
2169
2170 .SH AUTHOR
2171 .PP
2172 Heng Li from the Sanger Institute wrote the original C version of samtools.
2173 Bob Handsaker from the Broad Institute implemented the BGZF library.
2174 James Bonfield from the Sanger Institute developed the CRAM implementation.
2175 John Marshall and Petr Danecek contribute to the source code and various
2176 people from the 1000 Genomes Project have contributed to the SAM format
2177 specification.
2178
2179 .SH SEE ALSO
2180 .IR bcftools (1),
2181 .IR sam (5),
2182 .IR tabix (1)
2183 .PP
2184 Samtools website: <http://www.htslib.org/>
2185 .br
2186 File format specification of SAM/BAM, CRAM, VCF/BCF: <http://samtools.github.io/hts-specs>
2187 .br
2188 Samtools latest source: <https://github.com/samtools/samtools>
2189 .br
2190 HTSlib latest source: <https://github.com/samtools/htslib>
2191 .br
2192 Bcftools website: <http://samtools.github.io/bcftools>