Often, when you’re working with huge files (such as the SAM/BAM format used for genomic sequencing data), you’re only interested in the records within the file that have certain characteristics. The core idea of this example (using named pipes) is broadly applicable, but I’ll focus on the SAM/BAM genomic data format. There are lots of ways to view and manipulate SAM records, but samtools is one of the fastest and most robust tools for doing so.
I see a lot of examples of people using samtools to create a new BAM file by filtering on the various available flags, before later accessing those records programmatically via a high-level library. This is certainly a better solution than accessing each record individually, but there’s an even better way: named pipes.
Named pipes are an incredibly useful, underrated feature of Linux and other POSIX systems. Instead of writing your filtered data to a separate file, you create a special file that is connected to a running process and streams that process’s output on demand, without taking up space on your volume. In this example, using named pipes in combination with samtools allows you to efficiently filter your data using flags without unnecessarily using extra disk space or memory.
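To make the idea concrete before bringing samtools into it, here is a minimal sketch of a named pipe on its own. It assumes a POSIX system with a standard shell; the `printf` command is only a stand-in for any process that writes output, and the pipe name `demo_pipe` is made up for illustration.

```python
import os
import shlex
import stat
import subprocess
import tempfile

# Create the FIFO in a throwaway directory so the demo cleans up after itself.
workdir = tempfile.mkdtemp()
fifo_path = os.path.join(workdir, "demo_pipe")  # hypothetical name
os.mkfifo(fifo_path)

# The entry on disk is a FIFO, not a regular file.
is_fifo = stat.S_ISFIFO(os.stat(fifo_path).st_mode)

# Opening a FIFO for writing blocks until a reader opens the other end,
# so the writer has to run in a separate process.
command = f"printf 'read1\\nread2\\n' > {shlex.quote(fifo_path)}"
writer = subprocess.Popen(command, shell=True)

# Reading from the FIFO looks exactly like reading from a regular file.
with open(fifo_path) as fifo:
    lines = fifo.read().splitlines()

writer.wait()
os.remove(fifo_path)
os.rmdir(workdir)
print(is_fifo, lines)
```

The data passes from the writer to the reader through the kernel, so nothing is written into the pipe's directory entry along the way.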
I posted this solution on Biostars, but I was engaging in thread necromancy so I thought I’d repeat the advice here, with a little more detail about why it’s a good solution. Here’s an example using named pipes with a samtools command line call and pysam:
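import os
import subprocess
import pysam
# let's say we want only mapped reads from a BAM
os.mkfifo('bampipe')  # create the named pipe before anything writes to it
command = 'samtools view -hb -F 4 test_data.bam > bampipe'
p = subprocess.Popen(command, shell=True)  # runs in the background, feeding the pipe
s = pysam.AlignmentFile('bampipe', 'rb')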
The named pipe itself has to exist before samtools writes to it; here it is created with the Python os.mkfifo() function. Most POSIX-friendly languages will have their own library or command, or you can use mkfifo on the command line. The pipe actually shows up when you run ls. It’s usually yellow in color and, in a long listing (ls -l), has the p file type as the first character of its permissions:
prw-r--r-- 1 ddorset ddorset 0 Apr 24 21:16 bampipe
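Because the pipe is a real entry in the filesystem, it stays around after the script finishes unless you remove it. One way to handle that is to wrap the create/run/clean-up steps in a context manager. The sketch below is only one possible arrangement: `piped_command` is a hypothetical helper (not part of pysam or samtools), the five-second timeout is an arbitrary choice, and a `printf` command stands in for samtools so the example runs anywhere with a POSIX shell.

```python
import contextlib
import os
import shlex
import shutil
import subprocess
import tempfile


@contextlib.contextmanager
def piped_command(command):
    """Run `command` with its stdout redirected into a temporary FIFO.

    Yields the FIFO path. On exit, waits for the process to finish and
    removes the pipe. Hypothetical helper, for illustration only.
    """
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "pipe")
    os.mkfifo(fifo_path)
    proc = subprocess.Popen(f"{command} > {shlex.quote(fifo_path)}", shell=True)
    try:
        yield fifo_path
    finally:
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # Nothing ever opened the pipe for reading, so the writer is stuck.
            proc.terminate()
            proc.wait()
        shutil.rmtree(workdir)


# Generic stand-in for the samtools call.
with piped_command("printf 'a\\nb\\n'") as path:
    with open(path) as handle:
        records = handle.read().splitlines()

print(records)

# With samtools and pysam installed, the same pattern would look like:
# with piped_command("samtools view -hb -F 4 test_data.bam") as path:
#     with pysam.AlignmentFile(path, "rb") as bam:
#         for read in bam:
#             ...
```

Putting the pipe in a temporary directory also avoids the `FileExistsError` that os.mkfifo() raises when a pipe of the same name is left over from an earlier run.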
I’ve used this technique to great effect when cycling through several BAM files at once in the same script. Being able to access the same or multiple BAM files in different contexts while not using up extra memory or CPU is a huge help.