A sourmash tutorial

sourmash is our lab’s implementation of an ultra-fast lightweight approach to nucleotide-level search and comparison, called MinHash.

You can read some background about MinHash sketches in this paper: Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

Installing sourmash

To install sourmash, run:

pip install https://github.com/dib-lab/sourmash/archive/2017-ucsc-metagenome.zip

(Note, we are installing from a development branch; many of the features below are not part of an official sourmash release yet. They should be included in sourmash 2.0.)

Fingerprint reads

Use case: how much do two (or more!) unassembled metagenomes resemble each other?

Compute a scaled MinHash fingerprint from our reads:

mkdir ~/sourmash
cd ~/sourmash

sourmash compute --scaled 10000 ~/mapping/SRR*.pe.fq -k 21,31

Now, compare the two files at k=21:

sourmash compare SRR*.sig -k 21

or k=31:

sourmash compare SRR*.sig -k 31

Compare reads to assemblies

Use case: how much of the read content is contained in the assembly?

Fingerprint the assembly:

sourmash compute --scaled 10000 -k 21,31 ~/mapping/subset_assembly.fa

and now evaluate containment, that is, what fraction of the read content is contained in the assembly:

sourmash search -k 21 SRR1976948.abundtrim.subset.pe.fq.sig \
    subset_assembly.fa.sig  --containment

and you should see:

1 matches; showing 3:
         /home/titus/mapping/subset_assembly.fa          0.573   subset_assembly.fa.sig

Try the reverse - why is it bigger?

sourmash search -k 21 subset_assembly.fa.sig \
    SRR1976948.abundtrim.subset.pe.fq.sig --threshold=0.0 --containment

what do you get if you do this with the other set of reads?

Compare many samples

Adjust plotting (this is a bug in sourmash :) –

echo 'backend : Agg' > matplotlibrc

Do a comparison:

sourmash compare SRR*.sig subset*.sig -o comparison

and then plot:

sourmash plot --pdf comparison

which will produce a file comparison.matrix.pdf and comparison.dendro.pdf that you can grab view your Jupyter Notebook console.

What’s in my metagenome?

Download and unpack the k=21 RefSeq index described in CTB’s blog post:

curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/microbe-sbt-k21-2016-11-27.tar.gz
tar xzf microbe-sbt-k21-2016-11-27.tar.gz

This produces a file microbes.sbt.json and a whole bunch of hidden files in the directory .sbt.microbes. This is an index of about 60,000 microbial genomes from RefSeq.

Next, run the ‘gather’ command to see what’s in there –

sourmash sbt_gather -k 21 microbes subset_assembly.fa.sig

and you should get:

Final composition (sorted by percent of original query):

p_query p_genome
  2.4    14.3   NC_010003.1 Petrotoga mobilis SJ95, complete genome
  0.5     1.6   NC_017934.1 Mesotoga prima MesG1.Ag.4.2, complete genome
  3.0%          (percent of query identified)

If you go to the SRA information for this project, you’ll see that Petrotoga and Mesotoga are both in there - yay! Of course, in this case we’re looking at largely unknown critters (3% at most is in genbank!) so we wouldn’t expect many matches.


LICENSE: This documentation and all textual/graphic site content is released under Creative Commons - 0 (CC0) -- fork @ github.