# A sourmash tutorial [sourmash](http://sourmash.readthedocs.io/en/latest/) is our lab's implementation of an ultra-fast lightweight approach to nucleotide-level search and comparison, called MinHash. You can read some background about MinHash sketches in this paper: [Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x) ## Installing sourmash To install sourmash, run: ``` pip install https://github.com/dib-lab/sourmash/archive/2017-ucsc-metagenome.zip ``` (Note, we are installing from [a development branch](https://github.com/dib-lab/sourmash/pull/188); many of the features below are not part of an official sourmash release yet. They should be included in sourmash 2.0.) ## Fingerprint reads Use case: how much do two (or more!) unassembled metagenomes resemble each other? Compute a scaled MinHash fingerprint from our reads: ``` mkdir ~/sourmash cd ~/sourmash sourmash compute --scaled 10000 ~/mapping/SRR*.pe.fq -k 21,31 ``` Now, compare the two files at k=21: ``` sourmash compare SRR*.sig -k 21 ``` or k=31: ``` sourmash compare SRR*.sig -k 31 ``` ## Compare reads to assemblies Use case: how much of the read content is contained in the assembly? Fingerprint the assembly: ``` sourmash compute --scaled 10000 -k 21,31 ~/mapping/subset_assembly.fa ``` and now evaluate *containment*, that is, what fraction of the read content is contained in the assembly: ``` sourmash search -k 21 SRR1976948.abundtrim.subset.pe.fq.sig \ subset_assembly.fa.sig --containment ``` and you should see: ``` 1 matches; showing 3: /home/titus/mapping/subset_assembly.fa 0.573 subset_assembly.fa.sig ``` Try the reverse - why is it bigger? ``` sourmash search -k 21 subset_assembly.fa.sig \ SRR1976948.abundtrim.subset.pe.fq.sig --threshold=0.0 --containment ``` what do you get if you do this with the other set of reads? ## Compare many samples Adjust plotting (this is a bug in sourmash :) -- ``` echo 'backend : Agg' > matplotlibrc ``` Do a comparison: ``` sourmash compare SRR*.sig subset*.sig -o comparison ``` and then plot: ``` sourmash plot --pdf comparison ``` which will produce a file `comparison.matrix.pdf` and `comparison.dendro.pdf` that you can grab view your Jupyter Notebook console. ## What's in my metagenome? Download and unpack the k=21 RefSeq index described in [CTB's blog post](http://ivory.idyll.org/blog/2016-sourmash-sbt-more.html): ``` curl -O http://spacegraphcats.ucdavis.edu.s3.amazonaws.com/microbe-sbt-k21-2016-11-27.tar.gz tar xzf microbe-sbt-k21-2016-11-27.tar.gz ``` This produces a file `microbes.sbt.json` and a whole bunch of hidden files in the directory `.sbt.microbes`. This is an index of about 60,000 microbial genomes from RefSeq. Next, run the 'gather' command to see what's in there -- ``` sourmash sbt_gather -k 21 microbes subset_assembly.fa.sig ``` and you should get: ``` Final composition (sorted by percent of original query): p_query p_genome 2.4 14.3 NC_010003.1 Petrotoga mobilis SJ95, complete genome 0.5 1.6 NC_017934.1 Mesotoga prima MesG1.Ag.4.2, complete genome 3.0% (percent of query identified) ``` If you go to [the SRA information](https://www.ncbi.nlm.nih.gov/bioproject/PRJNA278302) for this project, you'll see that Petrotoga and Mesotoga are both in there - yay! Of course, in this case we're looking at largely unknown critters (3% at most is in genbank!) so we wouldn't expect many matches.