Re: [ssahausers] SSAHA Psuedocode documentation
The algorithms are very simple. The key set is very small (when compared
to the sequence data). We map ALL kmers whether they appear in the
sequence or not. There is a 2-bit encoding:
A = 00
C = 01
G = 10
T = 11
Then each kmer is a 2*k bit 'integer'. Once k is defined you process each
kmer by looping over the K bases, assign the 2 bits for each bases and
bit-shift 2 to the left. So for example if k=12 you use 24 bits.
There are (|A|)^k permuatations for an alphabet A, so for DNA there are
4^k permutations, for k=12 there are 4^12 = 16777216 keys. The hashtable
consists of a head, with one entry for each kmer, and a body.
head[i] points to an entry in body, so that body[head[i]] is the first
occurrance of kmer i in the subject sequence, body[head[i]+1] is the next
occurrance etc, until you reach body[head[i+1]] which is the first
occurrance of the next kmer.
To process a query you loop over the (overlapping) kmers, once you have
calculated i for a kmer, body[head[i]] ... body[head[i+1]-1] are all the
locations of that kmer in the subject.
Let me know if you need any clarification.
On Mon, 28 Nov 2005, greenexile4568 wrote:
> Is there a pseudocode representation of the SSAHA algorithm ?
> I'm particulary interested in learning what approaches were taken to
> implement the various key steps in the algorithm. For example did you
> use a decrease and conquer approach when generating the possible kmer
> Be grateful for any information you can provide
> Brian Wilson
> Yahoo! Groups Links
Dr Adam Spargo
High Performance Assembly Group email: aws@...
Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728
Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919