Hi,

The algorithms are very simple. The key set is very small (when compared

to the sequence data). We map ALL kmers whether they appear in the

sequence or not. There is a 2-bit encoding:

A = 00

C = 01

G = 10

T = 11

Then each kmer is a 2*k bit 'integer'. Once k is defined you process each

kmer by looping over the K bases, assign the 2 bits for each bases and

bit-shift 2 to the left. So for example if k=12 you use 24 bits.

There are (|A|)^k permuatations for an alphabet A, so for DNA there are

4^k permutations, for k=12 there are 4^12 = 16777216 keys. The hashtable

consists of a head, with one entry for each kmer, and a body.

head[i] points to an entry in body, so that body[head[i]] is the first

occurrance of kmer i in the subject sequence, body[head[i]+1] is the next

occurrance etc, until you reach body[head[i+1]] which is the first

occurrance of the next kmer.

To process a query you loop over the (overlapping) kmers, once you have

calculated i for a kmer, body[head[i]] ... body[head[i+1]-1] are all the

locations of that kmer in the subject.

Let me know if you need any clarification.

Cheers,

Adam.

On Mon, 28 Nov 2005, greenexile4568 wrote:

> Hi,

> Is there a pseudocode representation of the SSAHA algorithm ?

>

> I'm particulary interested in learning what approaches were taken to

> implement the various key steps in the algorithm. For example did you

> use a decrease and conquer approach when generating the possible kmer

> permutations..

>

> Be grateful for any information you can provide

> Thanks

> Brian Wilson

>

>

>

>

>

>

>

> Yahoo! Groups Links

>

>

>

>

>

>

>

--

Dr Adam Spargo

High Performance Assembly Group email: aws@...

Wellcome Trust Sanger Institute Tel: +44 (0)1223 834244 x7728

Hinxton, Cambridge CB10 1SA Fax: +44 (0)1223 494919