I have forwarded the first part of your query to Tony Cox, who wrote the
current SSAHA implementation. I hope that he will have a clearer idea
about the issue that you are seeing. My guess is that main memory
is exhausted after loading the first ~1500 sequences, forcing the program
to fall back on the slower 'swap' memory to complete the query. I would
suggest running your queries in batches of ~500, both so that system
resources are used efficiently and so that you can write the output file!
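The batching suggestion above could be scripted along these lines. This is a hypothetical helper, not part of the SSAHA distribution; the batch size of 500 is the one suggested above, and the function name is illustrative:

```python
# Split a multi-FASTA query file into batches so each SSAHA run
# fits comfortably in main memory. A hypothetical sketch; the
# function and batch size are illustrative, not SSAHA tooling.

def split_fasta(text, batch_size=500):
    """Return a list of FASTA strings, each holding <= batch_size records."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])   # start a new record at each header
        elif records:
            records[-1].append(line) # append sequence lines to the record
    batches = []
    for i in range(0, len(records), batch_size):
        chunk = records[i:i + batch_size]
        batches.append("\n".join("\n".join(r) for r in chunk) + "\n")
    return batches
```

Each batch file could then be passed to the ssaha binary in turn, e.g. from a shell loop, and the per-batch output files concatenated afterwards.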
The Ensembl web service uses a client-server SSAHA implementation; the
server loads the subject sequence into memory and then opens a
socket. The subject is persistent until the server is explicitly
killed. The client (called from the web server) is then used to run
queries on the server over TCP/IP. In this way, the server and client do
not need to run on the same physical machine. The code for this is in the
EnsemblServer dir of the SSAHA distribution.
Returning to your original problem, the client-server code may well be the
most efficient solution. As the subject sequence is loaded only once, you
can run many individual queries with no loss in performance.
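The persistent-subject pattern described above can be sketched as follows. To be clear, this is NOT SSAHA's actual wire protocol or the EnsemblServer code; the toy in-memory "subject", the substring lookup, and all names here are illustrative assumptions only:

```python
# Sketch of the persistent-subject pattern: a server loads the data
# once and answers queries over a TCP socket until it is killed.
# NOT SSAHA's real protocol; the toy lookup is illustrative only.
import socket
import threading

SUBJECT = {"chr1": "ACGTACGTTTGG"}  # stands in for the hashed subject

def serve(ready, port_holder):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))          # OS picks a free port
    port_holder.append(srv.getsockname()[1])
    srv.listen(1)
    ready.set()                          # subject loaded; accept queries
    conn, _ = srv.accept()
    query = conn.recv(1024).decode()
    hit = SUBJECT["chr1"].find(query)    # toy lookup in in-memory subject
    conn.sendall(str(hit).encode())
    conn.close()
    srv.close()

def query(port, seq):
    cli = socket.create_connection(("127.0.0.1", port))
    cli.sendall(seq.encode())
    result = cli.recv(1024).decode()
    cli.close()
    return result
```

Because the server keeps the subject in memory across connections, each client call pays only the cost of the query itself, which is the point being made above.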
Dr William Spooner whs@...
Ensembl Web Developer http://www.ensembl.org
On Sat, 26 Oct 2002, sennlu wrote:
> Dear Will Spooner,
> I'm a research assistant in Academia Sinica, Taiwan.
> In fact, I'm the second author of "Single Nucleotide
> Polymorphism Mapping Using Genome-Wide Unique
> Sequences" published on Genome Research 12:
> I must confess that we use a very similar principle to
> SSAHA. However, we are trying to make sequence
> mapping less memory-consuming so that we can do
> it with lighter equipment. Actually, we don't have 16 GB
> of memory even putting together all the memory modules
> in our lab.
> However, I downloaded SSAHA's binary and
> executed it on my Linux platform (AMD 1 GHz CPU,
> 512 MB RAM and 40 GB hard drive, kernel 2.4.18-5).
> I used human chromosome 1 (~235 Mbp) as the
> subject database and chromosome 1 SNPs in dbSNP
> (178,525 sequences corresponding to NCBI human
> genome build 28) as query sequences.
> I found SSAHA's behavior surprising. You
> may see the attached file (curve.jpg) for the correlation
> between running time and query sequence number.
> It runs extremely fast when query sequences are few, but
> slows down once the sequence number exceeds a threshold.
> Theoretically, the time spent should be proportional to the
> subject database size and the query sequence number.
> That is, if the subject hash file is unchanged, a linear
> correlation is expected to be observed. In addition, I
> didn't use all 178,525 sequences as the query file, because the
> strange phenomenon already appeared in a preliminary trial,
> and the output file was too large (>2 GB) for the Linux
> ext3 file system, so I had to discard it.
> The following is the command I use:
> time ./ssaha_v30_linux query.fas chr1 -qf fasta -sf hash -da 0 -pf >
> /dev/null &
> By the way, the default hash word length seems to be
> 12-mer rather than 10-mer. Using the "top" command, I found
> that memory usage is only about 160 MB, so I don't think
> swap memory is being used. I'm not sure what is wrong
> with my setup. Can you give me some advice?
> And another question: I'm very curious about how you
> built your internet service. I mean, what is the
> framework for keeping a large data block in physical memory
> and connecting it to outside web query requests? Presumably
> you don't load the hash file once per query. A job running on a
> local machine is quite different from a web service. Would you
> please give me some hints? Thanks for your consideration.
> Best Regards,
> Szu-Hsien Lu