Hi,

On Fri, 8 Oct 2004, Toon Knapen wrote:

> Hi all,
>
> The ublas dev meeting is scheduled from the 25th to the 27th of October.
> Currently only 4 people have confirmed (Michael Stevens, Gunther Winkler,
> Karl Meerbergen and Toon Knapen). If you would like to join us, please
> confirm quickly.

That week is unfortunately looking quite busy, but I think I will be
able to make it, at least for one or two days. I might need to be back in
Aachen on the 27th.

If you like, I can prepare a talk on my current project, an application
whose data structures are well fitted by nested sparse and/or dense
matrices. Although it doesn't use uBLAS, some of the techniques might be
of interest, including BLAS integration, interfaces, generics, etc., and
especially what works well versus what I would do differently if I could.

Cheers,
Ian

Peter Schmitteckert wrote:

> Salut,
>
> On Fri, 8 Oct 2004, Ian McCulloch wrote:
>
>> A * conj(B) is part of BLAS, although not for all possible variants of
>> row
>
> As far as I understand it, A * conj(B) is not part of BLAS; there
> is only 'N', 'T', and 'C' (Hermitian conjugate).
> One can simulate a conjugated BLAS if one is willing to interchange
> column- versus row-major storage:
>
> C^T = ( A * conj(B) )^T = B^H * A^T,
>
> but then
>
> C^T = ( A^H * B )^T = B^T * A^*
>
> fails, which is annoying, at least for complex DMRG.

Yes, of the 16 possible variants of normal/transposed/conjugated/Hermitian
for the two right-hand parameters, only 9 are natively supported. Another
6 can be handled at the cost of an in-place conjugation of the result
matrix, and the last variant requires an in-place transpose. I don't know
how large the crossover size is at which these operations become
efficient; probably quite small (especially for the conjugation).

>> column storage of A and B. A<double> * B<std::complex<double>> is not,
>> but except for small matrices it would be faster to construct a temporary
>> matrix of std::complex and call zgemm.
>
> But a BLAS implementation wouldn't need this extra copy, and there is
> an operation overhead of a factor of 2, since the imaginary part of
> the real matrix is zero, plus unneeded load/store memory operations for
> transferring all these unneeded zeroes.
>
>> Well, I am far from an expert on this, but IMHO that would be either a
>> huge duplication of effort (to reimplement ATLAS in C++) or a maintenance
>> nightmare (to import the existing C source as-is and keep it in sync with
>> future versions of ATLAS). Better would be to encourage ATLAS to
>> implement some BLAS extensions for things like complex*real multiply.
>> These are already implemented in reference form in the extended BLAS
>> (http://crd.lbl.gov/~xiaoye/XBLAS/), but these are a few years old now
>> and I never heard of any follow-up plan.
>
> I hoped that there would be a way to incorporate the ATLAS kernels
> as building blocks, but my only ATLAS knowledge comes from
> looking at a few ATLAS web pages.
> I guess that for small matrices there could be a huge performance win
> due to avoided overhead, and increased flexibility.

ATLAS defines a block size which is used internally. For large enough
matrices, they are copied into a blocked format which is more
cache-efficient. A nice idea for a high-performance library would be to
use this already-optimized ATLAS format for temporary matrices. But ATLAS
doesn't document this format, and they probably wouldn't want to either.
My suspicion is that the performance win from saving a copy is too small
to bother with. But maybe I'm wrong... does anyone here know any of the
ATLAS developers?

Cheers,

Ian