
reduction of row numbers

  • Christian Setzkorn
    Message 1 of 8, Jan 26, 2012

      I am using GP/EAs to induce models from data (e.g. classifiers).

       

      I am just wondering whether anyone uses (sampling?) methods to reduce the number of rows of the training data (I am not looking for feature selection methods).

       

      One idea would be to cluster the original data and sample from the cluster prototypes whilst adding some noise (e.g. using self-organising maps).
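
      Roughly what I have in mind, as a minimal sketch (using
      scikit-learn's KMeans in place of a SOM for brevity; the function
      name, noise scale and prototype count are all just placeholders):

          import numpy as np
          from sklearn.cluster import KMeans

          def reduce_rows(X, n_prototypes=100, n_samples=500,
                          noise_scale=0.05, seed=0):
              """Cluster the rows, then sample new rows around the
              prototypes. A SOM could be used instead of KMeans;
              noise_scale is relative to each feature's spread and
              needs tuning. For classification it is safer to do
              this separately per class so labels stay meaningful."""
              rng = np.random.default_rng(seed)
              km = KMeans(n_clusters=n_prototypes, n_init=10,
                          random_state=seed).fit(X)
              protos = km.cluster_centers_
              # draw prototypes in proportion to their cluster sizes
              _, counts = np.unique(km.labels_, return_counts=True)
              picks = rng.choice(len(protos), size=n_samples,
                                 p=counts / counts.sum())
              noise = rng.normal(0.0, noise_scale * X.std(axis=0),
                                 (n_samples, X.shape[1]))
              return protos[picks] + noise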

       

      Any feedback would be very much appreciated. Thanks.

       

      Christian

    • Julian Miller
      Message 2 of 8, Jan 26, 2012
        Hi Christian

        You could adapt the co-evolution approach of Schmidt and Lipson:

        M. D. Schmidt and H. Lipson, "Coevolution of Fitness Predictors",
        IEEE Transactions on Evolutionary Computation, Vol. 12, No. 6,
        pp. 736-749, 2008.

        downloadable here

        http://creativemachines.cornell.edu/papers/TEC08_Schmidt.pdf
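
        A heavily simplified sketch of the flavour of the idea, a small
        population of training-case subsets coevolving with the
        solutions (the real method keeps a small archive of "trainer"
        solutions so the full data set is touched only occasionally;
        everything below, toy data and parameters included, is
        illustrative):

            import numpy as np

            rng = np.random.default_rng(0)

            # toy regression data standing in for the real training set
            X = rng.uniform(-1.0, 1.0, 1000)
            y = 3.0 * X + 1.0 + rng.normal(0.0, 0.1, 1000)

            def loss(w, idx):
                """Mean squared error of linear model w on rows idx."""
                return float(np.mean((w[0] * X[idx] + w[1] - y[idx]) ** 2))

            SUBSET = 8                                  # rows per predictor
            solutions = rng.normal(0.0, 1.0, (20, 2))   # (slope, bias) pairs
            predictors = rng.integers(0, len(X), (10, SUBSET))

            for gen in range(50):
                # Score predictors by how well subset loss tracks the
                # full-data loss on a few "trainer" solutions.
                trainers = solutions[:4]
                full = np.array([loss(w, slice(None)) for w in trainers])
                err = [float(np.mean(np.abs(
                           np.array([loss(w, p) for w in trainers]) - full)))
                       for p in predictors]
                best = predictors[int(np.argmin(err))]

                # Evolve solutions using only the cheap predictor fitness.
                fit = np.array([loss(w, best) for w in solutions])
                parents = solutions[np.argsort(fit)[:10]]
                solutions = np.vstack([parents,
                                       parents + rng.normal(0.0, 0.1,
                                                            parents.shape)])

                # Mutate predictors: randomly replace rows of each subset.
                mask = rng.random(predictors.shape) < 0.2
                predictors = np.where(mask,
                                      rng.integers(0, len(X),
                                                   predictors.shape),
                                      predictors)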

        Cheers

        Julian



        *********************************************************************
        Dr. Julian F. Miller
        Department of Electronics
        University of York
        Heslington, YO10 5DD
        UK
        Email: jfm7@...
        Tel: +44 (0)1904 322383
        Fax: +44 (0)1904 322335
        http://www.elec.york.ac.uk/staff/jfm7.html
        Times Higher Education University of the Year 2010
        ********************************************************************
        First book on Cartesian Genetic Programming is published

        http://www.springer.com/computer/theoretical+computer+science/book/978-3-642-17309-7

        Genetic Programming and Evolvable Machines
        http://www.wkap.nl/journalhome.htm/1389-2576
        IEEE Transactions on Evolutionary Computation
        http://ieee-cis.org/pubs/tec/
        Evolutionary Computation
        http://www.mitpressjournals.org/loi/evco

        "The best way to have a good idea is to have lots of ideas" L. Pauling
        "If at first, the idea is not absurd, then there is no hope for it".
        A. Einstein
        ********************************************************************
      • Alexandre Devert
        Message 3 of 8, Jan 26, 2012
          Hi,
          A computationally efficient way to do that is to build a decision tree with a
          deterministic heuristic, like C4.5 or its more modern derivatives. If some variables
          do not appear in the tree, then those variables are very likely to be irrelevant
          to the classification problem.
           
          Dr. Devert Alexandre
          Lecturer at School Of Software Engineering of USTC 
          166 Renai Road, Dushu Lake Higher Education Town Suzhou Industrial Park, 
          Suzhou, Jiangsu, People's Republic of China


        • Alexandre Devert
          Message 4 of 8, Jan 26, 2012
            Hi,
              Whoops, indeed, answering while in "deadline" mode is not the best thing to do ^^
            Self-organising maps would be a nice way. A computationally cheaper, more or less
            equivalent way is to do K-means/K-medoids with a lot of clusters. With enough
            clusters (happy fiddling on that point :D), it would give a roughly equivalent
            dataset, but with a lower density.
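
            Something like this rough sketch, say (scikit-learn's KMeans,
            then snapping each centre to its nearest real row, medoid-style,
            so the reduced set keeps genuine rows and labels; the cluster
            count is just a placeholder):

                import numpy as np
                from sklearn.cluster import KMeans
                from sklearn.metrics import pairwise_distances_argmin

                def kmeans_condense(X, y, n_clusters=200, seed=0):
                    """Return one representative real row (and its label)
                    per cluster. For classification it may be safer to
                    condense each class separately."""
                    km = KMeans(n_clusters=n_clusters, n_init=10,
                                random_state=seed).fit(X)
                    # nearest real row to each centroid -> medoid-like pick
                    idx = pairwise_distances_argmin(km.cluster_centers_, X)
                    return X[idx], y[idx]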

            Dr. Devert Alexandre
            Lecturer at School Of Software Engineering of USTC 
            166 Renai Road, Dushu Lake Higher Education Town Suzhou Industrial Park, 
            Suzhou, Jiangsu, People's Republic of China

            From: Christian Setzkorn <christian@...>
            To: 'Alexandre Devert' <marmakoide@...>
            Sent: Friday, 27 January 2012, 00:35
            Subject: RE: [GP] reduction of row numbers

            Thanks. Not sure whether/how this is relevant, as I said that I am not looking into feature selection ...

            C
             


          • Bill LANGDON
              Message 5 of 8, Jan 30, 2012
              > I am just wondering whether anyone uses (sampling?) methods to
              > reduce the number of rows of the training data (I am not looking for
              > feature selection methods).

              I am not sure why you want to do this?
              To save time? If so...

              Most of GP run time is spent on fitness evaluation, but the
              only point of fitness evaluation is to decide who gets to have
              children. I.e. all fitness information is reduced to one bit
              (child/no child). Most selection schemes also inject some noise.

              In tournament selection all we need to know is which is the best
              potential parent. Indeed, we need not be exactly sure who is
              best; knowing approximately who is the better can be sufficient,
              provided that when we approximate fitness we do not add some
              unpleasant bias. By approximating fitness we can often reduce
              fitness computation time, e.g. by randomly selecting a small
              number of training cases for use in the tournament.
              Two schemes I have tried (both sketched below) are
              1) using the same sample for everyone in the tournament and then
              resampling for the next tournament, and
              2) using the same sample for the current generation and resampling
              at the start of the next generation.
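
              A minimal sketch of scheme 1 (the individual interface and
              the parameter values are made up for illustration):

                  import random

                  def subset_fitness(individual, cases):
                      # placeholder fitness: total error on the sampled
                      # cases; individual.error(case) is an assumed interface
                      return sum(individual.error(c) for c in cases)

                  def tournament(population, all_cases,
                                 tour_size=4, n_cases=32):
                      """Scheme 1: one fresh random sample of cases per
                      tournament, shared by every contestant in it."""
                      cases = random.sample(all_cases, n_cases)
                      contestants = random.sample(population, tour_size)
                      return min(contestants,
                                 key=lambda ind: subset_fitness(ind, cases))

                  # Scheme 2 differs only in where the sample is drawn:
                  # pick `cases` once per generation and pass that same
                  # sample to every tournament in the generation.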

              Sampling tends to reduce selection pressure, leading to longer
              evolution times (in terms of number of generations), but this
              can still be a major win if each fitness evaluation is faster.
              In EuroGP-2010, GP was happily finding complete solutions when
              run on less than 1 in a million of the test cases.
              http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/langdon_2010_eurogp.html


              Bill

              Dr. W. B. Langdon,
              Department of Computer Science,
              University College London
              Gower Street, London WC1E 6BT, UK
              http://www.cs.ucl.ac.uk/staff/W.Langdon/

              CIGPU 2012 http://www.cs.ucl.ac.uk/staff/W.Langdon/cigpu
              EvoPAR 2012 http://www.cs.ucl.ac.uk/staff/W.Langdon/evopar
              RNAnet http://bioinformatics.essex.ac.uk/users/wlangdon/rnanet/
              A Field Guide to Genetic Programming
              http://www.gp-field-guide.org.uk/
              GP EM http://www.springer.com/10710
              GP Bibliography http://www.cs.bham.ac.uk/~wbl/biblio/
            • Christian Setzkorn
                Message 6 of 8, Jan 30, 2012
                Yes thanks. I will try some n-fold stratified cross-validation approach in the near future.

                C
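
                For what it is worth, a minimal sketch of the mechanics with
                scikit-learn's StratifiedKFold (the toy data is a stand-in
                for the real training set):

                    import numpy as np
                    from sklearn.model_selection import StratifiedKFold

                    # toy stand-in for the real training data
                    rng = np.random.default_rng(0)
                    X = rng.normal(size=(1000, 5))
                    y = rng.integers(0, 2, 1000)

                    skf = StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=0)
                    for train_idx, test_idx in skf.split(X, y):
                        # each fold preserves the class proportions of the
                        # full data; one fold's rows could serve as a much
                        # smaller training set for a GP run
                        X_small, y_small = X[test_idx], y[test_idx]
                        break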



              • Bill LANGDON
                  Message 7 of 8, Jan 30, 2012
                  > Yes thanks. I will try some n-fold stratified cross-validation
                  > approach in the near future.

                  Ok.
                  I was trying to say that perhaps you do not need to be sophisticated;
                  a simple tiny random sample may be all that is needed.
                  Bill
                • Bill LANGDON
                    Message 8 of 8, Jan 31, 2012
                    Sure, although the coevolution literature may not explicitly
                    talk about training cases, a lot of papers effectively have a
                    second population whose purpose is to test the first
                    population. There is some discussion of using reduced fitness
                    in "Genetic Programming and Data Structures"
                    http://www.cs.ucl.ac.uk/staff/W.Langdon/gpdata/

                    In software engineering there is great interest in how to
                    order regression tests so that bugs are detected earlier (Shin Yoo).
                    Better to have the test suite report your error whilst you are still
                    in the office rather than at the end of an overnight run.

                    You could look at Sid Maxwell's idea of only partially testing
                    the parents, so you run them only enough to be sure which is
                    the better; that is, not always running them to the end.
                    http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/icec94_maxwell.html
                    Chris Gathercole's LEF/DSS is used commercially
                    http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/ga94aGathercole.html
                    Alden Tackett
                    http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/Tackett_1994_broodGP.html
                    used Lee Altenberg's soft brood selection, which aims to speed
                    evolution by creating several children but only letting a
                    fraction grow to maturity. (I.e. only a fraction of them are
                    fully tested; the runts suffer less testing and so consume
                    fewer resources.)
                    Dave Andre and Astro Teller had a nice look at using
                    statistical tests to decide where the balance point lies
                    http://www.cs.bham.ac.uk/~wbl/biblio/gp-html/Teller_1997_acnfc.html
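
                    A rough sketch of that kind of early-stopping comparison
                    (here using a Hoeffding bound as the statistical test; the
                    error interface, delta and min_cases are illustrative):

                        import math
                        import random

                        def race(err_a, err_b, cases, delta=0.05,
                                 min_cases=10):
                            """Compare two individuals on as few training
                            cases as possible. err_a and err_b map a case to
                            an error in [0, 1]; stop as soon as a Hoeffding
                            bound says the mean difference is reliably
                            non-zero. Returns the apparent winner and the
                            number of cases used."""
                            cases = list(cases)
                            random.shuffle(cases)   # avoid any ordering bias
                            diff_sum = 0.0          # running sum of a - b
                            for n, c in enumerate(cases, 1):
                                diff_sum += err_a(c) - err_b(c)
                                # two-sided Hoeffding bound for the mean of
                                # n values in [-1, 1] at confidence 1 - delta
                                bound = math.sqrt(
                                    2.0 * math.log(2.0 / delta) / n)
                                if n >= min_cases and abs(diff_sum / n) > bound:
                                    return ("a" if diff_sum < 0 else "b"), n
                            return ("a" if diff_sum <= 0 else "b"), len(cases)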

                    Remember fitness testing is usually the most expensive part
                    of GP, but typically we reduce all the information gained to
                    one bit (child/no child) and even inject noise into that bit.
                    There is no need to be fair (everyone dies), but be careful
                    of biased sampling.
                    You can get a lot of mileage from a simple scheme which
                    repeatedly randomly selects a subset of tests to use rather
                    than using them all. In some cases evolution can still
                    succeed even when only one test case (of thousands) is used,
                    provided you keep randomly changing which test.

                    Bill