- On 24/8/04 4:20, Nitin Muttil wrote:

> I am noticing that when the depth of the GP tree is increased, the model
> performs better in training, but worsens for the test data. I assume
> this is because of overfitting, something similar to overfitting in
> neural networks, when hidden layers/nodes are increased.
>
> 1) Is this actually overfitting?

It sounds like it, Nitin, but you would need to look at the fits themselves
to see whether it is.

> If so, is there an optimal GP equation size, or does it have to be fixed
> by trial and error?

It is highly data dependent, in my experience. One person's overfitting can
be another person's invaluable information! This is no different from any
other regression/forecasting problem, for which there is very extensive
literature.

> 2) Can I get pointers to studies to find optimal values of GP parameters
> like equation size, population size, crossover and mutation rate, etc.?

In different applications, yes - see the GP bibliography, etc. But you are
the judge as to whether any of that will apply to your data.

Assuming that your training dataset is noisy, I would suggest that your
next step is to produce a noise-free training dataset; you may need to
construct this by hand. Try training on that and then testing on the
regular noisy data. Then try varying tree depth and complexity, population
size, etc., to see what effects those have. You will need not only to look
at fitness and prediction error, but actually to look at plots of the data
and the fits. If your data contains a lot of outliers that you do not want
the model to fit, then you may find it better to use an absolute deviation
in the fitness function rather than the classical squared deviation, which
of course tends to weight towards outliers. These issues and others have
been covered well in the GP and statistical literature.
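
For concreteness, here is a minimal sketch of the two deviation measures
in plain Python (the function names and the predictions/targets lists are
illustrative, not from any particular GP package; plug whichever measure
suits your data into your system's fitness evaluation):

    def squared_error_fitness(predictions, targets):
        # Classical squared deviation: large residuals (outliers)
        # dominate the sum, pulling the fit towards them.
        return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

    def absolute_error_fitness(predictions, targets):
        # Absolute deviation: each residual contributes linearly,
        # so a few outliers distort the fit far less.
        return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)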

I wish you success,

Regards,

Howard.

Dr Howard Oakley

The Works columnist for MacUser magazine (UK)

http://www.macuser.co.uk/

http://www.howardoakley.com/

- Hi Nitin,

Interesting problem. Yes, it seems like overfitting. Have you considered
pruning or even ensembles? You could even put a penalty term in the
fitness function which penalises trees if they are very deep.
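
A minimal sketch of such a penalty, assuming a hypothetical parsimony
coefficient alpha and target depth that you would tune for your problem:

    def penalised_fitness(raw_error, depth, alpha=0.05, target_depth=8):
        # Error plus a cost for every level beyond the target depth,
        # so a deeper tree is only preferred when it cuts error by
        # more than the penalty it incurs.
        excess = max(0, depth - target_depth)
        return raw_error + alpha * excess

Hard depth limits and other parsimony-pressure variants are also covered
in the GP literature if a soft penalty proves too blunt.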

Cheers,

Arjun

--- In genetic_programming@yahoogroups.com, "Nitin Muttil"
<nitin.muttil@n...> wrote:

> Dear GP list,
>
> I have been trying GP for harmful algal bloom (HAB) predictions. To
> explain what a HAB is in brief, it is an explosive growth of algae in
> coastal waters, caused by the dumping of pollutants into those waters.
> HABs can be toxic and thus may harm aquatic life and in some cases
> even humans.
>
> I am evolving GP models using a training dataset and then testing
> the models on the unseen test data. I am noticing that when the depth
> of the GP tree is increased, the model performs better in training,
> but worsens for the test data. I assume this is because of
> overfitting, something similar to overfitting in neural networks,
> when hidden layers/nodes are increased.
>
> My questions are:
>
> 1) Is this actually overfitting? If so, is there an optimal GP
> equation size, or does it have to be fixed by trial and error?
>
> 2) Can I get pointers to studies to find optimal values of GP
> parameters like equation size, population size, crossover and
> mutation rate, etc.?
>
> Thanks very much and any help would be highly appreciated.
>
> Best regards,
> Nitin

- Another thing you can do is have about 15 different test datasets.

Then use a randomly selected test set to evaluate the GPs against. That
way no evolved model can overfit a single fixed test set. REMEMBER: NOT
EVERY CREATURE TRAVELS IN THE SAME SHOES.
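
A minimal sketch of that rotation in plain Python (assuming the held-out
examples arrive as one list; the split count of 15 follows the suggestion
above):

    import random

    def make_test_sets(examples, n_sets=15, seed=1):
        # Deal the held-out examples into n_sets roughly equal subsets.
        rng = random.Random(seed)
        shuffled = list(examples)
        rng.shuffle(shuffled)
        return [shuffled[i::n_sets] for i in range(n_sets)]

    def pick_test_set(test_sets, rng=random):
        # Draw a different subset for each evaluation, so no evolved
        # model can specialise on one fixed test set.
        return rng.choice(test_sets)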
