
QLearning

    From: Ivan F. Villanueva B.
    Date: Sep 9, 2006
      Dear soul-mates,

      I would like to comment on some issues with the Q-learning algorithm in the book.

      (1)
      First of all, I think there is a mistake in the pseudo-code: in the update of
      Q[a,s], r' should be used instead of r. Otherwise the rewards of the final
      states will never be taken into account.
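
      To make this concrete, here is a minimal Python sketch of the backup as I
      would write it; the names (Q, Nsa, alpha, gamma, q_update) are mine, not
      the book's code:

          from collections import defaultdict

          Q = defaultdict(float)   # Q[(s, a)] -> current estimate, 0 by default
          Nsa = defaultdict(int)   # Nsa[(s, a)] -> visit count, 0 by default
                                   # (this also covers point (3) below)

          def alpha(n):
              # The book's decaying learning rate, 60 / (59 + n); see point (6).
              return 60.0 / (59.0 + n)

          def q_update(s, a, s1, r1, actions, gamma=1.0):
              # One backup using the *current* reward r1 (the book's r'), so the
              # rewards of terminal states propagate into Q.
              Nsa[(s, a)] += 1
              best_next = max(Q[(s1, a1)] for a1 in actions)
              Q[(s, a)] += alpha(Nsa[(s, a)]) * (r1 + gamma * best_next - Q[(s, a)])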

      (2)
      If my point (1) is correct, then the static variable r is not needed in the
      code.

      (3)
      Nsa should be initialized to 0 for all state-action pairs.

      (4)
      The maximising a' should be chosen at random if there is more than one
      maximum (ties broken among them).
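
      A tie-breaking argmax could be as simple as this (again a Python sketch,
      names mine):

          import random

          def argmax_random_tie(actions, key):
              # Return an action with maximal key(action), ties broken at random.
              best = max(key(a1) for a1 in actions)
              return random.choice([a1 for a1 in actions if key(a1) == best])

      called e.g. as argmax_random_tie(actions, key=lambda a1: f(Q.get((s1, a1)), Nsa[(s1, a1)])).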

      (5)
      f(q,n) must return a value even when q is null, i.e. when the agent has no
      idea of the value of Q(a', s'). In the book this works because f(q,n) returns
      2 for the first 5 tries, regardless of the value of Q(a', s') (explained on
      page 774 of the International Edition).
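
      A defensive version of the exploration function could then look like this
      (Python sketch; R_PLUS and N_E stand for the R+ = 2 and Ne = 5 of the
      book's example):

          R_PLUS = 2   # optimistic estimate of the best possible reward
          N_E = 5      # try each state-action pair at least this many times

          def f(q, n):
              # Keep exploring while the pair is undertried, and fall back to
              # the optimistic R_PLUS while q is still unknown (None) instead
              # of failing.
              if n < N_E or q is None:
                  return R_PLUS
              return q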

      (6)
      I have been playing with the Q-learning algorithm (after the modification
      described in (1)) and the simple MDP world example. I have checked it with
      two different parameter sets. The one described in the book:
      - reward of non-terminal states: -0.02 [r]
      - applies the optimistic value 2 [rp] to actions tried less than 5 [en] times
      - the learning rate [rl] is 60 / (60 - 1 + iteration), i.e. the parameter
        rl is here 60
      - the number of trials is 2000 [tn]
      and this second set:
      rl = 5 ;   // learning rate parameter
      en = 100 ; // exploration number
      rp = 2 ;   // optimistic value of unknown state-action pairs
      tn = 300 ; // number of trials
      r = -0.05  // reward of non-terminal states
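
      In both sets the learning rate is the same function of the parameter rl
      and the visit count; in Python it would be something like:

          def alpha(rl, n):
              # Decaying learning rate rl / (rl - 1 + n): with rl = 60 this is
              # the book's 60 / (59 + n), with rl = 5 it decays much faster.
              return float(rl) / (rl - 1 + n)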

      I have computed the q-values of the four actions in state (3,3) at each
      iteration. With the first set (see attached graph1.png for the values of a
      representative experiment), the final values are wrong in most of the
      experiments. With the second set, the values are correct in all experiments
      I have performed so far (see attached graph2.png for an example).

      Regards,
      --
      Ivan F. Villanueva B.
      A.I. library: http://www.artificialidea.com
      <<< The European Patent Litigation Agreement (EPLA) >>>
      <<< will bring Software patents by the backdoor >>>
      <<< http://www.no-lobbyists-as-such.com/florian-mueller-blog/epla/ >>>
      <<< http://wiki.ffii.de/EplaEn >>>