Loading ...
Sorry, an error occurred while loading the content.
 

Algorithm for randomly choosing a question

Expand Messages
  • Omer Zak
    I am developing an Android application which is intended to help people prepare for the theory test for driving license. Basically, there is a database of N
    Message 1 of 7 , Jul 6, 2012
      I am developing an Android application which is intended to help people
      prepare for the theory test for driving license.

      Basically, there is a database of N questions, numbered from 1 until N
      without gaps.
      Each time the application chooses randomly a question and the user
      answers it.
      The application records whether the answer was correct or not and then
      selects the next question for display.

      I want the application to be able to select any question randomly, but
      with different weights, so that:
      - A question the user never saw has weight of (say) 0.5.
      - A question the user answered wrongly has weight of 1 (i.e. such a
      question is twice more likely to be presented than a question never
      seen).
      - A question the user answered correctly has weight of 0.005 (i.e. it is
      100 times less likely to be seen again than a question never seen
      before).

      A brute force approach is as follows (actual probabilities may differ
      from the given weights, due to the finite probability of choosing the
      same i again; I am ignoring this complication for now).

      1. Select a number randomly from range [1,N], call it i.
      2. If the user answered question i wrongly, it is chosen.
      3. If the user never saw question i, select randomly real x from range
      [0,1].
      If x <= 0.5, i is chosen.
      Otherwise, go to 1.
      4. If the user already answered question i correctly, select randomly
      real x from range [0,1].
      If x <= 0.005, i is chosen.
      Otherwise, go to 1.

      My wish is to have an algorithm which requires only bounded number of
      computations of random numbers (i.e. no loop) to accomplish the above.
      Assume that N is large, so keeping lists of questions in memory is ruled
      out.

      Any suggestions?

      --- Omer


      --
      Did you shave a yak today?
      My own blog is at http://www.zak.co.il/tddpirate/

      My opinions, as expressed in this E-mail message, are mine alone.
      They do not represent the official policy of any organization with which
      I may be affiliated in any way.
      WARNING TO SPAMMERS: at http://www.zak.co.il/spamwarning.html
    • Ori Idan
      ... You don t have to keep a list of questions, just a list of numbers where each number represent a question. A question you want to appear twice as much as
      Message 2 of 7 , Jul 6, 2012


        I want the application to be able to select any question randomly, but
        with different weights, so that:
        - A question the user never saw has weight of (say) 0.5.
        - A question the user answered wrongly has weight of 1 (i.e. such a
        question is twice more likely to be presented than a question never
        seen).
        - A question the user answered correctly has weight of 0.005 (i.e. it is
        100 times less likely to be seen again than a question never seen
        before).

        A brute force approach is as follows (actual probabilities may differ
        from the given weights, due to the finite probability of choosing the
        same i again; I am ignoring this complication for now).

        1. Select a number randomly from range [1,N], call it i.
        2. If the user answered question i wrongly, it is chosen.
        3. If the user never saw question i, select randomly real x from range
        [0,1].
        If x <= 0.5, i is chosen.
        Otherwise, go to 1.
        4. If the user already answered question i correctly, select randomly
        real x from range [0,1].
        If x <= 0.005, i is chosen.
        Otherwise, go to 1.

        My wish is to have an algorithm which requires only bounded number of
        computations of random numbers (i.e. no loop) to accomplish the above.
        Assume that N is large, so keeping lists of questions in memory is ruled
        out.

        You don't have to keep a list of questions, just a list of numbers where each number represent a question.
        A question you want to appear twice as much as any other, put it twice in the list.
        For a question you want to appear 100 times less then any other, use the algorithm you suggested.

        -- 
        Ori Idan

      • Omer Zak
        On Sat, 2012-07-07 at 08:20 +0300, Ori Idan wrote: [... description of the problem was snipped ...] ... Thanks for your interest. The proposed approach has the
        Message 3 of 7 , Jul 6, 2012
          On Sat, 2012-07-07 at 08:20 +0300, Ori Idan wrote:
          [... description of the problem was snipped ...]
          > My wish is to have an algorithm which requires only bounded
          > number of
          > computations of random numbers (i.e. no loop) to accomplish
          > the above.
          > Assume that N is large, so keeping lists of questions in
          > memory is ruled
          > out.
          >
          >
          >
          > You don't have to keep a list of questions, just a list of numbers
          > where each number represent a question.
          > A question you want to appear twice as much as any other, put it twice
          > in the list.
          > For a question you want to appear 100 times less then any other, use
          > the algorithm you suggested.

          Thanks for your interest.

          The proposed approach has the problem of memory consumption of order of
          O(N) (where N is the number of the questions). I'd like to see an
          algorithm whose memory consumption is closer to O(1) than to O(N).

          --- Omer


          --
          Bottom posters are filthy heretics and infidels and ought to be burned
          on the stake after having been tarred, feathered and having rotten eggs
          thrown at their faces!
          My own blog is at http://www.zak.co.il/tddpirate/

          My opinions, as expressed in this E-mail message, are mine alone.
          They do not represent the official policy of any organization with which
          I may be affiliated in any way.
          WARNING TO SPAMMERS: at http://www.zak.co.il/spamwarning.html
        • Amit Aronovitch
          ... sent before completed (whoops again)... I ment: you do not have to store weights of ALL the questions, just for the questions the user had seen (and you
          Message 4 of 7 , Jul 7, 2012


            On Sun, Jul 8, 2012 at 12:38 AM, Amit Aronovitch <aronovitch@...> wrote:


            On Sun, Jul 8, 2012 at 12:28 AM, Amit Aronovitch <aronovitch@...> wrote:

            On Sat, Jul 7, 2012 at 9:06 AM, Om er Zak <w1@...> wrote:
            On Sat, 2012-07-07 at 08:20 +0300, Ori Idan wrote:
            [... description of the problem was snipped ...]
            >         My wish is to have an algorithm which requires only bounded
            >         number of
            >         computations of random numbers (i.e. no loop) to accomplish
            >         the above.
            >         Assume that N is large, so keeping lists of questions in
            >         memory is ruled
            >         out.
            >

            (rather long, but nicely written and not hard to follow)
             

            Whoops, missed the "N is large" thing (so O(n) memory might be too high, even if you just keep weights...). Still, a nice read to share.
            I guess there's no escape from trading some memory for the time improvement...
             
            sent before completed (whoops again)...

            I ment: you do not have to store weights of ALL the questions, just for the questions the user had seen (and you have to store a list of these anyway). Since you know how many the user had already chosen, and the sum of their weights - first choose (using a single dart) whether you want a question from the seen set, or the unseen one (which is supposedly much larger). If the larger is chosen, choose uniformly. If the smaller (stored) set is selected, choose using the alias method or some simpler alternative.


          • Nadav Har'El
            ... Omer, here is an O(1) *RAM* algorithm. Note that you still have O(N) disk usage, of course - there are N questions, and they need to be saved somewhere,
            Message 5 of 7 , Jul 8, 2012
              On Sat, Jul 07, 2012, Omer Zak wrote about "Re: [hackers-il] Algorithm for randomly choosing a question":
              > The proposed approach has the problem of memory consumption of order of
              > O(N) (where N is the number of the questions). I'd like to see an
              > algorithm whose memory consumption is closer to O(1) than to O(N).

              Omer, here is an O(1) *RAM* algorithm. Note that you still have O(N)
              disk usage, of course - there are N questions, and they need to be saved
              somewhere, and so is the history of the success for them.
              If you drop the O(1) memory requirement (which I think is an overkill for
              any type of modern computer), you can make things simpler.

              Note - I assume the number of probability classes (in your example,
              1, 1/2 and 1/100, if I remember correctly) is constant (O(1)) and not
              O(N).

              Keep on disk a file for each probability class - one list of questions
              with normal probability (1), another list of questions with half
              probability, and a third list of questions with 1/100 probability.
              Each will be a list of numbers (indexing questions stored in a separate
              file).

              In *memory*, keep the number of questions in each class: n1, n2, n3,
              and the partial probability sums:
              p1 = n1*1
              p2 = n1*1 + n2*(1/2.)
              p3 = n1*1 + n2*(1/2.) + n3*(1/100.)

              Each time you want to draw a random question, pick a random number
              between 0 and p3. If it's between 0 and p1, you decided to take a
              question from class 1. If it's between p1 and p2, you decided to take
              a question from class 3. If it's between p2 and p3, then from class 3.

              You decide *which* question to take in the obvious way - e.g., if you
              got a random number p between p1 and p2, then the question number in the
              class 2 list is m=(p-p1)/(1/2). You then need to look at the m'th
              position in the on-disk list of class 2 questions, to get the actual
              index of the question.

              Finally, when the user answers you need to update your data. You need
              to remove the question from the previous class (to do this in O(1) time,
              just move the last element of the list into the newly formed hole),
              and to add it to the new class (put it last), and of course update
              n1,n2,n3 and p1,p2,p3.

              Again, if you don't mind O(N) memory usage - actually just N*4 bytes -
              then I suggest that you do keep the list of questions - just their
              numbers - in memory. Then you won't need to read and write the disk
              all the time.

              Nadav.

              P.S. This is of course a variant on the classic shuffling (or random
              permutation) algorithm. In the shuffling algorithm, you need to draw
              random questions from the list, but after you used a question, you don't
              want to see it again (its probability becomes zero). Instead of keeping
              the used questions on the list (and getting O(N^2) time complexity for
              the whole shuffle operation), the trick is to move them to the end of
              the list, and to remember how many questions we still have (in the
              beginning of the list) to draw from.


              --
              Nadav Har'El | Sunday, Jul 8 2012, 18 Tammuz 5772
              nyh@... |-----------------------------------------
              Phone +972-523-790466, ICQ 13349191 |If Barbie is so popular, why do you have
              http://nadav.harel.org.il |to buy her friends?
            • Omer Zak
              Hello Nadav, Thanks for your answer. I ended up implementing something similar to your proposal. The questions are stored in a SQLite database, so I don t mind
              Message 6 of 7 , Jul 8, 2012
                Hello Nadav,
                Thanks for your answer.

                I ended up implementing something similar to your proposal.
                The questions are stored in a SQLite database, so I don't mind at all
                O(N) disk space consumption. I did want O(1) RAM consumption, as RAM is
                to be considered as a scarce resource in smartphones.

                The number of probability classes is finite (3 if to be exact). In
                principle, it is possible to have a more sophisticated algorithm for
                selecting the next question (with unlimited, time varying probability
                classes) but I don't think that the subject matter warrants this kind of
                sophistication.

                The disk files for each probability class are being kept implicitly - as
                result sets of queries for each probability class.
                The actual performance of the algorithm (O(1), O(N) or whatever)
                actually depends upon the DB performance, which I didn't bother to
                benchmark or fine tune as a function of the number of questions.

                I did not (even implicitly) implement the optimization that you
                suggested for O(1) disk filee modifications.

                --- Omer


                On Sun, 2012-07-08 at 10:23 +0300, Nadav Har'El wrote:
                > On Sat, Jul 07, 2012, Omer Zak wrote about "Re: [hackers-il] Algorithm for randomly choosing a question":
                > > The proposed approach has the problem of memory consumption of order of
                > > O(N) (where N is the number of the questions). I'd like to see an
                > > algorithm whose memory consumption is closer to O(1) than to O(N).
                >
                > Omer, here is an O(1) *RAM* algorithm. Note that you still have O(N)
                > disk usage, of course - there are N questions, and they need to be saved
                > somewhere, and so is the history of the success for them.
                > If you drop the O(1) memory requirement (which I think is an overkill for
                > any type of modern computer), you can make things simpler.
                >
                > Note - I assume the number of probability classes (in your example,
                > 1, 1/2 and 1/100, if I remember correctly) is constant (O(1)) and not
                > O(N).
                >
                > Keep on disk a file for each probability class - one list of questions
                > with normal probability (1), another list of questions with half
                > probability, and a third list of questions with 1/100 probability.
                > Each will be a list of numbers (indexing questions stored in a separate
                > file).
                >
                > In *memory*, keep the number of questions in each class: n1, n2, n3,
                > and the partial probability sums:
                > p1 = n1*1
                > p2 = n1*1 + n2*(1/2.)
                > p3 = n1*1 + n2*(1/2.) + n3*(1/100.)
                >
                > Each time you want to draw a random question, pick a random number
                > between 0 and p3. If it's between 0 and p1, you decided to take a
                > question from class 1. If it's between p1 and p2, you decided to take
                > a question from class 3. If it's between p2 and p3, then from class 3.
                >
                > You decide *which* question to take in the obvious way - e.g., if you
                > got a random number p between p1 and p2, then the question number in the
                > class 2 list is m=(p-p1)/(1/2). You then need to look at the m'th
                > position in the on-disk list of class 2 questions, to get the actual
                > index of the question.
                >
                > Finally, when the user answers you need to update your data. You need
                > to remove the question from the previous class (to do this in O(1) time,
                > just move the last element of the list into the newly formed hole),
                > and to add it to the new class (put it last), and of course update
                > n1,n2,n3 and p1,p2,p3.
                >
                > Again, if you don't mind O(N) memory usage - actually just N*4 bytes -
                > then I suggest that you do keep the list of questions - just their
                > numbers - in memory. Then you won't need to read and write the disk
                > all the time.
                >
                > Nadav.
                >
                > P.S. This is of course a variant on the classic shuffling (or random
                > permutation) algorithm. In the shuffling algorithm, you need to draw
                > random questions from the list, but after you used a question, you don't
                > want to see it again (its probability becomes zero). Instead of keeping
                > the used questions on the list (and getting O(N^2) time complexity for
                > the whole shuffle operation), the trick is to move them to the end of
                > the list, and to remember how many questions we still have (in the
                > beginning of the list) to draw from.
                --
                Sent from a PC running a top secret test version of Windows 97.
                My own blog is at http://www.zak.co.il/tddpirate/

                My opinions, as expressed in this E-mail message, are mine alone.
                They do not represent the official policy of any organization with which
                I may be affiliated in any way.
                WARNING TO SPAMMERS: at http://www.zak.co.il/spamwarning.html
              • Udi Oron
                Hi! ... Keep three lists with pointers to unasked wrong and correct questions, moving questions between them. First choose from which list to pick your
                Message 7 of 7 , Jul 8, 2012
                  Hi!

                  On Sat, Jul 7, 2012 at 7:53 AM, Omer Zak <w1@...> wrote:
                  I want the application to be able to select any question randomly, but
                  with different weights, so that:
                  - A question the user never saw has weight of (say) 0.5.
                  - A question the user answered wrongly has weight of 1 (i.e. such a
                  question is twice more likely to be presented than a question never
                  seen).
                  - A question the user answered correctly has weight of 0.005 (i.e. it is
                  100 times less likely to be seen again than a question never seen
                  before).


                  Keep three lists with pointers to "unasked" "wrong" and "correct" questions, moving questions between them.
                  First choose from which list to pick your question (abuse length of list * weight for each list.)
                  Than pick up a random question from your chosen list.
                  if RAM is a problem, keep the lists offline :-)

                  Udi.
                Your message has been successfully submitted and would be delivered to recipients shortly.