Logging users' movements

  • ben syverson
    Message 1 of 10, Feb 4, 2005
      Hello,

      I'm curious how the "pros" would approach an interesting system design
      problem I'm facing. I'm building a system which keeps track of users'
      movements through a collection of information (for the sake of
      argument, a Wiki). For example, if John moves from the "dinosaur" page
      to the "bird" page, the system logs it -- but only once a day per
      connection between nodes per user. That is, if Jane then travels from
      "dinosaur" to "bird," it will log it, but if "John" travels moves back
      to "dinosaur" from "bird," it won't be logged. The result is a log of
      every unique connection made by every user that day.

      The question is, how would you do this with the least amount of strain
      on the server?

      Currently, I'm using Squid to switch between thttpd (for non-"Wiki"
      files) and mod_perl, with the metadata in MySQL, and the text data in
      flatfiles (don't worry, everything's write-once). The code I'm using to
      generate the "Wiki" pages is fairly fast as I'm testing it, but it's
      not clear (and impossible to test) how well it will scale as more nodes
      and users are added. As a defensive measure, I'm caching the HTML
      output of the mod_perl handler, but the cached files aren't being
      served by thttpd, because the handler still needs to register where
      people are going. So every time a page is requested, the handler
      checks whether this user has made this connection in the past 24
      hours, logs it if not, and then either serves the cached file or
      generates a new one (the cached files go out of date sporadically).
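
      Roughly, the per-request check looks like the following sketch -- the
      "connections" table and its columns here are just placeholders, not
      the real schema:

      use DBI;

      my $dbh = DBI->connect('dbi:mysql:wiki', 'user', 'password',
                             { RaiseError => 1 });

      # Log a node-to-node move, but only once per user per 24 hours.
      sub log_connection_once_per_day {
          my ($user_id, $from_node, $to_node) = @_;

          my ($seen) = $dbh->selectrow_array(
              'SELECT 1 FROM connections
                WHERE user_id = ? AND from_node = ? AND to_node = ?
                  AND logged_at > NOW() - INTERVAL 1 DAY
                LIMIT 1',
              undef, $user_id, $from_node, $to_node);
          return if $seen;    # already logged this connection today

          $dbh->do('INSERT INTO connections (user_id, from_node, to_node, logged_at)
                    VALUES (?, ?, ?, NOW())',
                   undef, $user_id, $from_node, $to_node);
      }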

      My initial thoughts on how to improve the system were to relieve
      mod_perl of having to serve the files, and instead write a perl script
      that would run daily to analyze the day's thttpd log files, and then
      update the database. However, certain factors (including the need to
      store user data in cookies, which have to be checked against MySQL)
      make this impossible.

      Am I on the right track with this?

      - ben
    • Leo Lapworth
      Message 2 of 10, Feb 4, 2005
        Hi,
        On 4 Feb 2005, at 08:13, ben syverson wrote:

        > Hello,
        >
        > I'm curious how the "pros" would approach an interesting system design
        > problem I'm facing. I'm building a system which keeps track of users'
        > movements through a collection of information (for the sake of
        > argument, a Wiki). For example, if John moves from the "dinosaur" page
        > to the "bird" page, the system logs it -- but only once a day per
        > connection between nodes per user. That is, if Jane then travels from
        > "dinosaur" to "bird," it will log it, but if "John" travels moves back
        > to "dinosaur" from "bird," it won't be logged. The result is a log of
        > every unique connection made by every user that day.
        >
        > The question is, how would you do this with the least amount of strain
        > on the server?
        >
        I think the standard approach for user tracking is a 1x1 gif. There
        are lots of ways of doing it; here are two:

        Javascript + Logs - update tracking when logs are processed
        -----------------------------------------------------------

        Use javascript to set a cookie (session or 24 hours) - if there isn't
        already one. Then use javascript to do a document write to the gif.

        so the request is /tracker/c.gif?c=<user_session_id>&page=dinosaur

        It should then be fast (no live processing) and fairly easy to extract
        this information from the logs and into a db.
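
        The log pass can then be a tiny Perl filter over the day's access
        log -- something like this rough sketch, assuming the c=/page=
        parameters above end up in a common-log-format "GET" line:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Collapse the day's tracker hits to one record per (session, page).
        my %seen;
        while (my $line = <>) {
            # Only look at requests for the tracking gif.
            next unless $line =~ m{"GET /tracker/c\.gif\?([^" ]+)};
            my %param = map { split /=/, $_, 2 } split /&/, $1;

            my ($session, $page) = @param{'c', 'page'};
            next unless defined $session && defined $page;

            # One row per session/page per log file (i.e. per day).
            print "$session\t$page\n" unless $seen{"$session\t$page"}++;
        }

        Run it as "perl tracker.pl access_log" and load the output into the
        db however you like.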

        Mod_perl - live db updates
        -------------------------------------
        Alternatively, if you need live updates, create a mod_perl handler
        that sits at /tracker/c.gif, processes the parameters and puts them
        into a database, then returns a gif (I do this; I read the gif in
        and store it as a global when the module starts, so it just stays
        in memory). It's fast and means you can still get the benefits of
        caching with Squid or whatever.
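
        Stripped down, that handler might look something like this (mod_perl
        1.x style; the package name, gif path and table are placeholders,
        and Apache::DBI would keep the connection persistent):

        package My::Tracker;

        use strict;
        use Apache::Constants qw(OK);
        use DBI;

        # Read the 1x1 gif once, at module load, and keep it in memory.
        my $GIF = do {
            open my $fh, '<', '/var/www/static/c.gif' or die "c.gif: $!";
            binmode $fh;
            local $/;
            <$fh>;
        };

        sub handler {
            my $r = shift;
            my %param = $r->args;    # c=<session_id>&page=<node>

            if ($param{c} && $param{page}) {
                my $dbh = DBI->connect('dbi:mysql:tracking', 'user', 'password');
                $dbh->do('INSERT INTO hits (session_id, page, hit_at)
                          VALUES (?, ?, NOW())',
                         undef, @param{'c', 'page'}) if $dbh;
            }

            # Always answer with the in-memory gif; tell browsers not to cache it.
            $r->content_type('image/gif');
            $r->header_out('Cache-Control' => 'no-cache');
            $r->send_http_header;
            $r->print($GIF);
            return OK;
        }

        1;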

        I get about half a million hits a day to my gif.

        I think the main point is you should separate it from your main content
        handler if you want it to be flexible and still allow other levels of
        caching.

        Cheers

        Leo
      • Malcolm J Harwood
        Message 3 of 10, Feb 4, 2005
          On Friday 04 February 2005 3:13 am, ben syverson wrote:

          > I'm curious how the "pros" would approach an interesting system design
          > problem I'm facing. I'm building a system which keeps track of users'
          > movements through a collection of information (for the sake of
          > argument, a Wiki). For example, if John moves from the "dinosaur" page
          > to the "bird" page, the system logs it -- but only once a day per
          > connection between nodes per user. That is, if Jane then travels from
          > "dinosaur" to "bird," it will log it, but if "John" travels moves back
          > to "dinosaur" from "bird," it won't be logged. The result is a log of
          > every unique connection made by every user that day.

          What are you doing with the data once you have it? Is there any reason that it
          needs to be 'live'? If not, you could simply add the username in a field in
          the logfile, and post-process the logs (assuming you trust the referer field
          sufficiently). That removes all the load from the webserver.

          > My initial thoughts on how to improve the system were to relieve
          > mod_perl of having to serve the files, and instead write a perl script
          > that would run daily to analyze the day's thttpd log files, and then
          > update the database. However, certain factors (including the need to
          > store user data in cookies, which have to be checked against MySQL)
          > make this impossible.

          Why does storing user data in cookies prevent you from logging enough to
          identify the user again later? Or are you storing something you need to
          reconstruct the trace that you can't get otherwise?

          --
          "Debugging is twice as hard as writing the code in the first place.
          Therefore, if you write the code as cleverly as possible, you are,
          by definition, not smart enough to debug it."
          - Brian W. Kernighan
        • ben syverson
          Message 4 of 10, Feb 4, 2005
            First of all, thanks for the suggestions, everyone! It's giving me a
            lot to chew on. I now realize (sound of hand smacking forehead) that
            the main problem is not the list of links and tracking users, but
            rather the inline Wiki links:

            On Feb 4, 2005, at 8:58 AM, Malcolm J Harwood wrote:

            > What are you doing with the data once you have it? Is there any reason
            > that it
            > needs to be 'live'?

            Sort of -- imagine our Wiki scenario, but without delimiters (I think
            this is rather common in the .biz world). So if the "dinosaur" node
            contains:

            "Some scientists suggest that dinosaurs may actually have evolved from
            birds."

            It'll automagically link to the "birds" node. However, let's say
            the "scientist" node doesn't yet exist -- but when it does, we want it to
            link up. I wouldn't say it "needs to be live," but it would be nice to
            get that link happening sooner rather than later.

            The way the system works now, it is live. Every time a page is
            generated, it stores the most recent node ID along with the cached
            file. The next time the page is viewed, it checks to see what node is
            the most recent, and compares it against what was the newest when the
            file was cached. If they're the same, nothing has changed, and the
            cache file is served. If they're different, the system looks through
            the node additions that happened since the node was cached, and sees if
            the original node's text contains any of those node names. If it does,
            it regenerates, recaches and serves the page. Otherwise, it revalidates
            the cache file by storing the new most recent node ID with the old
            cache file, and serves it up.
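
            In sketch form (made-up table and column names, and
            regenerate_page() standing in for the real page builder), the
            per-view check is:

            sub serve_node {
                my ($dbh, $node_id) = @_;

                my ($html, $cached_max) = $dbh->selectrow_array(
                    'SELECT html, newest_node_id FROM cache WHERE node_id = ?',
                    undef, $node_id);
                return regenerate_page($dbh, $node_id) unless defined $html;

                # Nothing added since this page was cached: serve it as-is.
                my ($current_max) = $dbh->selectrow_array('SELECT MAX(id) FROM nodes');
                return $html if $cached_max == $current_max;

                # Do any nodes added since then appear in this node's text?
                my ($text) = $dbh->selectrow_array(
                    'SELECT body FROM nodes WHERE id = ?', undef, $node_id);
                my $new_names = $dbh->selectcol_arrayref(
                    'SELECT name FROM nodes WHERE id > ?', undef, $cached_max);

                if (grep { index($text, $_) >= 0 } @$new_names) {
                    return regenerate_page($dbh, $node_id);    # rebuild and recache
                }

                # No matches: revalidate the old cache against the new high-water mark.
                $dbh->do('UPDATE cache SET newest_node_id = ? WHERE node_id = ?',
                         undef, $current_max, $node_id);
                return $html;
            }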

            The problem with this is that 99% of the time, the document won't
            contain any of the new node names, so mod_perl is wasting most of its
            time serving up cached HTML.

            However, if you use a cron-job log-analysis approach, every time
            a new node is added, you have to search through EVERY node's text
            to see if it needs a link to the new node. Imagine this with
            1,000,000 two-page documents.

            So maybe my system is as optimized as it's going to get?

            - ben
          • Christian Hansen
            Message 5 of 10, Feb 4, 2005
              ben syverson wrote:

              [...]

              > The problem with this is that 99% of the time, the document won't
              > contain any of the new node names, so mod_perl is wasting most of its
              > time serving up cached HTML.

              I have two suggestions:

              1) Use a reverse proxy/cache and send proper Cache-Control and
              ETag/Content-Length headers, e.g.:

              Last-Modified: Fri, 04 Feb 2005 11:11:11 GMT
              Cache-Control: public, must-revalidate

              2) Use a 307 Temporary Redirect and let thttpd serve it.

              307 Temporary Redirect
              Location: http://static.domain.com/WikiPage.html


              RFC2616 13 Caching in HTTP
              http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13

              RFC2616 10.3.8 307 Temporary Redirect
              http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3.8
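
              In a mod_perl handler, the two options might look roughly like
              this (mod_perl 1.x; the host name, cached_copy_exists() and
              build_page() are placeholders):

              package My::WikiPage;

              use strict;
              use Apache::Constants qw(OK);
              use POSIX qw(strftime);

              sub handler {
                  my $r = shift;
                  my ($page) = $r->uri =~ m{([^/]+)$};

                  # Option 2: hand a cached copy off to the static server via 307.
                  if (my $file = cached_copy_exists($page)) {
                      $r->status(307);
                      $r->header_out(Location => "http://static.domain.com/$file");
                      $r->send_http_header;
                      return OK;
                  }

                  # Option 1: serve it ourselves, with headers a proxy can cache on.
                  my $html = build_page($page);
                  $r->content_type('text/html');
                  $r->header_out('Last-Modified' =>
                      strftime('%a, %d %b %Y %H:%M:%S GMT', gmtime));
                  $r->header_out('Cache-Control'  => 'public, must-revalidate');
                  $r->header_out('Content-Length' => length $html);
                  $r->send_http_header;
                  $r->print($html);
                  return OK;
              }

              1;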

              --

              Regards
              Christian Hansen
            • Perrin Harkins
              Message 6 of 10, Feb 5, 2005
                ben syverson wrote:
                > The way the system works now, it is live. Every time a page is
                > generated, it stores the most recent node ID along with the cached file.
                > The next time the page is viewed, it checks to see what node is the most
                > recent, and compares it against what was the newest when the file was
                > cached. If they're the same, nothing has changed, and the cache file is
                > served. If they're different, the system looks through the node
                > additions that happened since the node was cached, and sees if the
                > original node's text contains any of those node names. If it does, it
                > regenerates, recaches and serves the page. Otherwise, it revalidates the
                > cache file by storing the new most recent node ID with the old cache
                > file, and serves it up.

                This is not a bad approach, but there's room for refinement.

                > The problem with this is that 99% of the time, the document won't
                > contain any of the new node names, so mod_perl is wasting most of its
                > time serving up cached HTML.

                It sounds like the problem is not so much that mod_perl is serving
                cached HTML, since that is easily improved with a reverse proxy server,
                but rather that your entire cache gets invalidated whenever anyone
                creates a new node, and mod_perl has to spend time regenerating pages
                that usually don't actually need to be regenerated.

                I think you could improve this a great deal just by changing your cache
                invalidation system. When someone creates a new node, rather than
                assuming that anything which was cached before the most recent addition
                is now invalid, try to figure out which nodes are truly affected and
                just invalidate their caches. The way I would do this is by adding
                full-text search capabilities on your data, using something like
                MySQL's text search columns which allow you to index new documents on
                the fly rather than rebuilding the whole index. Then, when someone adds
                a new node called "Dinosaurs", you do a search for all nodes that
                contain the word "Dinosaurs" and invalidate their caches.
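
                As a sketch (this assumes a MyISAM "nodes" table with a
                FULLTEXT index on its body column, and a "cache" table keyed
                by node_id):

                # When a node named $new_name is created, invalidate only the
                # caches of nodes whose text actually mentions it.
                sub invalidate_for_new_node {
                    my ($dbh, $new_name) = @_;

                    my $affected = $dbh->selectcol_arrayref(
                        'SELECT id FROM nodes WHERE MATCH(body) AGAINST(?)',
                        undef, $new_name);
                    return unless @$affected;

                    my $placeholders = join ',', ('?') x @$affected;
                    $dbh->do("DELETE FROM cache WHERE node_id IN ($placeholders)",
                             undef, @$affected);
                }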

                If you want to improve response time even more, you can have a cron job
                that periodically rebuilds anything without a cached version. Since you
                will be invalidating small sets of pages now when someone adds a new
                node rather than the entire site, this will only need to operate on a
                few pages each time it runs. Of course this is not practical if you
                often have people adding new nodes that need to be linked to from
                thousands of pages.
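
                The cron pass itself stays small -- something like this, with
                the table names and regenerate_page() again standing in for
                your real ones:

                #!/usr/bin/perl
                use strict;
                use warnings;
                use DBI;

                my $dbh = DBI->connect('dbi:mysql:wiki', 'user', 'password',
                                       { RaiseError => 1 });

                # Anything without a row in the cache table was invalidated earlier.
                my $stale = $dbh->selectcol_arrayref(
                    'SELECT n.id FROM nodes n
                       LEFT JOIN cache c ON c.node_id = n.id
                      WHERE c.node_id IS NULL');

                regenerate_page($dbh, $_) for @$stale;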

                - Perrin
              • ben syverson
                Message 7 of 10, Feb 5, 2005
                  On Feb 5, 2005, at 5:38 PM, Perrin Harkins wrote:

                  > It sounds like the problem is not so much that mod_perl is serving
                  > cached HTML, since that is easily improved with a reverse proxy
                  > server, but rather that your entire cache gets invalidated whenever
                  > anyone creates a new node, and mod_perl has to spend time regenerating
                  > pages that usually don't actually need to be regenerated.

                  That's not how it works. The entire cache IS invalidated when a new
                  node is added. But when you request one of the nodes, it checks to see
                  what the new nodes are. It then searches the node text for those new
                  node names. If there are no matches, it revalidates the cache file
                  (without regenerating it), and serves it. Otherwise, it regenerates the
                  node.

                  To reiterate, the node is NOT regenerated until it actually needs to be
                  -- but it is analyzed on every view to see if this is the case.

                  > The way I would do this is by adding full-text search capabilities
                  > on your data, using something like MySQL's text search columns which
                  > allow you to index new documents on the fly rather than rebuilding the
                  > whole index. Then, when someone adds a new node called "Dinosaurs",
                  > you do a search for all nodes that contain the word "Dinosaurs" and
                  > invalidate their caches.

                  But if you have 1,000,000 documents (or even 10,000), do you really
                  want to search through every single document every time a node is
                  added? Furthermore, do you really want every document loaded into the
                  MySQL database?

                  My thinking is that if you have many documents, odds are only a small
                  subset are being actively viewed, so it doesn't make sense to keep
                  those unpopular documents constantly up-to-date...

                  - ben
                • ben syverson
                  Message 8 of 10, Feb 5, 2005
                    On Feb 4, 2005, at 6:51 PM, Christian Hansen wrote:

                    > 1) Use a reverse proxy/cache and send proper Cache-Control and
                    > Etag/Content-Length headers, eg:
                    > 2) Use a 307 Temporary Redirect and let thttpd serve it.

                    I tested this, and it works wonderfully. Thanks Christian! I'm still
                    trying to figure out whether it makes sense for my setup -- inevitably,
                    the thing I left out is that we wanted people to be able to
                    define their own CSS files to "skin" the site (sorry -- there are so
                    many factors and details with this app). So if you really want to cache
                    the HTML, you have to make the href to the CSS go to a universal
                    redirect cgi, which sends the browser to the actual CSS. In other
                    words, now each page view is invoking mod_perl twice, and possibly
                    thttpd two times on top of that -- as opposed to one call to mod_perl.

                    So the question becomes: how much real advantage is there to using
                    thttpd instead of Perl's open() and print()? I know the answer is
                    "benchmarking," but as a final question on this topic, I wanted to see
                    if people had a hypothesis/opinion. Two light hits to mod_perl
                    resulting in one or two thttpd hits, or one heavier(?) hit to mod_perl?

                    - ben
                  • Perrin Harkins
                    Message 9 of 10, Feb 6, 2005
                      ben syverson wrote:
                      > That's not how it works. The entire cache IS invalidated when a new node
                      > is added.

                      What I'm saying is that you only invalidate the entire cache right now
                      because you have no way of telling which nodes are affected by the
                      change. If you had a full-text index, you could efficiently determine
                      which nodes are affected by a change and only invalidate them.

                      > But when you request one of the nodes, it checks to see what
                      > the new nodes are. It then searches the node text for those new node
                      > names. If there are no matches, it revalidates the cache file (without
                      > regenerating it), and serves it. Otherwise, it regenerates the node.

                      Yes, I understood all of that. That's what I meant by "regenerates."
                      I'm suggesting an approach that lets you skip revalidating, since the
                      cache would only be invalidated on documents that actually contained
                      matches.

                      > But if you have 1,000,000 documents (or even 10,000), do you really want
                      > to search through every single document every time a node is added?

                      Have you ever used an inverted word index? This is what full-text
                      search usually is based on. Searching a million documents efficiently
                      should be no big deal. You also only have to do this as part of the job
                      of creating a new node. You don't need to do it when serving files.

                      > Furthermore, do you really want every document loaded into the MySQL
                      > database?

                      I suggested MySQL as an easy starting point, since it allows incremental
                      updates to the text index. There are many things you could use, and
                      some will have more compact storage than others.

                      > My thinking is that if you have many documents, odds are only a small
                      > subset are being actively viewed, so it doesn't make sense to keep those
                      > unpopular documents constantly up-to-date...

                      You can use this approach for invalidation and still wait until the
                      pages are requested to regenerate them.

                      If the system is running fast enough and not having scalability
                      problems, there's no reason for you to get into making changes like what
                      I'm describing. I thought you were concerned about the time wasted by
                      revalidating unchanged documents, and this approach would eliminate that.

                      - Perrin
                    • ben syverson
                      Message 10 of 10, Feb 6, 2005
                        On Feb 6, 2005, at 11:04 AM, Perrin Harkins wrote:

                        > Have you ever used an inverted word index? This is what full-text
                        > search usually is based on. Searching a million documents efficiently
                        > should be no big deal. You also only have to do this as part of the
                        > job of creating a new node. You don't need to do it when serving
                        > files.

                        Yes, an unrelated part of the app relies on an inverted word index.
                        This is definitely how I would approach the "Wiki" aspect of the app,
                        if I were only matching whole words. However, this implementation needs
                        to be able to match the "perl" in "mod_perl," or the "net" in
                        "cybernetic."

                        - ben