OT HELP/ADVICE: Experiences with db's & serving lots of pages

  • Brat Wizard
    Message 1 of 6 , Jan 22, 2002
      Howdy--

      This is an off-topic post-- feel free to ignore it.

      I know the folks on this list are active web developers and generally
      smart people and this is a problem that has been vexing me for some time
      now-- I am at the point of having to do _something_ and I could use some
      advice...

      I am working on a large application which is built on top of postgresql.
      The system is currently spread out over several hosts and will
      (relatively soon) grow to handling many, many requests per short
      interval. (As an idea of what the site is/does, think "listings".)
      Overall everything works fine. One issue to contend with is that,
      fairly frequently, portions of the data expire as old listings fall
      off and new listings are added (in hundreds of categories). Another
      minor wrinkle is that the listings need to be accessible in any of
      about 5 sort orders.

      Currently the lists of listings are statically-cached on the system.
      I have a back-end process that runs periodically to rebuild the static
      caches which basically preprocesses the listings and prepares html
      static "chunks" that can then be included at the appropriate time. This
      I have found to be pretty much the fastest approach from the user's
      point of view, and probably, I'm thinking, the least of the evils as far
      as system load goes. But I'm concerned about several things and so
      I keep mulling over some other ideas and this is what I need advice
      on...

      At the moment, rebuilding the static caches is not a big issue- it
      doesn't take long and it's possible to get it all rebuilt in less than
      half an hour. I'm concerned though what the next step will be when the
      number of listings grows to the point that half an hour (or an hour)
      isn't enough. One ameliorating factor, I suspect, will be that the
      number of hosts needed to serve users will continue to grow to the point
      that the sensible thing will be to break up the site to serve up only a
      subset of categories per host (or host cluster) and thus the load will
      be spread out that way. This will somewhat reduce the amount of overhead
      spent rebuilding category listings for a bit-- but ultimately, in my
      reasoning-- the system will reach a point where it is pretty much
      continuously rebuilding category listings and putting a constant strain
      on the main "live" database. So somewhere down the road another solution
      will have to be found. What other solutions are there?

      I have considered an alternative approach to maintaining category
      listings-- instead of pre-generating entire pages as static chunks- what
      about using something like gdbm (or whatever) to make tiny little
      category databases that store pre-generated __rows__ of each page, and
      then pull "x" of them at a time to make a page... has anyone tried
      something like this? What is the performance like? Is there much
      overhead associated with constantly opening and closing these types of
      database files (I'm not really familiar with non-postgres databases)?
      Would this make it easier to deal with sorting? Ideally it would be nice
      to maintain only one body of category data and just re-sort it on the fly
      (or maintain indexes) as needed. How much efficiency would be lost
      versus time to serve up pages? I know there has to be a trade-off
      someplace but how bad will it be?
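
      To make the idea concrete, here is a rough sketch of what I'm
      imagining, in Perl, using GDBM_File tied hashes. The file layout, key
      scheme and rows-per-page number are just invented for illustration,
      not anything I've actually built yet:

          # Sketch: one dbm file per category holding pre-rendered HTML rows,
          # keyed by "sortorder:rank", so a page is just N consecutive fetches.
          use strict;
          use GDBM_File;

          my $rows_per_page = 25;    # invented for illustration

          sub page_rows {
              my ($category, $sort, $page) = @_;
              tie my %rows, 'GDBM_File', "cache/$category.gdbm",
                  &GDBM_READER, 0644
                  or die "can't open cache for $category: $!";
              my $start = ($page - 1) * $rows_per_page;
              my @html;
              for my $rank ($start .. $start + $rows_per_page - 1) {
                  my $chunk = $rows{"$sort:$rank"};
                  last unless defined $chunk;   # ran off the end of the listings
                  push @html, $chunk;
              }
              untie %rows;
              return join '', @html;
          }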

      Is there some other solution I'm missing? Is it faster to access
      something pre-generated from the disk (doing asynch system calls vs.
      lighter-weight network db access calls) to get data? I'm cool with the
      idea that the list of listings does not have to be up to the absolute
      minute accurate- reasonably accurate (say half-hour, hour, two-hours) is
      close enough...

      What are your experiences with this type of situation?

      Thanks for any help or guidance anyone can offer-- there don't seem to
      be a lot of places "out there" that one can go to get this type of
      information.

      Regards,

      John Whitten
      brat@...
      Wizard.Org, Inc.



      ---------------------------------------------------------------------
      To unsubscribe, e-mail: asp-unsubscribe@...
      For additional commands, e-mail: asp-help@...
    • Brat Wizard
      Message 2 of 6 , Jan 22, 2002
        Thanks for taking the time to reply Ray- I like using up simple ideas
        first too. I'm just worried that this project is gonna get big quick and I
        need to have the next several levels of "studliness" rolling around in the
        back of my head now I think ;)

        As you rightly guessed- after many trials using a number of techniques,
        static pages were the one that moved all of the processing overhead
        "someplace else", leaving pure html "chunks" in its place. I'm not
        quite to the point though of being able to serve it all up statically. I
        have some other processing that needs to be done but Apache::ASP handles
        that admirably.

        (BTW- just in case anybody here wasn't already sure...
        Apache::ASP _ROCKS_ !! :)

        What are your experiences with squid? Have you used it before? What sort
        of improvement have you typically seen by using it?

        As far as cache timeouts- I have a "clever" way of handling that. Each
        category's listings are stored in its own subdir. I simply regen the
        listings in a "new" dir and, when they are done, slide the current dir
        to an "old" dir and move the "new" dir into place as the "cur" dir.
        That way any file handles that are open will stay open and will
        certainly be concluded by the next round of processing. The only load
        is inodes and I think I have enough of those.
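
        (For what it's worth, the rotation itself is just a couple of renames
        along these lines -- the paths are invented, and rename() is atomic as
        long as everything lives on the same filesystem:)

            # Sketch of the "new" -> "cur" -> "old" rotation for one category.
            use strict;
            use File::Path qw(rmtree);

            sub rotate_category {
                my ($base) = @_;                        # e.g. "/www/cache/somecategory"
                rmtree("$base/old") if -d "$base/old";  # reap the previous cycle
                rename("$base/cur", "$base/old") if -d "$base/cur";
                rename("$base/new", "$base/cur")
                    or die "couldn't promote $base/new: $!";
            }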

        My real concern with this method is the constant pounding the drives
        take in the process. It's all asynch system calls (read: can block) and
        I have really wanted to know, all things being equal- which is faster:
        assembling data from a file on disk or retrieving data via a networked
        database (assuming a fast enough backbone). I've never seen a
        well-crafted study of this scenario and no opinions I trust on the
        subject. One of these days I reckon I'll have to model it and find out-
        but it would be nice to find someone who already has an opinion.


        So the next question all this leads to is what is the best "search"
        scenario for a similar setup? Currently I'm just searching the database
        raw. I already know this will grind to a halt quickly. My first plan is to
        implement a quick and dirty inverted index. That should take me a ways
        down the road- how far I'm not sure since this will be my first foray into
        such things. I'm wondering how/where to improve efficiencies here and I
        find myself scratching my head quite a bit. Obviously there has to be some
        limit on returned tuples. However, all the tuples are selected even if
        they are culled out by a 'limit' later. So this part is hard to avoid.
        Building caches on the fly (say using the user's session id as a cache
        identifier) is my next thought but this can get out of hand quickly too-
        but overall seems to be the best way to go as it only has to be done once
        and can be reaped later by the OnSessionEnd() event.
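
        By "quick and dirty inverted index" I mean something not much fancier
        than the following sketch (file name and word rules invented): build it
        offline into a dbm file, consult it before ever touching postgres, and
        only then apply the 'limit':

            use strict;
            use GDBM_File;

            tie my %index, 'GDBM_File', 'search.idx', &GDBM_WRCREAT, 0644
                or die "can't open index: $!";

            # Build pass: map each word to the listing ids that contain it.
            sub index_listing {
                my ($listing_id, $text) = @_;
                my %seen;
                for my $word (map { lc } split /\W+/, $text) {
                    next if length($word) < 3 or $seen{$word}++;
                    # dbm values are flat strings, so keep a space-separated id list
                    $index{$word} = defined $index{$word}
                        ? "$index{$word} $listing_id" : $listing_id;
                }
            }

            # Query pass: intersect the id lists (AND semantics).
            sub search_ids {
                my (@words) = @_;
                my (%uniq, %hits);
                my @terms = grep { !$uniq{$_}++ } map { lc } @words;
                for my $word (@terms) {
                    $hits{$_}++ for split ' ', ($index{$word} || '');
                }
                return grep { $hits{$_} == @terms } keys %hits;
            }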

        Any ideas or suggestions on searching? Is this a variation of the same
        problem or worse because it's asynchronous?

        John

        Ray Cote wrote:

        > >At the moment, rebuilding the static caches is not a big issue- it
        > >doesn't take long and its possible to get it all rebuilt in less than
        > >half an hour. I'm concerned though what the next step will be when the
        > >number of listings grows to the point that half and hour (or an hour)
        >
        > Well, nothing is going to beat static caches -- particularly if you
        > put a caching server (such as Squid) in front of it.
        >
        > One thing you may want to consider is setting your cache timeouts
        > fairly long (longer than it takes to generate a single category).
        > Then, update a category and clear the cache. While you're generating,
        > you will not be serving any of the new pages since the cache won't be
        > looking yet.
        >
        > Also, simply moving the cache to a separate box may be sufficient to
        > get next bump of speed. Again, the cache is what is actually
        > 'serving' and not your web server with the static pages. Would be
        > interested to see how much of the processor you get to use during
        > page generation vs what your Web server is taking.
        >
        > Along similar lines, you should be able to free up significant disk
        > time (and head seeking) by offloading the cache.
        >
        > Just some initial thoughts on your request. Obviously, there's some
        > more complicated approaches you could take, but I like to run out of
        > simple ideas first. :}
        >
        > Ray
        > --
        > -----------------------------------------------------------------
        > Raymond Cote, President Appropriate Solutions, Inc.
        > www.AppropriateSolutions.com rgacote@...
        > 603.924.6079(v) POB 458, Peterborough, NH 03458 603.924.8668(f)


        ---------------------------------------------------------------------
        To unsubscribe, e-mail: asp-unsubscribe@...
        For additional commands, e-mail: asp-help@...
      • Joshua Chamas
        Message 3 of 6 , Jan 22, 2002
          Brat Wizard wrote:
          >
          > Howdy--
          >
          > This is an off-topic post-- feel free to ignore it.
          >
          > I know the folks on this list are active web developers and generally
          > smart people and this is a problem that has been vexing me for some time
          > now-- I am at the point of having to do _something_ and I could use some
          > advice...
          >
          > I am working on a large application which is built on top of postgresql.
          > The system is currently spread out over several hosts and will
          > (relatively soon) grow to handling many many requests per short
          > interval. (As an idea of what the site is/does, think "listings").
          > Overall everything works fine. An issue to contend with is fairly
          > frequently portions of the data expire as old listings fall off, and new
          > listings are added (in hundreds of categories). Another minor wrinkle is
          > that the listings need to be accessible from one of about 5 sort orders.
          >

          As your data set grows, it's better & easier to cache on the fly than
          to prebuild your cached data. For this purpose, you could use
          the $Response->Include() cache API to do component based caching,
          which is supported in Apache::ASP 2.29.

          Check out http://www.apache-asp.org/objects.html#%24Response-%3EI3a1a8b77

          Let's say you have a listings page that does an expensive SQL query
          to generate 50 listings. Or a listings page that does 50
          inexpensive SQL queries. Chances are that if you cache the results
          to disk locally, you will see great performance gains, taking a
          calculation that might take .25 sec normally and reducing it to a
          .03 second cache fetch. If you use a modern OS that handles file
          buffering well, the disk based cache will really be cached in local
          RAM anyway, so it can be really fast.

          The trick to using Apache::ASP include caching is to create the
          largest include that can be reused between your web site visitors.
          So let's say this was a real estate listing search, you could:

          $Response->Include({
              File    => 'realestate_search.inc',
              Cache   => 1,    # to activate cache layer
              Expires => 300,  # to expire in 5 minutes
            },
            $Request->Form('search_zip')
          );

          The search_zip value will be passed into the include as $_[0], or
          shift(@_), and the component cache key will be based in part on
          search_zip, so the caching will be per query. The output from this
          component would be fully cached, so no time would even be spent
          generating the HTML from the SQL.

          If you didn't mind spending time generating the HTML, but just
          wanted to cache the SQL results, you could do something like:

          my $rows = $Response->Include({
              File    => 'realestate_search_results.inc',
              Cache   => 1,
              Expires => 300,
            },
            $Request->Form('search_zip')
          );


          and then in the include realestate_search_results.inc, it could ultimately:

          <%
          ...
          return $rows;
          %>

          The returned result rows will also be cached, as the include caching layer
          caches both the output and return value of a cached include.

          What I would not do is cache the results of a simple one row select,
          as the cache fetch may be even slower than the time the database
          would spend doing it. Deciding which includes to cache must be done
          carefully. Now if you needed to scale to a large cluster, I take
          that back: then it can be worthwhile to cache even simple selects,
          just so you can front end your database with more web servers.
          Caching static SQL table data is a no brainer, but caching dynamic
          data needs to be done with great care.

          --Josh

          _________________________________________________________________
          Joshua Chamas Chamas Enterprises Inc.
          NodeWorks Founder Huntington Beach, CA USA
          http://www.nodeworks.com 1-714-625-4051

          ---------------------------------------------------------------------
          To unsubscribe, e-mail: asp-unsubscribe@...
          For additional commands, e-mail: asp-help@...
        • Joshua Chamas
          Message 4 of 6 , Jan 22, 2002
            Joshua Chamas wrote:
            >
            > As your data set grows, its better & easier to cache on the fly than
            > to prebuild your cached data. For this purpose, you could use
            > the $Response->Include() cache API to do component based caching,
            > which is supported in Apache::ASP 2.29.
            >
            > Check out http://www.apache-asp.org/objects.html#%24Response-%3EI3a1a8b77
            >

            BTW, if you do decide to use include caching, please read:

            http://www.apache-asp.org/config.html#Caching

            as caching will not work if you do not follow directions there.

            Also consider a mod_proxy front end to cache whole static HTML
            pages too!

            --Josh
            _________________________________________________________________
            Joshua Chamas Chamas Enterprises Inc.
            NodeWorks Founder Huntington Beach, CA USA
            http://www.nodeworks.com 1-714-625-4051

            ---------------------------------------------------------------------
            To unsubscribe, e-mail: asp-unsubscribe@...
            For additional commands, e-mail: asp-help@...
          • Brat Wizard
            Message 5 of 6 , Jan 23, 2002
              > Hope my comments have been of some small help.
              > Ray


              Ray-

              Actually your comments have been greatly helpful. It is very useful to have a dialog with other folks who have actually implemented large database + web solutions. I know that each one is different, or most of them anyway- and that is one of the things that makes all this challenging.

              In our scenario, we are working on model data (and pre-loaded "live" data) but do not yet have public users, so we do not have "live" usage data (history) to look at yet (which always makes the problem even more fun to try to solve ;). So I am attempting to anticipate demand and figure out what usage requirements will be based on the nature of the site (olm listings).

              Your point regarding seasonal demand is interesting and one that I had not considered. Offhand, I cannot think of many categories of listings that are particularly seasonal in our setup, but there probably are some. On the other hand, there probably are areas of the site that are (or will be) considerably less trafficked than others. I definitely need to rebuild all pages every couple of hours or so, but you are absolutely right in pointing out that a scheduled and/or slewed approach might well reduce the loading factor considerably. (Thanks!) In rebuilding listings, I don't get the luxury of regenerating one page or another; rather, it's an entire category at a time.

              From your input, and from that of Josh, Thanos and everyone else who's helped out- it would seem that I'm already on the right track in generating static chunks and what's left to figure out is the best scheduling mechanism (which may in large part not be known until real history can be developed).

              The searching has me more concerned. That one is a lot more difficult to anticipate and control. Initially I think I'm just going to have to take the hit on servers and let it bang on the database directly. I just don't see any other way around it. As the site gets bigger, I think it may work to move searches to another machine and handle them out of a "nearly-live" database -- perhaps even served directly out of RAM (...?) and implemented with a lighter-weight, faster-responding database platform (mysql or dbm perhaps?)

              Since I'm already certain that the system will grow wider and the most obvious split point is across categories, this isn't as unwieldy as it may seem at first glance, and all that's left would be to implement some back-channel inter-machine communication layer to search across categories that are maintained on other machines and to collect the search results and spool them out to the browser.

              Does this sound feasible/practical?

              John

              Ray Cote wrote:

              Hi Brat:

              >What are your experiences with squid? Have you used it before? What sort
              >of improvement have you typically seen by using it?

              Short-hand, Squid delivers as quickly as Apache, just from a
              different machine, thus freeing up your primary server. Plus, since
              Squid actually delivers from RAM, it should be quicker than Apache
              delivering from disk.

              >I've never seen a well-crafted study of
              >this scenario and no opinions I trust on the subject. One of these days
              >I reckon I'll have to model it and find out- but it would be nice to find
              >someone who already has an opionion.

              I do note that one of the other answers on the list suggests letting
              ASP deliver direct out of the database and then cache that. How well
              it works vs static? Expect this to be highly system dependent.
              At present, we have a fairly small search engine (food related,
              http://www.digitalhearth.com/) that generates static pages at night
              for the links. However, one of the things we find is that usage goes
              in cycles: a) seasonal, b) depending on articles other places
              publish, etc.

              Assume you have 1,000 pages (just for discussion). Now ask how many
              of those 1,000 pages are going to be delivered today? If all 1,000
              will be hit, then static may work well; unless, of course, you can
              maintain all 1,000 in your cache, in which case we're back to
              thinking dynamic delivery is fine.

              If, however, you find that only 100 of the 1,000 are going to be
              delivered, then you start to lean towards the dynamic delivery
              method. Why generate those other 900 pages if you don't deliver them?

               From your earlier message, you stated that you rebuilt the static
              pages many times during the day. I'd expect you cannot get more than
              20% usage (talking off the top of my head) of those pages. Thus, by
              building static, you're wasting the build time for 80% of your pages.
              Wow!

              >So the next question all this leads to is what is the best "search"
              >scenario for a similar setup?

              Again my question relates to how many different searches do you get?
              We've been discussing similar problems here (not implementing any of
              these, so take these as off-the-cuff suggestions). May be interesting
              to disconnect these caches from the session and just cache them
              globally. Scan back through your search logs and determine what, if
              any, clusters you get. Do 50% of the people search for foo? Then a
              cached foo search will save a lot of work. You can just keep this
              tucked away somewhere and, when you want to 'refresh' the database,
              let your systems know that the search results need to be updated.
              Perhaps some structure with two simple tables (obviously
              implementation would be a bit more robust):

              |SearchID|SearchString |
              |   1    | foo         |
              |   2    | amazing foo |

              |SearchID| Rank | IndexIntoDatabase |
              |   1    |   1  |     80988         |
              |   1    |   2  |     5233          | etc.

              This keeps your search results hanging about in rank order for each
              specific search string. May also want to maintain a third table:

              |SearchID|Frequency|
              |   1    |   8342  |
              |   2    |     5   |

              which tells you how often each of the searches was requested. Then
              when you decide to refresh the queries, you could always prime this
              table with the top 100 searches so they respond quickest.
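
              In code (DBI, with invented table names -- searches,
              search_results, search_frequency -- and no real error handling)
              the cached lookup might go something like:

                  use strict;
                  use DBI;

                  # off-the-cuff sketch; the real thing would be more robust
                  my $dbh = DBI->connect('dbi:Pg:dbname=listings', 'user', 'pass',
                                         { RaiseError => 1 });

                  sub cached_search {
                      my ($query) = @_;
                      my ($id) = $dbh->selectrow_array(
                          'SELECT SearchID FROM searches WHERE SearchString = ?',
                          undef, $query);
                      # miss: run the real search and store its ranked results
                      return undef unless $id;
                      $dbh->do('UPDATE search_frequency SET Frequency = Frequency + 1
                                WHERE SearchID = ?', undef, $id);
                      return $dbh->selectcol_arrayref(
                          'SELECT IndexIntoDatabase FROM search_results
                           WHERE SearchID = ? ORDER BY Rank', undef, $id);
                  }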

              I have to say I really enjoy these types of problems because there is
              simply no correct answer. There is only the ability to tune your
              solution against field demand. A never ending problem.

              Hope my comments have been of some small help.
              Ray

              --
              -----------------------------------------------------------------
              Raymond Cote, President                 Appropriate Solutions, Inc.
              www.AppropriateSolutions.com       rgacote@...
              603.924.6079(v)  POB 458, Peterborough, NH 03458    603.924.8668(f)


            • Joshua Chamas
              Message 6 of 6 , Jan 23, 2002
                > The searching has me more concerned. That one is a lot more difficult to anticipate and control. Initially I think I'm just going to have to take
                > the hit on servers and let it bang on the database directly. I just don't see any other way around it. As the site gets bigger, I think it may work
                > to move searches to another machine and handle them out of a "nearly-live" databas
                > -- perhaps even served directly out of ram (...?) and implemented with a lighter-weight faster responding database platform (mysql or dbm perhaps?)
                >

                One common technique I use for handling searches against a database is to
                acquire an exclusive lock, so that no more than one search can run at
                a time on the system. For a system that does more than searching, this
                ensures that no matter how many searches have queued up, the rest of the
                site will still perform reasonably, and for a multiprocessor database it
                will further ensure that the searching does not use more than 1 CPU at a time.

                In MySQL, I typically do this with GET_LOCK() user locks. In Oracle I
                tend to do this by acquiring a row level lock somewhere relevant with
                a select ... for update, and then commit to allow another search to begin.
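
                In DBI terms the MySQL version boils down to something like this
                sketch (the lock name and timeout are invented):

                    use strict;
                    use DBI;

                    # Serialize searches: only one expensive query runs at a time.
                    sub run_search_serialized {
                        my ($dbh, $sql, @bind) = @_;
                        my ($got) = $dbh->selectrow_array(
                            q{SELECT GET_LOCK('site_search', 30)});   # wait up to 30s
                        die "search queue too deep, try again later\n" unless $got;
                        my $rows = eval {
                            $dbh->selectall_arrayref($sql, { Slice => {} }, @bind);
                        };
                        my $err = $@;
                        $dbh->selectrow_array(q{SELECT RELEASE_LOCK('site_search')});
                        die $err if $err;
                        return $rows;
                    }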

                With this kind of queue forming, the next thing to worry about
                is that you have enough mod_perl servers to wait in this queue,
                so you cannot get away with just 5 backend mod_perl servers in
                a dual mod_proxy/mod_perl type of config, but rather, you would
                probably need 20+ mod_perl servers at the very minimum so when
                the queue builds up it does not starve your entire site.

                --Josh

                _________________________________________________________________
                Joshua Chamas Chamas Enterprises Inc.
                NodeWorks Founder Huntington Beach, CA USA
                http://www.nodeworks.com 1-714-625-4051

                ---------------------------------------------------------------------
                To unsubscribe, e-mail: asp-unsubscribe@...
                For additional commands, e-mail: asp-help@...