Pre-caching Important Pages on Low Volume Websites

I love to start my day by visiting sites I just built. But my pride turns into concern when the home page takes 8.. 9.. 10.. seconds to load. Uh-oh.

Why are my sites slow first thing in the morning? They are extremely low-volume websites with only a couple of visitors per hour. I don’t get it. Is it my host? Is it a rogue module? Do I have too many paths defined? Do I have too many modules turned on? And… how do I find out?

Through a lot of trial, error and plain observation, I learned that the same problem was occurring in my local MAMP environment. OK, so I’ll get off my hosting provider’s back. I then noticed that it also happened with my local installation of Open Atrium. There’s no way that the good people at Development Seed are using anything other than best practices, so I couldn’t blame my own code either.

When I would search Google for “Drupal performance”, I was constantly redirected to things like memcache, APC, etc, etc.  All great, but I’m not serving up 50,000 nodes per second – that’s the wrong path for me.

Boost

Then I stumbled upon Boost and learned a couple of things:

  1. When Drupal cron runs, Drupal clears the cache.
  2. Boost wouldn’t help me because of #1: Boost only writes its static file when a page is first requested after being cleared from the cache. In fact, that first request now carries the extra overhead of writing a file. The same initial slowness is going to occur over my morning coffee.

All I need is a cron job to run after cron clears the cache to trigger the caching of the home page, and with that, the menu system and anything else that Drupal rebuilds at that point!

There are a couple of threads in the Boost issue queue on this topic: one discusses something called “Pre-caching” and the other a “Crawler.”

  1. Auto Regenerate Cache (pre-caching) (the crawler code thread)
  2. Auto Regenerate Cache (pre-caching) (formerly “Boost export module”)

I got involved in the issue queues and have learned a lot from mikeytown2 who has been incredibly responsive and helpful.

#1 The Crawler Thread

This issue contains code that will allow you to run a PHP script (outside of Drupal) to crawl the site.  While I haven’t yet tested it, I can tell it’s going to be a bit too much for my needs as it will crawl all nodes and has some fancy precautions that will prevent it from bogging down the site.  Plus, it’s not a Drupal module and I would prefer it otherwise.

#2 Boost Export Module

This is a great solution for pre-caching the site; the only problem is that you have to run it manually. A button is added to the Boost performance page. When you click that button, it invokes the batch API to hit all your nodes. You are shown a progress bar and it’s all quite nice, but it doesn’t meet my specific criterion of being run on cron.

Introducing the Pre-cache Module

I’m new to writing modules; up to this point I have only written patches or simple modules that implement something like hook_form_alter (copy and paste stuff). How hard could this be? After having been primed by reading through various module files and the excellent Pro Drupal Development book, it was quite simple, really.

The Pre-Cache module has a cron hook that gets fired off during the normal Drupal cron run.  It visits URLs (using PHP’s file_get_contents() function) that are defined on the module’s settings page (admin/settings/precache).  You simply enter in the URLs you want visited into a textarea field, one per line.  I even threw in permissions settings for good measure, so be sure you set those after enabling the module.

After you have defined at least one URL, go ahead and run cron (there is a link at the top of the module’s settings page). Then go visit the recent log entries page (yes, that link is offered at the top of the module’s settings page, too). You will see a log of all the pages that were visited successfully. If there was a problem, such as a 404 error, you will see that, too.
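To make the flow above concrete, here is a minimal sketch of the same idea in Python rather than as Drupal code. It is not the module's source: the names parse_urls, http_status and precache are hypothetical, and the "ok"/"error" labels merely stand in for the module's log entries. The fetch function is passed in as a parameter so the loop can be tried without touching a real site.

```python
from typing import Callable, List, Tuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def parse_urls(textarea: str) -> List[str]:
    """Split the textarea contents into URLs, one per line, skipping blank lines."""
    return [line.strip() for line in textarea.splitlines() if line.strip()]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Request the URL and return its HTTP status code (0 if it could not be reached)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return 0


def precache(
    textarea: str, fetch: Callable[[str], int] = http_status
) -> List[Tuple[str, int, str]]:
    """Visit every configured URL and return one log entry per URL."""
    log = []
    for url in parse_urls(textarea):
        status = fetch(url)
        log.append((url, status, "ok" if status == 200 else "error"))
    return log
```

Calling precache() with the text from the settings field returns a list like the log described above: one entry per URL, with problems such as a 404 marked as errors.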

Keep in mind that this module is not dependent on Boost. It simply triggers Drupal’s cache, provided caching is enabled on the Performance settings page. But if you have Boost enabled, each visit will also cause Boost to write its static HTML file, just as it would for any other first request.

Cron Performance?

This module is only meant to solve the “post-cron lag” caused by the cache being cleared. It’s not intended to cache all your nodes. In fact, I wouldn’t recommend defining more than 10 URLs. (I have noticed cron taking longer and longer with each URL added.) Keep in mind that simply providing the URL to your home page will do a great deal to improve the load time for the person who visits after cron has run.
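One way to keep the cron run from growing without bound is to cap the list and stop once a time budget is used up. The sketch below is an assumption about how that could look, not something the module currently does; visit_within_budget, MAX_URLS and budget_seconds are hypothetical names, and the 30-second default is arbitrary.

```python
import time
from typing import Callable, Iterable, List

MAX_URLS = 10  # the soft limit suggested above


def visit_within_budget(
    urls: Iterable[str],
    visit: Callable[[str], None],
    budget_seconds: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Visit at most MAX_URLS URLs in order, stopping once the time budget is spent.

    Returns the URLs that were actually visited.
    """
    started = clock()
    visited = []
    for url in list(urls)[:MAX_URLS]:
        # Check the budget before each visit; a visit already in progress is not interrupted.
        if clock() - started >= budget_seconds:
            break
        visit(url)
        visited.append(url)
    return visited
```

With a scheme like this it makes sense to list the home page first, since URLs at the end of the list are the ones skipped when the budget runs out.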

Collateral Damage?

Not being an expert on the inner workings of Drupal, I am wondering what I might be doing that is outside of best practices. I am also wondering how my stats are being affected. Is Google Analytics going to report higher site visits? Probably. But I’ll deal with that later.

Feedback

I am very interested in receiving feedback on any of the code or suggestions for next steps. Just write a comment and I will read it and respond. I am making this module available here until I figure out the right path forward. Will it become a module of its own? Will it be distributed with the Boost module? All yet to be determined.

Download

July 31, 12 PM EST:  Initial Release

July 31, 6 PM EST:  Field length limit on URLs field was too small.  Removed limit.

August 3, 8:15 AM EST:  Cleaned up some code that was resulting in PHP warnings.  Per mikeytown2’s suggestion, I switched to drupal_http_request() to get pages.

Download the Pre-Cache Module >