Tuesday, January 15, 2013

What's the real uptime of Google App Engine?

One of the principal reasons for running your website on Google App Engine is its high reliability.  Their engineers carry the pagers for you, so when something goes wrong at 4am you can stay in bed while they scramble to sort it out.  That is great in theory, but how well do they do their jobs?

The Service Level Agreement promises 99.95% "uptime" and defines compensation if that level is not met.  Google defines downtime as an error rate of more than 10% for the datastore or the serving infrastructure.  Most of the other services, such as the task queue, email, Blobstore and memcache, are not covered at all.  They could go down and take your app with them, but the SLA would not count that as downtime.  Also, when the system is running slowly your site gets penalised by Google's ranking algorithm, but slow responses are not covered by the SLA either.

Last night my application running on GAE had another outage.  These seem to be a lot more frequent than I would expect.  Each time something happens they analyse the problem, post a message apologising and describe how they fixed it so it won't happen again... but then something else happens to take the site off-line.

I monitor the site with Pingdom, so I can look at the historical "real uptime" of my app.  This is reported as 99.87% uptime over the past 60 days.  Over this time the app has not been offline due to application failures, only infrastructure failures.

That is slightly below the 99.95% the SLA promises.
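To put those percentages into something more tangible, here is a minimal Python sketch that converts an uptime percentage into minutes of downtime over a window.  The function name `downtime_minutes` is my own, and the 60-day window is simply the Pingdom reporting period used above; the SLA itself may measure over a different period and, as noted, counts downtime by error rate rather than by failed pings, so treat this as a rough comparison only.

```python
def downtime_minutes(uptime_pct, days):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - uptime_pct) / 100


# Assumption: compare both figures over the same 60-day window.
sla_budget = downtime_minutes(99.95, 60)   # what the SLA figure would allow
observed = downtime_minutes(99.87, 60)     # what the Pingdom figure implies

print(f"Allowed at 99.95%: {sla_budget:.1f} minutes")
print(f"Implied by 99.87%: {observed:.1f} minutes")
```

Over 60 days that works out to roughly 43 minutes allowed against roughly 112 minutes observed, which makes the gap feel bigger than the percentages suggest.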

Pingdom also allows me to download the historical data as a CSV and analyse it myself.  In the last 60 days they pinged my site 86400 times and it was down 196 times.  That is 0.23% downtime, or 99.77% uptime.  Hey, that's close to the official figure, if a touch lower!

But what if we add requests that took an unacceptable amount of time to return?  The average request time is about 800 ms from the Pingdom servers, and anything over 10 seconds is seriously slow.  There were 270 pings that took longer than 10,000 ms but were logged as successful.  Counting those as well would take the proportion of "bad requests" to 0.54%, or 99.46% uptime.  Requests over 5 seconds were more than twice this amount.  Lucky for Google they do not promise your site will be fast!
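If you want to run the same sums on your own export, something like the following Python sketch would do it.  I am assuming a CSV with `status` and `response_ms` columns; those names are made up for illustration, so adjust them to whatever your Pingdom export actually contains.  A check counts as "down" if its status is anything other than "up", and as "slow" if it succeeded but took longer than the threshold.

```python
import csv
import io

SLOW_MS = 10_000  # anything over 10 seconds is treated as a bad request


def summarise(csv_file, slow_ms=SLOW_MS):
    """Count down and slow checks in a CSV of pings.

    Assumes hypothetical columns `status` ("up"/"down") and `response_ms`.
    """
    total = down = slow = 0
    for row in csv.DictReader(csv_file):
        total += 1
        if row["status"].strip().lower() != "up":
            down += 1
        elif int(row["response_ms"]) > slow_ms:
            slow += 1
    if total == 0:
        raise ValueError("no checks found in file")
    return {
        "total": total,
        "down": down,
        "slow": slow,
        "uptime_pct": 100 * (total - down) / total,
        "effective_uptime_pct": 100 * (total - down - slow) / total,
    }


# Tiny made-up sample, just to show the shape of the input.
sample = io.StringIO(
    "time,status,response_ms\n"
    "2013-01-14 00:00,up,800\n"
    "2013-01-14 00:01,down,\n"
    "2013-01-14 00:02,up,12000\n"
    "2013-01-14 00:03,up,950\n"
)
result = summarise(sample)
print(result)

# The same arithmetic using the counts quoted in this post.
pings, failed, too_slow = 86400, 196, 270
print(round(100 * (pings - failed) / pings, 2))             # 99.77
print(round(100 * (pings - failed - too_slow) / pings, 2))  # 99.46
```

Changing `slow_ms` to 5000 would give the stricter 5-second figure.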

All in all, I'm pretty happy with that uptime considering that I do not need to worry about my own infrastructure and software security.  I also have faith that big G will continue to improve and refine their systems, or they will lose a lot of business.  I'm probably less sensitive to downtime than a lot of their customers.