varnish-cache + WAMP + NodeJS + Socket.IO

You can run Node.js behind both Nginx and Apache; in fact, both of those servers are frequently used to reverse-proxy to Node apps (see, for example, the writeup on using nginx with node). The reason PHP is much more common on shared hosting is that it's been around far longer: Node was released in 2009, while PHP dates back to 1995. In that time, hosts have implemented PHP support and have had little reason to bother supporting other languages.

There are other ways to deploy node.js apps. You can use PaaS providers like OpenShift, Heroku, or AppFog.

Node doesn't work like most servers. With IIS and Apache, one server runs multiple sites, which lends itself to shared environments. With Node, you run your own server, so instead you tend to share a machine's resources.

I can't tell you whether it's worth learning Node because I don't know your motivation, but it can expand your skillset and open up career opportunities if you choose to go that way.

Here are a couple of hosting options in the low price range.





Optimising NginX, Node.JS and networking for heavy workloads

Optimising for scale

Used in conjunction, NginX and Node.JS are the perfect partnership for high-throughput web applications. They're both built using event-driven design principles and are able to scale to levels far beyond the classic C10K limitations afflicting standard web servers such as Apache. Out-of-the-box configuration will get you pretty far, but when you need to start serving upwards of thousands of requests per second on commodity hardware, there's some extra tweaking you must perform to squeeze every ounce of performance out of your servers.

This article assumes you're using NginX's HttpProxyModule to proxy your traffic to one or more upstream node.js servers. We'll cover tuning sysctl settings in Ubuntu 10.04 and above, as well as node.js application and NginX tuning. You may be able to achieve similar results on a Debian-based distribution, but YMMV if you're using something else.

Tuning the network

Meticulous configuration of Nginx and Node.js would be futile without first understanding and optimising the transport mechanism over which traffic data is sent. Most commonly, NginX will be connected to your web clients and your upstream applications via TCP sockets.

Your system imposes a variety of thresholds and limits on TCP traffic, dictated by its kernel parameter configuration. The default settings are designed for accommodating generic networking use. They are not necessarily geared up for high-volumes of short-lived connections handled by a web server.

The parameters listed here are the main candidates for tuning TCP throughput of a server. To have them take effect, you can drop them in your /etc/sysctl.conf file, or in a new config file such as /etc/sysctl.d/99-tuning.conf, and run sysctl -p to have the kernel pick them up. We use a sysctl cookbook to do the hard work.

Please note the following values are guidelines, and you should be able to use them safely, but you are encouraged to research what each one means so you can choose a value appropriate for your workload, hardware and use-case.

net.ipv4.ip_local_port_range='1024 65000'
net.ipv4.tcp_rmem='4096 87380 16777216'
net.ipv4.tcp_wmem='4096 65536 16777216'

Highlighting a few of the important ones…

net.ipv4.ip_local_port_range

To serve a client request via an upstream application, NginX must open two TCP connections: one to the client and one to the upstream. When the server receives many connections, this can rapidly saturate the system's available port capacity. The net.ipv4.ip_local_port_range directive increases the range to much larger than the default, leaving room for more allocated ports. If you're seeing errors in your /var/log/syslog such as "possible SYN flooding on port 80. Sending cookies", it might mean the system can't find an available port for the pending connection. Increasing the capacity helps alleviate this symptom.

net.ipv4.tcp_tw_recycle

When the server has to cycle through a high volume of TCP connections, it can build up a large number of connections in TIME_WAIT state. TIME_WAIT means a connection is closed but the allocated resources are yet to be released. Setting this directive to 1 tells the kernel to try to recycle the allocation for a new connection when it's safe to do so. This is cheaper than setting up a new connection from scratch. (Be aware that tcp_tw_recycle is known to break connections from clients behind NAT, and it was removed from Linux entirely in kernel 4.12; research it carefully before enabling it.)

net.ipv4.tcp_fin_timeout

The minimum number of seconds that must elapse before a connection in TIME_WAIT state can be recycled. Lowering this value means allocations are recycled faster.

How to check connection status

Using netstat:

netstat -tan | awk '{print $6}' | sort | uniq -c

Using ss:

ss -s


As load on our web servers continually increased, we started hitting some odd limitations in our NginX cluster. I noticed connections were being throttled or dropped, and the kernel was complaining about syn flooding with the error message I mentioned earlier. Frustratingly, I knew the servers could handle more, because the load average and CPU usage was negligible.

On further investigation, I tracked down an extraordinarily high number of connections idling in TIME_WAIT state. This was ss output from one of the servers:

ss -s
Total: 388 (kernel 541)
TCP: 47461 (estab 311, closed 47135, orphaned 4, synrecv 0, timewait 47135/0), ports 33938

Transport Total IP IPv6
* 541 - -
RAW 0 0 0
UDP 13 10 3
TCP 326 325 1
INET 339 335 4
FRAG 0 0 0

47,135 connections in TIME_WAIT! Moreover, ss indicates that they are all closed connections. This suggests the server is burning through a large portion of the available port range, which implies that it is allocating a new port for each connection it's handling. Tweaking the networking settings helped firefight the problem a bit, but the socket range was still getting saturated.

After some digging around, I uncovered some documentation about an upstream keepalive directive. The docs state:

Sets the maximum number of idle keepalive connections to upstream servers that are retained in the cache per one worker process

This is interesting. In theory, this will help minimise connection wastage by pumping requests down connections that have already been established and cached. The documentation also states that the proxy_http_version directive should be set to "1.1" and the "Connection" header cleared. On further research, it's clear this is a good idea, since HTTP/1.1 uses TCP connections much more efficiently than HTTP/1.0, which is the default in nginx's proxy module.

Making both of these changes, our upstream config looks more like:

upstream backend_nodejs {
  server nodejs-3:5016 max_fails=0 fail_timeout=10s;
  server nodejs-4:5016 max_fails=0 fail_timeout=10s;
  server nodejs-5:5016 max_fails=0 fail_timeout=10s;
  server nodejs-6:5016 max_fails=0 fail_timeout=10s;
  keepalive 512;
}

I made the recommended changes to the proxy directives in the server stanza. While I was at it, I added a proxy_next_upstream directive to skip out-of-service servers (helping with zero-downtime deploys), tweaked the client keepalive_timeout, and disabled all logging. The config now looks more like:

server {
  listen 80;

  client_max_body_size 16M;
  keepalive_timeout 10;

  location / {
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
    proxy_set_header Connection "";
    proxy_http_version 1.1;
    proxy_pass http://backend_nodejs;

    access_log off;
    error_log /dev/null crit;
  }
}

When I pushed out the new configuration to the nginx cluster, I noticed a 90% reduction in occupied sockets. Nginx is now able to use far fewer connections to send many requests. Here is the new ss output:

ss -s

Total: 558 (kernel 604)
TCP: 4675 (estab 485, closed 4183, orphaned 0, synrecv 0, timewait 4183/0), ports 2768

Transport Total IP IPv6
* 604 - -
RAW 0 0 0
UDP 13 10 3
TCP 492 491 1
INET 505 501 4


Thanks to its event-driven design and asynchronous I/O, Node.js is geared up to handle high volumes of connections and requests out of the box. There are additional tweaks and observations that can be made. We're going to focus on node.js processes.

Node is single-threaded and doesn't automatically make use of more than a single core of your (potentially multi-core) machine. This means that unless you design it differently, your application won't take full advantage of the available capacity the server hosting it has to offer.

Clustering node processes

It's possible to modify your application so that it forks a number of processes that all accept() connections on the same port, efficiently load-balancing across multiple CPU cores. Node has a core module called cluster that offers all of the tools you need to achieve this; however, it'll take a bit of elbow grease to work it into your application. If you're using express, eBay have built a module for it called cluster2.

Beware of the context switch

When running multiple processes on your machine, try to make sure each CPU core is kept busy by a single application process at a time. As a general rule, you should look to spawn N-1 application processes, where N is the number of available CPU cores. That way, each process is guaranteed a good slice of one core, and there's one spare for the kernel scheduler to run other server tasks on. Additionally, try to make sure the server runs little or no work other than your Node.JS application, so processes don't fight for CPU.

We made the mistake of deploying two busy node.js applications to our servers, with both apps spawning N-1 processes each. The applications' processes started vehemently competing for CPU, and CPU load and usage increased dramatically. Even though we were running these on beefy 8-core servers, we were paying a noticeable penalty due to context switching: the behaviour whereby the kernel suspends the state of one process in order to load and execute another. After simply reducing the number of processes the applications spawned, so that they each shared an equal number of cores, load dropped significantly:


Notice how load decreases as the number of processes running (the blue line) crosses beneath the number of CPU cores (the red line). We noticed a similar improvement across all other servers. Since the amount of work spread between all the servers remained constant, this efficiency must have been due to reduced context switch overhead.


5 steps to making a Node.js frontend app 10x faster

How we made GoSquared 10x faster

Making GoSquared Dashboard faster

Over the last couple of months we've done a huge amount to make Dashboard (the node.js application that powers the Now, Trends and Ecommerce front-ends) much faster. Here's a brief summary of our story in making that happen.

What it used to be like

Back in November, loading any dashboard would take anywhere upwards of 30 seconds. Simply loading the HTML page itself would take a minimum of 10 seconds, then the application would request several other JavaScript and CSS files, each with a response time averaging 5 seconds.

Clearly this was not acceptable, so we set about doing everything we could think of to make things faster.

Step 1: Parallelize Everything

In order to render the HTML page for any dashboard, the node.js application needs to retrieve a lot of data for the dashboard in question.

At minimum this means it needs to retrieve the data from the user's current browsing session to check they're logged in and it needs to pull in data about the user (e.g. the user's name, which sites they have access to, their API key and the parameters of their GoSquared subscription), and about the site in question for the dashboard (site name, unique token etc).

In order to retrieve this data, the application needed to make several calls to internal API functions, many of which could take up to 2 seconds to complete. Each request was made by a separate Express middleware, which meant they ran in series: each request would wait for the previous one to complete before starting.

Since node.js is perfectly suited to running multiple asynchronous functions in parallel, and since a lot of these internal API requests didn't depend on each other, it made sense to parallelize them — fire off all the requests at once and then continue once they've all completed. We achieved this with the aid of the (incredibly useful) async module:

So instead of:


… we could do something like this:

function parallel(middlewares) {
  return function (req, res, next) {
    async.each(middlewares, function (mw, cb) {
      mw(req, res, cb);
    }, next);
  };
}


Straight away this cut our average response time down from 10 seconds to roughly 1.5 seconds. But we knew we could still do better.
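For completeness, here is a self-contained, runnable variant of the same parallel-middleware pattern, with async.each swapped for a plain counter and two illustrative middlewares standing in for the real internal API calls:

```javascript
// Run a set of middlewares concurrently and call next() once they
// have all finished, failing fast on the first error. This is the
// same job async.each does in the snippet above.
function parallel(middlewares) {
  return function (req, res, next) {
    var remaining = middlewares.length;
    var failed = false;
    if (remaining === 0) return next();
    middlewares.forEach(function (mw) {
      mw(req, res, function (err) {
        if (failed) return;
        if (err) { failed = true; return next(err); }
        if (--remaining === 0) next();
      });
    });
  };
}

// Illustrative middlewares; each finishes on a later tick, like a
// real session or user lookup would.
function fetchUser(req, res, cb) {
  setTimeout(function () { req.user = { name: 'demo' }; cb(); }, 10);
}
function fetchSite(req, res, cb) {
  setTimeout(function () { req.site = { token: 'abc' }; cb(); }, 10);
}
```

Mounted with app.use(parallel([fetchUser, fetchSite])), both lookups run at once and the request proceeds when the slower of the two finishes.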

Step 2: Cache, Cache, Cache

Even once we'd parallelized all of our internal data-fetching, loading a dashboard was still pretty slow. The reason was that the application was fetching all this data not only for the initial page load, but also for a lot of subsequent JavaScript requests (at this point we were still limiting widgets in the dashboard based on GoSquared plan, so we needed to restrict who had access to which resources). And every one of these subsequent requests also had an average response time of about 1.5 seconds.

The solution to this was to cache any fetched data that wasn't likely to change. A user isn't going to upgrade or downgrade their GoSquared subscription in the 2 seconds it takes for the dashboard to load its JS, so there's no point fetching subscription data again if we've already fetched it once.

So, we went ahead and cached all the data we could, cutting response times down from 1.5 seconds to about 500ms on any requests which already had the data cached.
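The caching idea can be sketched as a small memoising wrapper around any slow, callback-style fetch. The function names, the TTL, and the 50ms stub delay are illustrative, not GoSquared's real implementation:

```javascript
// Memoise a callback-style fetch per key: repeat requests within
// the TTL are answered from memory instead of re-fetching.
function cached(fetchFn, ttlMs) {
  var cache = {};
  return function (key, cb) {
    var hit = cache[key];
    if (hit && Date.now() - hit.at < ttlMs) {
      // Serve from memory on a later tick to keep the API async.
      return process.nextTick(function () { cb(null, hit.value); });
    }
    fetchFn(key, function (err, value) {
      if (err) return cb(err);
      cache[key] = { at: Date.now(), value: value };
      cb(null, value);
    });
  };
}

// e.g. wrap a slow subscription lookup (the 50ms delay stands in
// for the real internal API call):
var getSubscription = cached(function (userId, cb) {
  setTimeout(function () {
    cb(null, { plan: 'pro', userId: userId });
  }, 50);
}, 60 * 1000);
```

The first call for a user pays the full fetch cost; every call in the next minute returns almost instantly.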

Step 3: Intelligent JS and CSS loading on the front-end

The front-end of the dashboard application has a lot of interconnected components. The JavaScript for the application falls into three main parts: libraries (such as jQuery, D3 etc.), the main application core, and widgets (each widget in the application is modularised and has its own code). Code in each of these parts is edited in very different ways: libraries are barely touched and are updated at most once a month; the core is changed several times a day; widgets can vary from receiving several changes in a day to not being touched in weeks.

Originally we bundled all our libraries into the core application bundle (which was included via a script tag on the page), and all of the widgets into a secondary bundle which was loaded dynamically. This meant that even with good cache control, any tiny change we made to the core code would mean browsers would have to download all of the (unchanged) library code, and any change to one widget would require downloading all of the widgets again.

One way around this problem would be to break each individual component into its own file and include them all individually; that way any files that don't change frequently can sit in the browser's HTTP cache and not be requested. The problem with this, though, is that there would be a lot of files, some of them incredibly small. And (especially on mobile browsers) the overhead of loading that many individual resources vastly outweighs the overhead we had before of re-downloading unchanged content.

We eventually came up with a compromise solution based on Addy Osmani's basket.js, using a combination of server-side script concatenation and localStorage for caching. In a nutshell, the page includes a lightweight loader script, which figures out which JS and CSS it has already cached and which needs to be fetched. The loader then requests all the resources it needs from the server in one request, and saves all the resources into localStorage under individual keys. This gives us a great compromise between cutting down the number of HTTP requests and maintaining cacheability, without re-downloading code unnecessarily when it hasn't changed. Additionally, after running a few benchmarks, we found that localStorage is (sometimes) actually faster than the native HTTP cache, especially on mobile browsers.
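The loader's core logic can be sketched like this. The storage object is passed in so the same logic works with window.localStorage in the browser, and fetchBundle() is a hypothetical stand-in for the single combined server request; none of this is basket.js's actual API:

```javascript
// Keep each resource under a name:hash key, fetch only what is
// missing in one request, and read everything else from cache.
function loadResources(manifest, store, fetchBundle, done) {
  var missing = manifest.filter(function (r) {
    return store.getItem(r.name + ':' + r.hash) === null;
  });

  function lookup(r) {
    return store.getItem(r.name + ':' + r.hash);
  }

  if (missing.length === 0) {
    // Everything is cached; no HTTP request at all.
    return done(null, manifest.map(lookup));
  }

  // One request for everything we don't have yet.
  fetchBundle(missing, function (err, bodies) {
    if (err) return done(err);
    missing.forEach(function (r, i) {
      store.setItem(r.name + ':' + r.hash, bodies[i]);
    });
    done(null, manifest.map(lookup));
  });
}
```

Because the cache key includes the content hash, changing a resource changes its key, so stale copies are simply never looked up again.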

Along with this, we also switched all of our static (JS and CSS) asset loading to be served through CloudFront, Amazon Web Service's content delivery network. This means content is served from the nearest possible geographic location to the user, cutting down request latency from as high as 2500ms (in places such as Singapore) to tens of milliseconds.

We also introduced some optimizations to prevent loading or storing duplicate code. For example, the Languages widget uses exactly the same code in Now, Trends and Ecommerce. By de-duplicating the caching and requests based on a digest of each resource's contents, we were able to cut out unnecessary requests and storage.

With these intelligent changes to resource loading we were able to cut down the total number of HTTP requests necessary to render the dashboard to one (just the page itself), which meant that for users quickly switching between dashboards for different sites, each dashboard would load within a few seconds.

But we could do even better.

Step 4: Cut out the middle-man for fetching data

All the user, site and subscription data described in the first two steps was being fetched via a secure internal HTTP API to our internal account system, which at the time was written in some old, clunky, slow PHP. As part of our extensive rewrite of that whole system from PHP to Node, we were able to cut out the internal HTTP component completely, instead including a node module directly in the dashboard application and querying our databases directly. This gave us much finer-grained control over exactly what data we were fetching, as well as eliminating a huge amount of overhead.

With this significant change, we were able to reduce our average response time (even without the caching described in Step 2), to 25ms.

Step 5: Do More on the Client

Thanks to all the changes we'd made up to this point, all that differed between dashboards for different sites was a config object passed to the loader on initialization. It didn't make sense, therefore, to reload the entire page when simply switching between sites or between Now and Trends, if all of the important resources had already been loaded. With a little bit of rearranging of the config object, we were able to include all of the data necessary to load any of the dashboards accessible to the user. Throw in some HTML5 History (pushState and the popstate event), and we're now able to switch between sites or dashboards without making a single HTTP request, or even fetching scripts out of the localStorage cache. This means that switching between dashboards now takes a couple of hundred milliseconds, rather than several seconds.

What else?

So far all this has been about reducing load times and getting to a usable dashboard in the shortest time possible. But we've also done a lot to optimise the application itself to make sure it's as fast as possible. In summary:

  • Don't use big complex libraries if you don't have to — for example, jQuery UI is great for flexibility and working around all manner of browser quirks, but we don't support a lot of the older browsers so the code bloat is unnecessary. We were able to replace our entire usage of jQuery UI with some clever thinking and 100-or-so lines of concise JS (we also take advantage of things like HTML5's native drag-and-drop).

  • Even respectable libraries have their weak spots — for example we use moment with moment-timezone for a lot of our date and time handling. However moment-timezone is woefully inefficient (especially on mobile) if you're using it a lot. With a little bit of hacking we added a few optimizations of our own and made it much better for our use-case.

  • Slow animations make everything feel slow — a lot of studies have been posted about this in the past, and it really makes a difference. Simply reducing some CSS transition times from 500ms to 250ms, and cutting others out entirely, made the whole dashboard feel snappier and more responsive.

  • Instant visual feedback — one of the big things we found when using Trends was that switching between time frames just felt slow. It took under a second, but because there was a noticeable delay between clicking on the timeframe selector and anything actually happening, things felt broken. Fetching new data from our API is always going to take some time — it's not going to be instant. So instead we introduced the loading spinner on each widget. Nothing is actually any faster, but the whole experience feels more responsive. There is immediate visual feedback when you click the button, so you know it's working properly.

  • Flat design is actually really handy for performance — it may well just be a design trend, but cutting out superficial CSS gradients and box shadows does wonders for render performance. If the browser doesn't have to use CPU power to render all these fancy CSS effects, you get an instant boost to render performance.

Now dashboard in action

What next?

Even after all these optimizations and tweaks, we're well aware that there's still plenty of room for improvement. Especially on mobile, where CPU power, memory, rendering performance, latency and bandwidth are all significantly more limited than they are on the desktop. We'll continue improving everything we can, so stay tuned for further updates!

So long, Campfire. Hello Slack!

How we completely changed our internal communication in one morning

Moving from Campfire to Slack

Here at GoSquared, we've long been fans of Campfire by 37signals. We've been using it for a couple of years for our internal team chat, along with the help of our friendly company Hubot (an awesome chatbot made by GitHub). However, there have always been a couple of things we haven't been so keen on: there's a roughly five-second delivery lag on messages (which means you can be waiting about ten seconds for a reply from Hubot); there's no support for one-on-one chats or private groups (we used Skype for this); the Jenkins plugin (which we used to track all our code builds and deploys) has been broken for months; the official first-party app is, well, a bit rubbish (although Flint for Mac and iOS is excellent); and there's no native app for Android at all.

So, with the news that 37signals is no longer focusing on Campfire, we took the opportunity to investigate other options.

Enter Slack

Slack is a relatively new (only just today out of limited preview) communication platform that serves as a drop-in replacement for Campfire. We've actually been on the limited preview since November, so we thought we'd give it a shot and experiment with it. In a somewhat bold move last Friday, we decided to ditch Campfire completely for the morning and communicate only in Slack, to see how easy it was to use. And, well, we never looked back!

Slack is everything we need for team communication. It has group chats (or "channels"), which you divide by purpose (for example, we have a development channel, a design channel, a channel for random silly stuff, and several others). It has one-on-one direct messages, meaning we can do away with Skype and communicate solely in one place. It has native Mac, iOS and Android apps.

And it's fast. Boy is it fast. It may not seem like much, but coming from Campfire, where every message would take several seconds to be delivered (a bit ridiculous if you're just chatting to the person sitting right next to you), having a chat where messages come through as good as instantaneously, or where you only have to wait half a second for a reply from Hubot, makes a huge difference.


Hubot integrate all of the things

The best part of all is all the available service integrations. We now have a one-stop-shop with all our GitHub activity, Jenkins builds, PagerDuty notifications, customer support and much more. We even got to keep Hubot and all our custom commands thanks to their Hubot adapter. And it was all ridiculously easy to set up. Not only did we switch our whole team to Slack in that one morning, we also got almost all of our core service notifications set up too. And we're still adding the odd extra one every now and again.

It's still early days for GoSquared and Slack, but we love what we see so far. We've been using it for less than a week and it's already completely reinvented the way we communicate as a team.

Stay tuned to the Engineering Blog for updates on how we use Slack as a team. Also keep an eye out if you're more interested in the technical stuff, like our service integrations or how we use Hubot — we'll be posting more about that in the next few days!

P.S. Just yesterday, Slack launched publicly – if you'd like to get $100 of service credit, and reward us with the same, then sign up here and do us both a favour.

Using secure sessions behind an HTTP proxy

Making things more secure

GoSquared is served entirely via HTTPS, so it was a logical and easy decision to modify our user sessions to use secure cookies. A couple of lines of configuration later, and we were good to go.

Not quite.

We use Node.js extensively, and the session middleware in Connect (which Express uses) will refuse to set secure cookies when the connection isn't encrypted (req.connection.encrypted), unless the proxy option is set to true and the x-forwarded-proto header is https. This isn't required for standard secure cookies, but it's been coded into Connect, presumably for security reasons.

Why does this matter? Isn't everything served via https anyway?

Of course, but everything is also served via an ELB which proxies to our nginx cluster, which in turn proxies to our app servers via internal http connections. The fix is trivial, since it's easy to set or modify headers in nginx (making the header validation in Connect somewhat pointless): proxy_set_header x-forwarded-proto https;
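On the application side, the session configuration then needs the matching options. A sketch assuming Express with the Connect session middleware; the secret is a placeholder, and proxy: true tells the middleware to trust the x-forwarded-proto header set by nginx:

```javascript
app.use(express.session({
  secret: 'replace-me',   // placeholder; use your real session secret
  proxy: true,            // trust x-forwarded-proto from the proxy
  cookie: {
    secure: true,         // only send the cookie over HTTPS
    httpOnly: true        // not readable from client-side JS
  }
}));
```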

In completely unrelated news, sessions on GoSquared now use secure + httponly cookies!

P.S. Remember to add proxy_set_header Host $host; as well if you need the Host header to be forwarded; it appears to get lost otherwise.

How we made the GoSquared site 4x faster

Fast analytics needs a fast marketing site

Making the GoSquared site 4x faster – the new static website

It's a well-known fact that a static website will usually be faster than one where everything runs through a server-side language like PHP or Ruby. Until a few weeks ago, the majority of the GoSquared site was still being generated by PHP code. Some of this code was as old as GoSquared itself and most of it had become obsolete, unnecessarily slowing down the site.

We recently replaced all this with a brand new static site.


Building your website using PHP encourages you to put a lot of logic and functionality within an HTML page, rather than thinking through exactly why that functionality is needed and how to implement it properly.

We're also huge Node.js fans at GoSquared, and we wanted to have all of the account logic and administration in a Node.js application, which you'll be able to read more about in future posts.


In HTML, CSS and JavaScript, we follow the concept of Separation of Concerns (keep all files separate and detached) to improve the maintainability of code: there shouldn't be any inline CSS in your HTML. But we did have a huge amount of inline PHP, which is arguably even worse.


Static HTML files are a lot faster than dynamic pages. Since release, we've seen a very noticeable decrease in page load times. We were also able to minify and compress all of the assets that make up those pages. View the source if you want to see: HTML is 20% smaller, CSS 17% smaller, and JavaScript a massive 40% smaller. This has a huge impact, especially for our mobile users.

PHP can also be slow, and we knew we could speed things up by writing all admin functions in Node.js and making as much as possible asynchronous and parallel, while removing all PHP from the pages so there will be no server processing before page load.

Uptime, costs and scalability

S3 has an availability of ~99.99% and can handle any amount of traffic, so unless we make a serious mistake, the site should never go down due to high traffic. Having a few static files on S3 is a lot cheaper than running a small cluster of PHP servers, and much easier to manage.

We do still route all traffic through our nginx cluster, but that scales incredibly well. See our post on scaling nginx with Node.js.



Serenity is a static site generation framework written in Node.js (open source and available on NPM). I wrote Serenity because there was a lack of tools for developers who like to use EJS and JSON (rather than Liquid and YAML). It's basically Jekyll for JavaScripters.

It allows easy asset management and compression, and is optimised for CDNs by providing asset-path helpers that append a hash to the asset name so long cache TTLs can be used.

PHP's include('footer.php') was replaced with <% include footer %>, with the added benefits of layouts, templates, inline assets and much more.

We used layouts and includes extensively to provide a consistent site look and feel across all of the pages. One update to the signup form include will update every form on the entire site and regenerate the HTML accordingly.


Node.js is fast, we already know that. It's even faster if the functionality is planned and code is written to play to the strong points.

Account actions, such as logging in or signing up, are 4x faster than before. We were able to utilise the asynchronous nature of Node.js (keeping it tidy with async) and fire off multiple concurrent MySQL queries (using node-mysql's connection pooling). Again, more on this later.
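The shape of those concurrent queries can be sketched as follows. The real code uses the async module and node-mysql's pool; here the pool is a stand-in stub (and the table names are illustrative) so the pattern is runnable on its own:

```javascript
// Fire independent queries against the pool at the same time and
// collect the results, instead of waiting for each in sequence.
function loginQueries(pool, email, cb) {
  var results = {};
  var pending = 2;
  var failed = false;

  function collect(name) {
    return function (err, rows) {
      if (failed) return;
      if (err) { failed = true; return cb(err); }
      results[name] = rows;
      if (--pending === 0) cb(null, results);
    };
  }

  // Both queries hit the pool concurrently.
  pool.query('SELECT * FROM users WHERE email = ?', [email], collect('user'));
  pool.query('SELECT * FROM sites WHERE owner = ?', [email], collect('sites'));
}
```

With two 50ms queries, the whole lookup takes about 50ms rather than 100ms.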


The use of AJAX for logins and signups has made a significant difference too. There's no longer a page reload before being told that you've mistyped your password.


We've really seen the benefits of converting our public site to be completely static. It's faster, and our users' experience is ultimately more important than the underlying code, so it's a success already. On top of that, we've managed to build a maintainable system which we can update and improve long into the future.

What are your thoughts? Let us know @GoSquared on Twitter.

Making mobile web apps feel faster

Using TouchStart events for speed

The new GoSquared Analytics Dashboard running on iOS on iPhone 5 - responsive design with touchstart events in action

Interacting with mobile web apps often feels sluggish and slow, but our phones are more than capable of handling the applications themselves.

When you touch a button, you expect it to perform its action immediately. However, a click event listener will only trigger when the touch finishes (touchend), rather than when it begins (touchstart). This delays every action unnecessarily.

TouchStart to the rescue

The easy solution, then, is to bind all your click events to touchstart too.

Oops, we've just made every event trigger twice on touch devices. The solution is also simple: store a property on the event object to check whether it's already been handled. Luckily, the event object is the same across both triggers, otherwise things would be a lot more difficult.

Quick Example

Old code

$('.my-button').on('click', function(e) {
    // perform the button's action
});

New Code

$('.my-button').on('click touchstart', function(e) {
    if (e.handled) return;

    e.handled = true;

    // perform the button's action
});


Easy, right? At GoSquared we've made sure we use this solution in as many places as possible, especially our new menubar in GoSquared Beta. Apply for early access!

Note: don't add the touchstart listener to very important buttons, such as account settings, because if the user starts scrolling down with their finger over the button, it'll trigger when they don't expect it.

Introducing Ribbon

A lightweight Node.js service wrapper

This morning the silence of my room, and subsequently the serenity of my rare hours of sleep were suddenly pierced by my iPhone's ringtone at 5am. On the other end of the line was the mechanised TTS-voice of our server monitoring system, informing me that our DBs had noticed a drop in connections.

As was my duty, I hopped on to find the source of the issue. It turned out to be faulty reconnection logic in one of many adapters we had developed for managing connections to MySQL servers in our applications. It was a bug that still existed simply because we had too many of these adapters lying around to feasibly maintain, and this one slipped through.

Over many months, I've noticed that the various clients on offer for connecting to and working with external services, like Redis, MongoDB, RabbitMQ, MySQL etc. all share very similar concepts: a connection is established to a service running on a server, and depending on the state of that connection, you interact with that service using the client.

However, it's apparent that while sharing similar concepts (such as a connection must be active before querying, and you won't get anywhere with a dead connection), these modules didn't stick to a consistent methodology for your application to understand and respond to changes in state. They'd often use different event names to describe the same states (such as 'connected', 'ready' or 'open' to represent an active connection, and 'close', 'end', 'exit' or 'error' to denote an inactive connection). Worse still, they commonly lacked convenience methods to check the state, such as .isUp() or .isDown().

That's why I wrote Ribbon

Predominantly formulated during the small hours of this morning, Ribbon is simply defined as a lightweight Node.js service wrapper. It provides a standard set of events and methods for use in your applications to react to and verify the status of external services, as an alternative to dealing with the varying terminology and definitions for common state events.

Ribbon instead exposes a set of standard events, such as simply 'up' for connection available, and 'down' for connection unavailable.

Ribbon includes a collection of adaptors which inherit the ribbon interface. An adaptor is a controller for your underlying connection or service, known as the client. The adaptor is the interface between the service's client and ribbon.

Adaptors are bundled as part of Ribbon, and I'd be grateful to anyone contributing adaptors for their use case. I'll be putting together a guide to building an adaptor before too long.

See the repo for more info + README.

10 Vital Aspects of Building a Node.JS Application

The 10 commandments of Node


It sounds agonisingly obvious, but the same goes for everything you decide to build. Your app needs to have purpose. A job to do. A problem to solve. Solid reasoning here will cement durable foundations for the application itself. It will help you visualise a path towards the solution, as well as maintaining focus on the ultimate goal when you get stuck in and iterate.


Structure concerns source code layout, file arrangement, library/module usage and on the whole describes the way the application has been weaved together. Its form will vary greatly depending on the nature of the app you're building. It could be a web application server, which might be an express app handling static assets, routing, and application logic all in the same app. Or it may be more like a scheduler/worker pipeline, queueing and processing items from a queue. Regardless of the purpose, there are several common patterns that are useful to follow.

  • Modularity. Try to keep your code as DRY (Don't Repeat Yourself) as reasonably possible. If you realise you're going to be needing similar code in many distinct locations or scripts, it's common to drop the function (or 'helper') into a separate file (or module), which may export a collection of helper functions. The module can then be included via node's require() in all dependent scripts. The aim is to not only avoid rewriting similar functionality multiple times, but also provide an easy way to update the functionality in a single place.

  • Following node conventions, it's common to keep the files for 3rd party node modules in the node_modules/ folder. It's also common to put node_modules in your .gitignore, so that you're not committing irrelevant dependency files.

  • Separate concerns. Assets for frontend (static CSS, javascript, HTML, templates, images, files) should be isolated from backend application logic (routes, server, middleware). Likewise keep deployment scripts, config files, data fixtures and tests separate.


Your method of shipping applications to production can vary greatly depending on the nature of your stack. Here's what we've tried in the past:

  • Manually SSH'ing into servers and cloning the git repository. Pros: full manual control, zero deployment tooling setup. Cons: completely unfeasible for a large number of servers. Everything must be set up manually, so you don't get any benefits such as upstart / initrc supervision or logging without extra work.

  • Capistrano. Pros: Standard procedure for developers on your team. Simple to run: cap deploy. Cons: Trickier to set up. Introduces Ruby dependency.

  • Chef scripts. Pros: Scripted procedures for installing apps. Cons: Need to cook your servers each time you want to deploy. Chef is best used for server installation / config, not app deploys.

  • Deliver. A deploy tool born at GoSquared when we got sick of battling with the other options. It was inspired by Heroku's git-push based deploy system. All you need to do is configure a system user for the application (which you should do anyway – we use chef to automate this), set up a basic deliver config, and then run the deliver command in your project (after adding it to your $PATH – see deliver setup instructions). It pushes the application to your server(s) using git via SSH, and can use foreman or equivalent to install upstart supervision of application booting, respawning and recovery.

This is by no means an exhaustive list of deploy methods, and you may need to be a bit creative to come up with a solution to best fit your needs. Whatever strategy you employ, it's a good idea to include deploy configurations in your application's source control, and to document deploy processes in your README.


Virtually every application has constants and settings that will need to be changed at convenience. The common ones are hostnames, port numbers, timeouts, module options and errors. It's helpful to keep these values in one place, either in a file or in multiple files if there are enough of them. Doing so makes them quicker to change, without having to spend time combing through the code to track them down.

I used to just dump configuration settings into a file that exported an object of configuration properties. This worked well for a very specific target environment, such as the production environment at the time, but over time it started becoming a maintenance bottleneck that lacked clarity and sprawled into a scribble of conditional statements and multiple files as the infrastructure around the app changed.

We've since had great results from environment-aware configuration. The idea is you can change configuration values based on the environment in which your application is running. The way this works is simple. You export a shell environment variable called $NODE_ENV containing an identifier for the environment mode you'll be running the app under. Your application will then tailor the configuration settings using those you've defined specially for that environment when it starts up.

Environment-aware configuration offers you more leeway throughout the whole application lifecycle. You should be able to develop and run your applications locally, without the need for internet connectivity (you want to be able to hack on trains right? In fact I'm writing this on a train right now ;)). That will require pointing hosts and ports to local services. Then, you'll want to be able to test-drive on a staging server before deploying to production. Each of these will likely require different configurations.

We commonly use node-config, a module that's been designed precisely for the job. All you need to do is define your configuration values in config/default.js, and then make a file for each of your various environments, which contain directives that will extend the defaults in default.js. You set the $NODE_ENV variable with the environment name, and the module will override the defaults with the properties defined in [$NODE_ENV].js. To import the merged configuration object into your app, you simply require() it.
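node-config handles the merging for you, but the underlying pattern is simple enough to sketch without any dependencies (the hostnames here are hypothetical):

```javascript
// A dependency-free sketch of environment-aware configuration —
// node-config automates this kind of merge from config/*.js files.
var defaults = {
  db: { host: 'localhost', port: 3306, timeout: 5000 }
};

// per-environment overrides (hypothetical hostnames)
var environments = {
  staging:    { db: { host: 'db.staging.internal' } },
  production: { db: { host: 'db.internal', timeout: 1000 } }
};

function buildConfig(env) {
  var config = JSON.parse(JSON.stringify(defaults));  // deep copy
  var overrides = environments[env] || {};
  Object.keys(overrides).forEach(function(section) {
    Object.keys(overrides[section]).forEach(function(key) {
      config[section][key] = overrides[section][key];
    });
  });
  return config;
}

var config = buildConfig(process.env.NODE_ENV || 'development');
```

Unspecified environments simply fall through to the defaults, so local development needs no special setup.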

Logs, Metrics and Monitoring


You'll want to give yourself enough evidence to work with should your application misbehave, so you can shoot through from 'b0rked' to 'fixed all of the things' status in as little time as possible. One of the best (and old skool) ways to do this is trusty old logging. The general premise is: if you get an error, log the bugger. You should be following node's error handling convention, where the first argument of a callback is reserved for error information should one occur:

makeRyanDahlProud(function(err, result) {
  if (err) return console.error(err);
  // otherwise, handle result
});

How you react to that error, however, is up to you. You may want to log it and carry on. Or you may want to halt execution of that callback. Regardless of your needs, you should have a way of referencing that error in the future, and logging is a simple way to achieve that.

Although logging errors is good practice, it can potentially lead to a lot of messages being sent to the logs / terminal. The mighty TJ Holowaychuk has developed a module called debug that allows you to namespace log messages so you can later filter the signal from the noise by glob-matching these log message namespaces. TJ has plenty of other handy modules in his repertoire.


Application metrics offer valuable insight into what your application is doing and how often. They serve as a great way to detect unexpected activity, spot bottlenecks, and act as a point of reference for scaling plans. I put together a simple module called abacus which helps you maintain a collection of counters and optionally flush them to graphite via statsd for plotting visualisations. This has proved exceptionally handy for ensuring the application is behaving within intended operational parameters.


Not always necessary from the get-go, but it's usually a good idea to keep an eye on resource utilisation information from the server hosting your application. Another early-warning system is useful to have, and it'll help avoid silly reasons for your application to go down. There's nothing more embarrassing than your application breaking because your server ran out of disk space, or you were maxing out the CPU so much it melted.

There's a huge variety of monitoring tools and services out there: Ganglia, Monit and Sensu to name a few open source ones, and ServerDensity, NodeTime and NewRelic as SaaS services.

Fault tolerance

Tying in closely with deployment considerations, you should think about what'll happen if your application crashes. It's best to have the application under the control of a system supervisor, such as upstart on Ubuntu. Configuring upstart is for the most part trivial, and can handle starting, stopping, and restarting the application if it explodes, so it's worth doing. Foreman has an export facility that generates upstart configs for your foreman-backed application.
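For reference, a minimal upstart job for a node app might look something like this (the paths, username and log file are hypothetical):

```
# /etc/init/myapp.conf — hypothetical paths and user
description "myapp"

start on (filesystem and net-device-up IFACE=lo)
stop on shutdown

# restart the process if it dies, but give up if it respawns
# more than 10 times in 5 seconds
respawn
respawn limit 10 5

exec sudo -u appuser /usr/bin/node /srv/myapp/server.js >> /var/log/myapp.log 2>&1
```

With this in place, start myapp, stop myapp and automatic respawning all come for free.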

Even if your application will be rebooted if it crashes, what are the implications of it doing so? Will it take your service down for some time? Will it lose data? Will it leave partially-complete work? These are consequences you must architect around, and redundancy is a good way to achieve that. For example, if you're round-robining traffic across a number of servers, consider adding a reverse-proxy intermediary (a load balancer like HAProxy or web server such as nginx) to remove the faulty instance of your app from load balancing until its health checks start passing again.
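As a sketch of that setup, nginx's upstream module can take an unhealthy backend out of rotation after repeated failures (the addresses below are hypothetical):

```
# Hypothetical addresses: nginx stops proxying to a backend after
# max_fails failed attempts within fail_timeout, then retries it.
upstream node_app {
    server 127.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 127.0.0.1:3001 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://node_app;
        # on an error or timeout, retry the request on the next server
        proxy_next_upstream error timeout;
    }
}
```

A dedicated load balancer like HAProxy adds active health checks on top of this passive failure detection.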

Efficiency & Scaling

Rarely an early-stage concern, but as your application matures and handles a high workload, you may need to think about making the app more efficient, or even scaling it. The risk here is premature optimisation. You shouldn't worry about making your app super-scale or super fast early on, because let's face it, before it actually does get high load, why would you bother? Your precious time is much better spent on building the essential featureset – 'minimum viable', as the lean startups like to call it – at least to get it to the stage where you might need to scale.

When you do hit that stage where a single node with a single instance of your app is not enough, you have several options open to you, all at the mercy of trade-offs which can make it feel like a bit of a minefield. The main priority is to seek out the bottleneck. Why is the application not fast enough? Is it maxing out CPU? Is disk I/O throughput sufficient and consistent?

Sometimes the easiest answer is to stick the app on a bigger server with more resources. If this'll work for you, then it'll get the job done quicker than re-architecting the app to work in a multi-node setup, but if you're growing really fast then it's not the most sustainable solution. This method might buy you enough time while you are designing your retaliation, however.

Scaling the app horizontally across multiple servers is tricky and introduces lots of scope for failure, but carries long-term viability and a fascinating technical challenge.

Docs & Team Collaboration

An application without documentation is like flat-pack without instructions. You can kind of figure out what it's supposed to do, but doing so is clumsy, time consuming and imprecise. It's much better to provide clear documentation that succinctly complements your code. You're not only going to help others get up and developing quickly, but also salvage your forgetful self when you return to the app in 6 months' time to fix an obscure bug.

I'm not advocating writing essays or tautological assertions that can otherwise be discerned from the code. There's a lot that code can explain itself when it's written clearly and simply. Instead, your documentation should colour in the grey areas that the code cannot clearly convey. Comments should help explain trickier portions of functionality, as well as inform about design decisions, trade-offs, dependencies, pitfalls, edge-cases and other considerations made. As an application matures, the documentation:code ratio should increase, to reflect its journey towards stability from transience. There's no point going overboard on the documentation when the app is in its infancy. It will change rapidly in the early stages, and you'll sink a load of time into documentation that becomes obsolete in a short time.

Every application should include a README[.md] which contains all the need-to-know essentials of working with the app. Commonly this comprises:

  • A brief description of the app and its purpose

  • Setup instructions

  • Booting instructions

  • Testing instructions

  • Deployment considerations

  • Any other need-to-know pointers

We built a little module that can extract source code comments and generate clean, attractive documentation which reads parallel to the code. It's called docker and we use it in most of our main apps.


When I was starting out, I never realised the importance of testing and never really bothered. Perhaps it's because I'd never been in the situation where, years down the line, your application starts failing and you've no idea how to guarantee which components were working properly and which weren't. Yeah, well, now I've had that experience, it's not pretty. You need tests.

A rigorous attitude to testing encourages good application design paradigms. You are required to think laterally to compartmentalise components of your app and make it possible to test them individually. This goes beyond basic unit tests, which tend to be unnecessarily pedantic, to more informative component and integration tests where you can write probes to ensure your application is consistently fused together properly.

A good practice is to write tests as you develop your application (once you're more confident that the functionality you're testing is not so much in flux) so that as you progress with the app, you have an ever increasing battery of tests on hand to continuously run. This helps you guarantee you don't break functionality with new developments (regressions). Tests also serve as a great usage example for other developers to understand exactly how various parts of the app are supposed to behave and what the results should be.

There's a variety of testing frameworks to choose from employing a range of different techniques (BDD, TDD). Amongst others, we've tried vows and tap to date but my favourite so far is mocha in conjunction with should.js, which I feel hits the right balance of structure and tooling vs pure javascript, allowing you to use all the same source libraries, script files and boot servers from your tests as you would running the app.


Node includes a powerful module system and a package manager called "npm" which is designed to help you seamlessly integrate modules into your app. npm endows you with a wealth of open source modules archived in its index where you're likely to find what you're looking for.

Once you figure out what components your app might need (such as a client with which you'll communicate with a redis database), it's wise to first check userland node modules in the index for existing implementations. Node.js is one of the hacker's favourites for experimentation, so there's a huge range of modules available for you to try out. Naturally, the quality fluctuates dramatically, and you'll need to build up a sense for what makes a good module and what doesn't (shortlist: README, tests, examples, reputable author), but generally there's almost always something available for what you require.

Error handling in Node.js

Exploring error handling methods for Node.js apps

Error handling can be a drag, but it's essential for the stability of your app. Naturally, I'm interested in ways to streamline the error handling process to make it as stable as it can be for the app whilst also being convenient for me to write.

Lately, I've been reading a lot about new features in ES6, such as generators, and that's led me onto promises, an alternative method of asynchronous flow control to callbacks. I decided to look into the differences in how these different methods approach error handling, their strengths and weaknesses.

I/O errors: callbacks vs promises

The main kind of errors we're looking at here are I/O errors in asynchronous operations. These occur when an I/O operation fails to yield the expected results, sometimes due to some external problem outside of your program's control. For example, we might be fetching data from a MySQL database, but our query contains an error:

// We misspell 'SELECT' in this query so it fails
var query = 'SLECT 1 + 1';
con.query(query, function(err) {
  // query failed. what do we do now?
});


Notice that in this example we are using Node's default style of using a callback to handle the result of the I/O. The first argument of the callback function is err. This is the standard convention in Node, the one you should follow when writing your own async functions.

The first argument to callbacks should always be err

Developers new to Node sometimes forget to follow this convention which makes it very frustrating for other developers trying to work with their code, because they have no consistent point of reference to check whether the operation succeeded or not. But if the first parameter to our callback is reserved for errors then they can be checked before processing the results of each callback.

If err is falsy (usually null), then the callback can carry on assuming the operation succeeded. Otherwise, it can deal with the error in an appropriate way, such as logging it along with any contextual information. It can then decide whether or not to carry on depending on the severity of the error or whether or not the resultant data is required to continue operation.

Let's implement some error handling for our query error:

var log = console.log;

// We misspell 'SELECT' in this query so it fails
var query = 'SLECT 1 + 1';
con.query(query, function(err) {
  if (err) return log("Query failed. Error: %s. Query: %s", err, query);

  // otherwise, process the result
});

Here, we check if err is present. If it is, we log the error and the query that triggered it then return from the function, stopping it from running any further.


You might have a collection of multiple async operations executing in parallel. How do we handle errors in any of those?

Our favourite library for asynchronous flow control is async. Both async.parallel and async.series accept a collection of operations, and if any of them pass an error to its callback, async will immediately invoke your completion callback with the error:

var async = require('async');
var log = console.log;

var op1 = function(cb) {
  // We misspell 'SELECT' in this query so it fails
  var query = 'SLECT 1 + 1';
  con.query(query, cb);
};

var op2 = function(cb) {
  // This query is fine
  con.query('SELECT 1 + 1', cb);
};

var ops = [op1, op2];

async.parallel(ops, function(err, results) {
  if (err) return log("Something went wrong in one of our ops. Err: %s", err);

  // Otherwise, process results
});

async.parallel will execute both op1 and op2 in parallel but if either or both fail it will invoke our completion callback with the error that occurred first.

Standard callbacks are all well and good when we're following Node's convention, but it's a little bit laborious to check the result of every operation, and this can quickly get messy when there are many nested callbacks each with their own error handling code.


Promises are an alternative to callbacks for asynchronous control flow. They are viewed as a solution to the "pyramid of doom" indentation caused by nested callbacks, but they also have some useful error handling features.

Q is a popular module to get you working with promises. In its README, Q describes the concept of promises:

If a function cannot return a value or throw an exception without blocking, it can return a promise instead. A promise is an object that represents the return value or the thrown exception that the function may eventually provide.

Promises allow us to chain operations together in a sequential manner, without the need for nesting. They also neatly encapsulate any results and errors raised within the chain. For example:

var Q = require('q');

// chain the promise-returning ops, with a single error handler
Q.fcall(op1)
.then(op2)
.catch(function(err) {
  // An exception was thrown in any of the ops
  log("Something went wrong");
});

Compare this to the callback-based equivalent:

op1(function(err) {
  if (err) {
    return log('Something went wrong');
  }

  op2(function(err) {
    if (err) {
      return log('Something else went wrong');
    }

    // Otherwise, process results
  });
});

✔ Clear & compact

The promises method is much more compact, clearer and quicker to write. If an error or exception occurs within any of the ops it is handled by the single .catch() handler. Having this single place to handle all errors means you don't need to write error checking for each stage of the work.

✔ Exception handling

Additionally, the promise chain has more robust protection against exceptions and runtime errors that could be thrown within operations. If an exception is thrown, it will be caught by your .catch() handler, or any intermediary error handlers passed to each .then step. In contrast, the callback method would crash the node process because it doesn't encapsulate exceptions thrown in I/O callbacks. Catching exceptions like this allows you to gracefully handle the error in an appropriate way instead of crashing the process straight away.

✔ Better stack traces

Furthermore, you can use Q's long stack support to get more helpful stack traces that keep track of the call stack across asynchronous operations.

✘ Compatibility

One slight disadvantage of promises is that in order to use them, you need to make any normal node callback-style code compatible with promise flow control. This usually involves passing the functions through an adapter to make them compatible with promises, such as Q's Q.denodeify(fn).

Error events

We've spoken a lot about I/O errors during asynchronous flow control, but Node.js has another way of running handlers asynchronously: events.

In Node, an object can be made into an event emitter by inheriting from EventEmitter on its prototype chain. All core node modules that emit events, such as net.Socket or http.Server, inherit from EventEmitter.

The special 'error' event

When an event emitter encounters an error (e.g. a TCP socket abruptly disconnects), it will emit a special 'error' event. The 'error' event is special in Node because if there are no listeners on the event emitter for this event then the event emitter will throw an exception and this will crash the process.

Handling uncaught exceptions

You might be tempted to prevent exceptions being thrown by binding a listener to the 'error' event and logging it instead of crashing. This is potentially dangerous, because you usually can't guarantee exactly where the error originated from and what all the consequences of it are. Usually the best thing to do is catch the error, log it, close any existing connections and gracefully restart your app.

Do not use process.on('uncaughtException')

process.on('uncaughtException') was added to node for the purpose of catching errors and doing cleanup before the node process exits. Beware! This has quickly become an anti-pattern in Node. It loses the context of the exception, and is prone to hanging if your event handler doesn't call process.exit().


Domains were created as a more official way of capturing uncaught exceptions or stray 'error' events before they get to crash the process.

While the domains API is still in an unstable state and is subject to some criticism, it is still better than using process.on('uncaughtException'). Read the docs for usage info.

Further reading: the Q module README and the domains documentation.






Admin and User Account Systems in Node.js

Saying goodbye to PHP

Admin Systems in Node.js

At GoSquared, we recently migrated our internal admin and user account systems from PHP to Node.js. It brought up some interesting challenges, very different to those that we're usually faced with, like optimising Node.js for heavy workloads.

Due to its asynchronous nature, Node.js isn't the usual choice for user management and admin applications – but it is possible to build a stable, manageable and flexible system.


Using classes is the nicest way to provide an accessible API for your other apps. Everybody will have a User class, and all classes should inherit from a Base class which provides common methods, such as setAttribute, getAttribute, insert, delete – whatever your system requires.

var util = require('util');
// assume the Base class has common methods already written
var Base = require('./Base');

var User = module.exports = function(id) {
  var self = this;
  self.name = 'User';
  Base.apply(this, arguments);
  self.table = 'users';
};

util.inherits(User, Base);

User.prototype.create = function(details, cb) {
  var self = this;

  self.insert(details, function(err) {
    if (err) return cb(err);

    // other user creation methods, such as subscribing
    // to a mailing list etc. etc.
    cb(null);
  });
};

Note: with ES6 you can use the new class syntax if you prefer


To avoid getting into callback hell, async is an essential node module. Although async can help avoid issues, some deep callback nesting is unavoidable and should be accepted – so long as larger functions are split into smaller methods. Using smaller methods and then connecting them together with async is easy and can lead to very clean and readable code. A great benefit of writing asynchronous code is that tasks can be done in parallel, making functions significantly faster in places – we were able to make a few functions over 20x faster after the migration.

Assuming the methods getAttributes and getSubscription already exist…

The Wrong Way

User.prototype.getDetails = function(cb) {
  var self = this;

  self.getAttributes(['name', 'email'], function(err, attributes) {
    if (err) return cb(err);

    self.getSubscription(function(err, subscription) {
      if (err) return cb(err);

      attributes.subscription = subscription;
      cb(null, attributes);
    });
  });
};

The Right Way (cleaner and faster)

User.prototype.getDetails = function(cb) {
  var self = this;

  async.parallel([
    self.getAttributes.bind(self, ['name', 'email']),
    self.getSubscription.bind(self)
  ], function(err, res) {
    if (err) return cb(err);

    var attributes = res[0];
    attributes.subscription = res[1];
    cb(null, attributes);
  });
};

Cron Jobs

Most, if not all, admin systems will require tasks to be run on a regular schedule. We use node-cron to avoid having to maintain crontabs.

As these systems are designed to be distributed across any number of servers, we use locking (via redis) to ensure the cronjobs are only run once. Larger jobs that, for example, loop over every user, are split into individual QP jobs and processed evenly by all servers.

Usage by other applications

All production applications at GoSquared are written in Node.js, which is great for maintainability. In the past, all admin functions were called via a secure HTTP API which had additional overhead and wasn't terribly flexible – so much so that in places raw MySQL queries were being used rather than the API.

The solution is to use the admin system as a node module, which exports the classes and utilities/handlers. This can then be required by any application and the methods can be called directly – making everything quicker and significantly more versatile. Create an index.js (or equivalent) file that can export what's needed…

var fs = require('fs');
var Admin = module.exports = {};
var classes = fs.readdirSync(__dirname + '/src/classes').sort();

classes.forEach(function(c) {
  if (c.substr(-3) !== '.js') return;

  // take off the '.js' extension
  var name = c.slice(0, -3);

  // don't export the Base class
  if (name === 'Base') return;

  // define as a getter so classes are only loaded when first used
  Object.defineProperty(Admin, name, {
    get: function() {
      return require(__dirname + '/src/classes/' + name + '.js');
    },
    enumerable: true
  });
});
Then, in your applications, it will need to be added as a dependency (use git tags for versioning) and can be used like so…

// pretend we're serving an HTTP request to get a user's details
var admin = require('admin');
var User = admin.User;

// app is already defined elsewhere (express)
app.get('/user/:id/details', function(req, res) {
  var user = new User(req.param('id'));
  user.getDetails(function(err, attributes) {
    if (err) return res.send(500);
    res.json(attributes);
  });
});


Although building an admin system in Node.js may not be the obvious choice, it comes with many advantages and has sped up development here at GoSquared. Our applications run faster and are much more flexible, allowing us to optimise them and ultimately provide a better user experience.

Questions? Comments? Let us know via the usual channels – Twitter (@GoSquared), Facebook or email.

Want to work with us and make our user account systems even better? We're hiring a full-stack engineer – get in touch.

Infinite DNS queries for free with Route 53

Serving billions of requests on a budget

Globally, our service handles billions of requests per month. This amounts to hundreds of millions of queries to our authoritative DNS servers, which are run by AWS' Route 53.

Some dedicated DNS providers quote upwards of thousands of dollars for this level of DNS traffic.

How much do all those queries cost us per month? Hundreds? Thousands? More?

Nope. Under $10.

Queries are free with ELB + Route 53

That's right. One amazing advantage of Route 53 is that all DNS queries to a record which is mapped to an AWS alias are free.

An AWS alias can be an Elastic Load Balancer (ELB), a CloudFront distribution, an S3 bucket, or another Route 53 record set within the same hosted zone. So effectively you can easily set up your DNS records such that the vast majority of your DNS queries are hitting records mapped to aliases and get all those queries for free.
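For reference, an alias record is created with an AliasTarget in the change batch passed to aws route53 change-resource-record-sets — something like the following, where the hosted zone ID and DNS name are placeholders, not real values:

```
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.example.com.",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "ZELBHOSTEDZONEID",
        "DNSName": "my-app-123456.us-east-1.elb.amazonaws.com.",
        "EvaluateTargetHealth": false
      }
    }
  }]
}
```

The HostedZoneId here is the ELB's own zone ID (AWS publishes these per region), not the ID of your Route 53 hosted zone.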

Better still, there is no limit!

Rock on, AWS.
