
Friday, March 28, 2014

On benchmarks


Numbers every programmer should know and their impact on benchmarks

Disclaimer: I don't mean to be picking on the particular organizations / projects / people who I'll mention below. They are just examples of a larger trend I observed.

Sometimes (most of the time?) we forget just how powerful the machines in our pockets / bags / desks are and accept the inefficiencies of the software running on them. When we start to celebrate those inefficiencies, though, a line has to be drawn. Two examples:

In 2013 Twitter claimed a record Tweets Per Second (TPS - cute :-)) of ~143k. Let's round that up to 150k and do some back-of-the-envelope calculations:

  • Communication between the clients and Twitter: a tweet is 140 bytes (240 if we allow for Unicode). Let's multiply the 150k number by 10 (just to be generous - remember that 143k was already a big blip): 1.5 million tweets/sec at 240 bytes each gives a bandwidth requirement of 343 MB/sec. Because tweets are presumably going over TCP, and ~20% of a TCP connection is overhead, you would need 428 MB/s of bandwidth: about 3.5 gigabit, or less than half of a 10 gigabit connection.
  • On the backend: let's assume we want triple redundancy (1 master + 2 replicas) and that the average tweet goes out to 9 subscribers. This means that internally we need to write each tweet 30 times (we assume a completely denormalized structure: 9 subscriber timelines plus the user's own timeline, all written three times for redundancy). This means 10 GB/sec of data (13 if we're sending it over the network using TCP).
  • Thus ~100 servers would be able to easily handle the load. And remember, this is 10x the peak traffic they experienced.

So why do they have 20 to 40 times that many servers? This means that less than 10% (!) of their server capacity is actually used for business functions.
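
The arithmetic above can be reproduced with a few lines of Java. This is only a sketch of the estimate, not a model of Twitter's systems: the class and method names are made up for illustration, and the inputs (240 bytes per tweet, a 10x safety factor, 30 writes per tweet, ~20% TCP overhead) are the assumptions from the list above. The 343 figure matches when "MB" is read as 2^20 bytes.

```java
class TweetEnvelope {
    static final double MIB = 1024.0 * 1024.0;

    /** Payload rate in MiB/s for a message rate and a message size in bytes. */
    static double payloadMiBPerSec(double messagesPerSec, int bytesPerMessage) {
        return messagesPerSec * bytesPerMessage / MIB;
    }

    /** Rate needed on the wire when the given fraction of the link is protocol overhead. */
    static double onTheWire(double payload, double overheadFraction) {
        return payload / (1.0 - overheadFraction);
    }

    public static void main(String[] args) {
        double tweetsPerSec = 150_000 * 10;   // 10x the rounded-up record
        double ingest = payloadMiBPerSec(tweetsPerSec, 240);
        double backend = ingest * 30;         // (9 subscribers + own timeline) x 3 copies

        System.out.printf("client traffic: %.0f MiB/s payload, %.0f MiB/s over TCP%n",
                ingest, onTheWire(ingest, 0.20));
        System.out.printf("backend writes: %.1f GiB/s payload, %.1f GiB/s over TCP%n",
                backend / 1024, onTheWire(backend, 0.20) / 1024);
    }
}
```

Changing any of the inputs (say, a different fan-out than 9 subscribers) shows how sensitive the server estimate is to each assumption.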

Second example: Google and DataStax came out with a blog post about benchmarking a 300 node Cassandra cluster on Google Compute Engine. They claim a peak of 1.2M messages per second. Again, let's do some calculations:

  • The messages were 170 bytes in size. They were written to 2+1 nodes which would mean ~600 MB/s of traffic (730 MB/s if over the network using TCP).
  • They used 300 servers but were also testing the resiliency by removing 1/3 of the nodes, so let's be generous and say that the volume was divided over 100 servers.

This means that per server we use 7.3 MB/s of network traffic and 6 MB/s of disk traffic, or 6% of a Gigabit connection and about 50% of a medium quality spinning rust HDD.
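
The same kind of check can be written down for the per-server numbers. Again this is a sketch under the assumptions stated above (170-byte messages, 3 copies, 100 servers, ~20% TCP overhead); the class and method names are hypothetical, and "MB" is read as 2^20 bytes. Disk utilization is left out because it depends on what throughput you assume for the drive.

```java
class ClusterUtilization {
    static final double MIB = 1024.0 * 1024.0;

    /** Total write volume in MiB/s across the cluster, counting every copy. */
    static double totalMiBPerSec(double messagesPerSec, int bytesPerMessage, int copies) {
        return messagesPerSec * bytesPerMessage * copies / MIB;
    }

    /** Percentage of a capacity consumed by a load (both in the same unit). */
    static double percentUsed(double load, double capacity) {
        return 100.0 * load / capacity;
    }

    public static void main(String[] args) {
        int servers = 100;
        double diskPerServer = totalMiBPerSec(1_200_000, 170, 3) / servers;
        double networkPerServer = diskPerServer / 0.8;   // ~20% TCP overhead
        double gigabit = 1_000_000_000 / 8.0 / MIB;      // 1 Gbit/s expressed in MiB/s

        System.out.printf("disk: %.1f MiB/s per server%n", diskPerServer);
        System.out.printf("network: %.1f MiB/s per server (%.0f%% of a Gigabit link)%n",
                networkPerServer, percentUsed(networkPerServer, gigabit));
    }
}
```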

My challenge to you is: next time you see such benchmarks, do a quick back-of-the-envelope calculation and, if it uses less than 60% of the available throughput, call the people on it!

Wednesday, February 05, 2014

Proxying pypi / npm / etc for fun and profit!


Package managers for source code (like pypi, npm, nuget, maven, gems, etc) are great! We should all use them. But what happens if the central repository goes down? Suddenly all your continuous builds / deploys fail through no fault of your own. Here is a way to prevent that:

Configure Apache as a caching proxy fronting these services. This means that you can tolerate downtime of the services and you get quicker builds (since you don't need to contact remote servers). It also has a security benefit (you can firewall off your build server such that it can't make any outgoing connections) and it's nice to avoid consuming the bandwidth of those registries (especially since they are provided for free).

Without further ado, here are the config bits for Apache 2.4:

/etc/apache2/force_cache_proxy.conf - the general configuration file for caching:

# Security - we don't want to act as a proxy to arbitrary hosts
ProxyRequests Off
SSLProxyEngine On
# Cache files to disk
CacheEnable disk /
CacheMinFileSize 0
# cache up to 100MB
CacheMaxFileSize 104857600
# Expire cache in one day
CacheMinExpire 86400
CacheDefaultExpire 86400
# Try really hard to cache requests
CacheIgnoreCacheControl On
CacheIgnoreNoLastMod On
CacheStoreExpired On
CacheStoreNoStore On
CacheStorePrivate On
# If remote can't be reached, reply from cache
CacheStaleOnError On
# Provide information about cache in reply headers
CacheDetailHeader On
CacheHeader On
# Only allow requests from localhost
<Location />
        Require local
</Location>
<Proxy *>
        # Don't send X-Forwarded-* headers - don't leak local hosts
        # And some servers get confused by them
        ProxyAddHeaders Off
</Proxy>

# Small timeout to avoid blocking the build too long
ProxyTimeout    5

Now with this prepared we can create the individual configurations for the services we wish to proxy:

For pypi:

# pypi mirror

        Include force_cache_proxy.conf

        ProxyPass         / status=I
        ProxyPassReverse  /

For npm:

# npm mirror

        Include force_cache_proxy.conf

        ProxyPass         / status=I
        ProxyPassReverse  /

After configuration you need to enable the site (a2ensite) as well as the needed modules (a2enmod - ssl, cache, cache_disk, proxy, proxy_http).

Finally you need to configure your package manager clients to use these endpoints:

For npm you need to edit ~/.npmrc (or use npm config set) and add registry =

For Python / pip you need to edit ~/.pip/pip.conf (I recommend having download-cache as per Stavros's post):

download-cache = ~/.cache/pip/
index-url =

If you use setuptools (why!? just stop and use pip :-)), your config is ~/.pydistutils.cfg:

index_url =

Also, if you use buildout, the needed config adjustment in buildout.cfg is:

index =

This is mostly it. If your client is using any kind of local caching, you should clear your cache and reinstall all the dependencies to ensure that Apache has them cached on the disk. There are also dedicated solutions for caching the repositories (for example devpi for python and npm-lazy-mirror for node), however I found them somewhat unreliable and with Apache you have a uniform solution which already has things like startup / supervision implemented and which is familiar to most sysadmins.

Wednesday, December 04, 2013

Programming advent calendars for 2013


Programming advent calendars are posts/articles on a particular topic posted daily between the 1st and 24th of December. They are modeled on the advent calendars received by children in some countries, which contain 24 doors for the 24 days of advent; behind each door is a piece of chocolate or another surprise which the child gets on that particular day.

Here is the list of programming related advent calendars for 2013 (if you know of more, leave a comment and I'll update the list):

  • 24 ways - "is the advent calendar for web geeks. Each day throughout December we publish a daily dose of web design and development goodness to bring you all a little Christmas cheer"
  • The Perl Advent Calendar 2013
  • Perl 6 Advent Calendar (if you didn't hear - Perl 6 is a completely different language from Perl 5!)
  • SysAdvent (not only) for sysadmins
  • Performance Calendar Learn all about web performance. (Primarily focused on frontend technologies.)
  • UXmas UX is for everyone. Brush up on the topic this December.
  • Go Advent - the go(lang) advent calendar
  • 24 Pull Requests - "a yearly initiative to encourage developers around the world to send a pull request every day in December up to Xmas"

Tuesday, November 12, 2013

Cleaning up Google AppEngine Mapreduce Jobs


Do you use the Google MapReduce library on AppEngine? And do you have a lot of completed tasks which clutter your dashboard? Use the JS below by pasting it into your developer console to clean them up! (use it at your own risk, no warranty is provided :-))

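// Click the first "Cleanup" link, or move to the next page if there is none.
schedule = function() {
  window.setTimeout(function() {
    var c = $('a:contains(Cleanup)').first();
    if (c.length > 0) {
      c.click();
    } else {
      $('a:contains(Next page)').click();
      schedule();
    }
  }, 300);
  return true;
};
// Auto-accept the confirmation dialog and schedule the next round.
window.confirm = schedule;
schedule();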

Wednesday, August 14, 2013

Capturing your screen on Ubuntu - with sound


Today I have a short script which I cobbled together from Google searches to do screen captures / screencasts with Ubuntu (including audio input, so that you can narrate what is going on):

Xaxis=$(xrandr -q | grep '*' | uniq | awk '{print $1}' |  cut -d 'x' -f1)
Yaxis=$(xrandr -q | grep '*' | uniq | awk '{print $1}' |  cut -d 'x' -f2)
avconv -f alsa -i pulse -f x11grab -s $(($Xaxis))x$(($Yaxis)) -i $DISPLAY.0 -r 15 -c:v libx264 -crf 0 -c:a libvo_aacenc -b:a 256k -threads 8 ~/Videos/output.mp4

I found this to work much better than gtk-recordmydesktop, which had lags, especially when "effects" were being drawn on the screen (like bullet points sliding in for a presentation or switching between desktops).

Friday, June 28, 2013

Tips for running SonarQube on large / legacy codebases


Crossposted from the Transylvania JUG website.

SonarQube (previously Sonar) is a quality management platform aimed mainly at Java (although other programming languages are supported to a varying degree). Here are a couple of tips to get it working on legacy projects:

  • There is an Ant runner and a standalone runner, so it is not mandatory to use Maven (although it is a good idea in general to use it)
  • Look into the analysis parameters to customize it for your code.
  • Give it space and time :-). For reference a ~2 million LOC Java project took 77 minutes to be analyzed on my laptop (an Intel i7) with 4G heap.
  • To avoid having a ton of problems reported and to focus only on new problems, look into the Cutoff plugin
  • Test and coverage reports can be reused, no need to run them twice (once for the CI system and then for SonarQube). Look into reusing existing reports. Also, make sure to use the latest version of JaCoCo when generating profile data.
  • Configure your sonar.exclusions property to ignore code you aren't interested in
  • Raise your sonar.findbugs.timeout property (the default of 5 minutes can be low for large projects)
  • Consider disabling source code related plugins (sonar.scm.enabled, sonar.scm-stats.enabled) if the provider for your SCM has an issue (HG, for example, currently has an issue with usernames containing spaces)

Keep your code clean!

Sunday, June 23, 2013

Nested fluent builders


Crossposted from the Transylvania JUG website.

Builders have become commonplace in current Java code. They have the effect of transforming the following code:

new Foo(1, 5, "abc", false);

Into something like
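
A minimal sketch of such a call and of the builder behind it. The field names size, name and enabled are assumptions made for illustration, as is the FlatBuilderDemo class; they stand in for whatever the four constructor arguments actually mean.

```java
class FlatBuilderDemo {
    public static void main(String[] args) {
        // Each argument is now labeled at the call site.
        Foo foo = Foo.builder()
            .count(1)
            .size(5)
            .name("abc")
            .enabled(false)
            .build();
        System.out.println(foo.count + " " + foo.size + " " + foo.name + " " + foo.enabled);
    }
}

final class Foo {
    final int count;
    final int size;
    final String name;
    final boolean enabled;

    Foo(int count, int size, String name, boolean enabled) {
        this.count = count;
        this.size = size;
        this.name = name;
        this.enabled = enabled;
    }

    static FooBuilder builder() { return new FooBuilder(); }
}

final class FooBuilder {
    private int count = 1;
    private int size;
    private String name = "";
    private boolean enabled;

    // Every "setter" returns the builder itself so that calls can be chained.
    FooBuilder count(int count) { this.count = count; return this; }
    FooBuilder size(int size) { this.size = size; return this; }
    FooBuilder name(String name) { this.name = name; return this; }
    FooBuilder enabled(boolean enabled) { this.enabled = enabled; return this; }

    Foo build() { return new Foo(count, size, name, enabled); }
}
```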


This has the advantage of being much easier to understand (as a downside we can mention the fact that - depending on the implementation - it can result in the creation of an additional object). The implementation of such builders is very simple - they are a list of "setters" which return the current object:

public final class FooBuilder {
  private int count = 1;
  // ...

  public FooBuilder count(int count) {
    this.count = count;
    return this;
  }

  public Foo build() {
    return new Foo(count /* , ... */);
  }
}
Of course writing even this code can become repetitive and annoying, in which case we can use Lombok or other code generation tools. Another possible improvement - which makes builders more useful for testing - is to add methods like random, as suggested in this Java Advent Calendar article. We can subclass the builder (into FooTestBuilder for example) and only use the "extended" version in testing.

What can we do, however, if our objects are more complex (they have non-primitive fields)? One approach may look like this:
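
As a sketch, assuming hypothetical Bar and Buzz classes which each have their own builder (all field names here are illustrative, not taken from the original code): every nested object is built with its own build call and then handed to the outer builder.

```java
class ExplicitNestingDemo {
    public static void main(String[] args) {
        Foo foo = Foo.builder()
            .a(1)
            .bar(Bar.builder().c(2).fizz("abc").build())
            .buzz(Buzz.builder().level(3).build())
            .build();
        System.out.println(foo.a + " " + foo.bar.fizz + " " + foo.buzz.level);
    }
}

final class Bar {
    final int c;
    final String fizz;
    Bar(int c, String fizz) { this.c = c; this.fizz = fizz; }
    static BarBuilder builder() { return new BarBuilder(); }
}

final class BarBuilder {
    private int c;
    private String fizz = "";
    BarBuilder c(int c) { this.c = c; return this; }
    BarBuilder fizz(String fizz) { this.fizz = fizz; return this; }
    Bar build() { return new Bar(c, fizz); }
}

final class Buzz {
    final int level;
    Buzz(int level) { this.level = level; }
    static BuzzBuilder builder() { return new BuzzBuilder(); }
}

final class BuzzBuilder {
    private int level;
    BuzzBuilder level(int level) { this.level = level; return this; }
    Buzz build() { return new Buzz(level); }
}

final class Foo {
    final int a;
    final Bar bar;
    final Buzz buzz;
    Foo(int a, Bar bar, Buzz buzz) { this.a = a; this.bar = bar; this.buzz = buzz; }
    static FooBuilder builder() { return new FooBuilder(); }
}

final class FooBuilder {
    private int a;
    private Bar bar;
    private Buzz buzz;
    FooBuilder a(int a) { this.a = a; return this; }
    // The nested objects arrive fully built.
    FooBuilder bar(Bar bar) { this.bar = bar; return this; }
    FooBuilder buzz(Buzz buzz) { this.buzz = buzz; return this; }
    Foo build() { return new Foo(a, bar, buzz); }
}
```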


We can make this a little nicer by overloading the bar / buzz methods to accept instances of BarBuilder / BuzzBuilder, in which case we can omit two build calls. Still, I longed for something like the following:


The idea is that the bar / buzz calls start a new "context" where we initialize the Bar / Buzz classes. "build" calls end the innermost context, with the last build returning the initialized Foo object itself. How can this be written in a typesafe / compiler-verifiable way?

My solution is the following:

  • Each builder is parameterized to return an arbitrary type T from its build method
  • The actual return value is generated from a Sink of T
  • When using the builder at the top level, we use an IdentitySink which just returns the passed in value.
  • When using the builder in a nested context, we use a Sink which stores the value and returns the builder from "one level up".

Some example code to clarify the explanation from above can be found below. Note that this code has been written as an example and could be optimized (like using a single instance of the IdentitySink, having FooBuilder itself implement the sink methods, etc).

Implementation of a leaf-level builder:

interface Sink<T> {
  T setBar(Bar bar);
}

final class Bar {
  // ...
  public static BarBuilder<Bar> builder() {
    return new BarBuilder<Bar>(new Sink<Bar>() {
      public Bar setBar(Bar bar) { return bar; }
    });
  }
}

class BarBuilder<T> {
  // ...

  protected BarBuilder(Sink<T> sink) {
    this.sink = sink;
  }

  // ...

  public T build() {
    return sink.setBar(new Bar(c, d, fizz));
  }
}

Implementation of the root level builder:

class FooBuilder {
  // ...
  public BarBuilder<FooBuilder> setBar() {
    return new BarBuilder<FooBuilder>(new Sink<FooBuilder>() {
      public FooBuilder setBar(Bar bar) {
        FooBuilder.this.bar = bar;
        return FooBuilder.this;
      }
    });
  }

  // ...
}
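
Putting the pieces together, a self-contained sketch of the nested builder could look like the following. The fields a, c and fizz, the method name bar() (the fragment above calls it setBar()) and the demo class are assumptions made for illustration; only the Sink idea itself comes from the description above.

```java
class NestedBuilderDemo {
    public static void main(String[] args) {
        Foo foo = Foo.builder()
            .a(1)
            .bar()
                .c(2)
                .fizz("abc")
                .build()   // ends the Bar context, we are back in FooBuilder
            .build();      // ends the Foo context
        System.out.println(foo.a + " " + foo.bar.c + " " + foo.bar.fizz);
    }
}

interface Sink<T> {
    T setBar(Bar bar);
}

final class Bar {
    final int c;
    final String fizz;
    Bar(int c, String fizz) { this.c = c; this.fizz = fizz; }

    // Top-level use: the identity sink hands the finished Bar straight back.
    static BarBuilder<Bar> builder() {
        return new BarBuilder<Bar>(new Sink<Bar>() {
            public Bar setBar(Bar bar) { return bar; }
        });
    }
}

class BarBuilder<T> {
    private final Sink<T> sink;
    private int c;
    private String fizz = "";

    protected BarBuilder(Sink<T> sink) { this.sink = sink; }

    BarBuilder<T> c(int c) { this.c = c; return this; }
    BarBuilder<T> fizz(String fizz) { this.fizz = fizz; return this; }

    // What build() returns is decided by the sink, not by the builder.
    T build() { return sink.setBar(new Bar(c, fizz)); }
}

final class Foo {
    final int a;
    final Bar bar;
    Foo(int a, Bar bar) { this.a = a; this.bar = bar; }
    static FooBuilder builder() { return new FooBuilder(); }
}

class FooBuilder {
    private int a;
    private Bar bar;

    FooBuilder a(int a) { this.a = a; return this; }

    // Opens a nested context; its build() stores the Bar and returns this builder.
    BarBuilder<FooBuilder> bar() {
        return new BarBuilder<FooBuilder>(new Sink<FooBuilder>() {
            public FooBuilder setBar(Bar bar) {
                FooBuilder.this.bar = bar;
                return FooBuilder.this;
            }
        });
    }

    Foo build() { return new Foo(a, bar); }
}
```

Because the return type of the nested build() is the type parameter T, the compiler knows that the first build() yields a FooBuilder and the second one a Foo; calling a Bar "setter" after the inner build() does not compile.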

Conclusion: Java has some missing features (like named parameters or the ease of reuse provided by duck-typing). However, we can work around them nicely with some carefully crafted code (and we can put repeating code into code generators to avoid having to write it over and over again). In exchange we get a very versatile and well-performing cross-platform runtime.