AmvTek blog

Extending coverage of the Python serializers benchmark

2014-08-01T00:00:00+03:00

Our previous attempt to compare the performance of the Python implementations of protocol buffers and thrift serialization generated interesting feedback and suggestions. The main request we received was to broaden the coverage of the benchmark so that it spans the full range of available options…

We are not there yet, but the benchmark now allows comparing 5 different frameworks:

protocol buffers

thrift

Cap'n Proto (using the pycapnp package)

json (standard library package)

msgpack

Comparing apples with oranges

Following our previous post, several people suggested that we have a look at the Cap'n Proto (capnp) serialization system, which is very similar in principle to the thrift and protocol buffers systems we compared earlier. This similarity allowed us to add it to the previous benchmark in no time, and at first capnp performance looked astonishingly good.

We refrained from publishing these results however, as something did not look right. If we were to believe what was reported, capnp deserialization time did not depend on the size or type of the messages being processed. Assuming we had misunderstood how to use the pycapnp extension package we were relying on, we contacted Jason Paryani, who supports it, to ask if he could suggest anything.

Jason explained that our results did not surprise him: with capnp, real deserialization takes place only when the inner content of a message is accessed. Jason also observed that our previous approach to timing serialization was probably favoring capnp unrealistically, as part of the serialization happens when the message content is set.

In short, to allow a fair comparison between the different frameworks we wanted to cover, and not be fooled by implementation choices made by library developers, Jason advised us to revise our benchmarking approach by replacing:

serialize by construct & serialize

deserialize by deserialize & traverse

The new approach is probably not a very good one if one wants to establish the absolute performance of a single framework. For example, the full traversal requirement unnecessarily penalizes the deserialization figures of the json and msgpack libraries, which deliver fully deserialized dictionaries in one shot.

We believe however that by having each benchmark perform the same duties, we make meaningful comparisons possible. We invite any interested individual to review the current benchmark and let us know what could be done to improve the fairness of the comparisons we are trying to make.
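To make the two combined measurements concrete, here is a minimal sketch of the revised approach applied to the standard library json module. It targets Python 3; the message layout and the helper names (construct, traverse, bench) are assumptions made for illustration and are not taken from the benchmark project.

```python
import json
import timeit

def construct(n):
    "build the message from scratch, so that construction cost is timed too"
    return {"id": n, "name": "item-%d" % n, "values": list(range(n))}

def traverse(obj):
    "touch every leaf, so that lazily decoding libraries do the real work"
    if isinstance(obj, dict):
        return sum(traverse(v) for v in obj.values())
    if isinstance(obj, list):
        return sum(traverse(v) for v in obj)
    return 1

def construct_and_serialize(n):
    return json.dumps(construct(n))

def deserialize_and_traverse(payload):
    return traverse(json.loads(payload))

def bench(n=100, repeat=1000):
    "return average seconds for (construct & serialize, deserialize & traverse)"
    payload = construct_and_serialize(n)
    t_ser = timeit.timeit(lambda: construct_and_serialize(n), number=repeat)
    t_des = timeit.timeit(lambda: deserialize_and_traverse(payload), number=repeat)
    return t_ser / repeat, t_des / repeat
```

Calling bench() returns the average time of one construct & serialize and of one deserialize & traverse, which is the pair of figures each framework would be compared on.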

Results Overview

You may run the benchmarks on your side and send us the results for publication on the GitHub project. The machines we are relying on are low end ones… :)

linux 32 bits results

linux 64 bits results

The two front-runners are thrift and the new entrant msgpack. There is no point in trying to rank those 2 winners against each other, as they are not playing in the same category (schema versus schemaless systems…).

Comparing Python performance of Protobuf/Thrift serialization…

2014-07-12T00:00:00+03:00

When two software processes need to exchange data, some sort of protocol is necessary to define how to encode and decode the data to be transported. A large number of serialization formats are available (json, xml, ASN.1…) to tackle the cross process/cross programming language data encoding problem, and Python provides a large number of libraries to leverage them…

What distinguishes the solution provided by protocol buffers or thrift is the need to describe the data to be exchanged in a central schema file written in an easy to read interface definition language (IDL). This schema file is then compiled to provide a data representation for a given programming language. Not all problems will benefit from this approach, but when developing services that need to be accessed by clients written in different programming languages (e.g. Objective C, Java…), we have found that relying on a well defined schema file saves a lot of time.

The need for benchmarking

We write a large part of our server side code in Python, and some of the projects we support are accessed mainly by native clients over TCP or UDP. Over the years we moved gradually from our home grown custom serialization solution, built on top of the Python struct module, to protocol buffers…

The move to protocol buffers allowed us to cut the development time required to support new types of clients to a minimum. We also realized how valuable relying on a central schema is, in that everybody can visualize what the data are.

One thing we miss from the previous home grown solution, however, is its performance, and on the server performance matters tremendously, even more so if you are using an asynchronous networking framework like twisted.

For a long while we reassured ourselves by observing that a Google supported extension module was available, and that deploying it would allow us to accelerate serialization/deserialization by a factor of 10 at least. Deploying this extension module was delayed until the staging phase of development, because it is quite cumbersome to do so: you need to build things from source and manage some environment variables in your server processes to force the use of the implementation it provides.

Once we activated the protobuf extension module on our staging server, we started to observe random crashes of the server processes. It took us time to understand that those crashes were related to the use of this extension module. Well, we should not have underestimated the fact that Google was labelling this extension module as experimental, but here again we assumed that, Google playing in a different category than the rest of us, they were probably referring to some pretty advanced use cases :(

After all those hurdles, we realized that selecting the proper serialization technology for your projects is a decision that should not be taken lightly. Thrift provides an obvious alternative to Google protocol buffers, but how does its Python implementation perform? There exist extensive performance benchmarks of Java serialization frameworks, but we found nothing similar for Python.

The benchmark

We have published on GitHub what we consider to be a good basis for comparing the various serialization frameworks one may want to leverage. The repository for the project can be reached here.

The benchmark for now compares the performance of protobuf and thrift serialization for messages defined in the StuffTotest schema. We welcome suggestions to extend this reference schema so as to explore performance variations in more detail, as well as external help to cover more serialization frameworks…

Preliminary results

We have published on GitHub a result run obtained on a low end development machine. If we consider performance to be the average of the serialization and deserialization times for a given message of the schema, Thrift:

outperforms protocol buffers in 75% of the cases.
is stable.
is much easier to deploy (pip install thrift and you are done…)

So there is currently a clear winner to this benchmark. We will be happy to rerun it to check whether things have changed.

Making good use of random in your python unit tests

2014-06-20T00:00:00+03:00

Writing efficient unit tests to validate that your code performs as expected is a difficult endeavor. A lot has been written about the benefits of Test Driven Development and on how to best approach testing, and a lot can be learned by reading the available literature. One thing however that we don’t often see mentioned is that architecting efficient unit tests is pretty hard, and that no tool or testing framework is of much value without a fair understanding of the code base that needs to be tested.

The techniques we will be briefly introducing now are no different. You may use them to impress your colleagues and show them the TestSuite you have just written that contains millions of tests. Be aware though that increasing the test count of a TestSuite may not be sufficient to meaningfully change code base coverage.

Basic idea

Assume you wish to provide tests for a hypothetical func_to_test that looks like so:

def func_to_test(x, y):
    ...
    return result

To proceed with unit testing func_to_test, our first goal is to generate values that optimally cover the expected domain.

Assume that x and y are float numbers varying within [xmin, xmax] and [ymin, ymax].

You may generate a range of values for calling func_to_test like so :

import math, itertools

def float_range(vmin, vmax, n):
    "yield n regularly spaced values in between vmin and vmax..."

    s = float(vmax-vmin)/n

    v = vmin
    for i in xrange(n):

        yield v
        v += s

def gen_func_to_test_sample(m):
    "yield at least m tuples covering func_to_test domain..."

    # calculate optimal number of values along each axis
    n = int(math.ceil(math.sqrt(m)))

    # define value range for x and y
    rx = float_range(xmin, xmax, n)
    ry = float_range(ymin, ymax, n)

    # yield regularly spaced tuples covering func_to_test domain
    for t in itertools.product(rx, ry):
        yield t

In this simple case, it would be simpler to use 2 nested loops to generate the values covering the func_to_test domain. However, if func_to_test has a large number of axes, itertools.product keeps things manageable.
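For instance, a grid generator for an arbitrary number of axes might look like the sketch below. It targets Python 3 (range instead of xrange), and gen_grid_sample together with its domains argument are hypothetical names, not part of the code above.

```python
import itertools
import math

def float_range(vmin, vmax, n):
    "yield n regularly spaced values starting at vmin (vmax excluded)"
    step = float(vmax - vmin) / n
    for i in range(n):
        yield vmin + i * step

def gen_grid_sample(domains, m):
    "return an iterator over at least m tuples covering the given domains"
    dim = len(domains)
    # number of values along each axis, so that n ** dim >= m
    n = int(math.ceil(m ** (1.0 / dim)))
    while n ** dim < m:  # guard against floating point rounding
        n += 1
    axes = [float_range(vmin, vmax, n) for vmin, vmax in domains]
    return itertools.product(*axes)
```

Adding an axis only means adding one (vmin, vmax) pair to domains, where nested loops would require one more level of nesting.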

The basic idea of randomization consists in covering the problem space with randomly generated values. Randomization has 2 benefits over the previous approach:

The code to generate values over the problem domain is much simpler.
Test values being irregularly spaced, you will not be trapped by a singularity.

To generate a random range of values for calling func_to_test you may proceed like so :

import random

def gen_func_to_test_random_sample(m):

    for i in xrange(m):

        yield random.uniform(xmin,xmax), random.uniform(ymin,ymax)

In case you are not familiar with the standard library random module, we invite you to explore it, as it has a lot of features to help generate objects covering complex domains…

Be repeatable

By now you should have understood the basic idea of test randomization pretty well. What we want is to cover the problem space in an efficient way, minimizing the risks of being trapped by singularities…

There is one big problem though with the approach we have taken: a test suite must be repeatable. Imagine that a developer reports that he has observed a failure of test 100. If test 100 can never be rerun as is, our randomized test suite will generate more confusion than value.

Fortunately, the Mersenne Twister random generator exported by the standard library random module can be initialized so that the same random sequences are generated. Let’s modify our sample generator to make use of this:

from random import Random

def gen_func_to_test_random_sample(seed,m):

    random = Random((seed,m))

    for i in xrange(m):

        yield random.uniform(xmin,xmax), random.uniform(ymin,ymax)

We use a dedicated instance of Random to avoid interfering with other threads which may also need random values at the very moment we are generating the test sequence.

Using the same seed value for each run of the test suite guarantees that the same sample sequence will be generated…
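A quick way to convince yourself is to generate the sequence twice and compare. The sketch below targets Python 3, where recent releases no longer accept a tuple as seed, so it assumes the seed and the sample count are combined into a string instead; gen_random_sample and the domain bounds are illustrative.

```python
from random import Random

XDOMAIN = (0.0, 8.0)  # example (xmin, xmax)
YDOMAIN = (2.0, 6.0)  # example (ymin, ymax)

def gen_random_sample(seed, m):
    "yield m repeatable pseudo random points"
    # a string seed gives the same sequence on every run
    rng = Random("%s:%d" % (seed, m))
    for _ in range(m):
        yield rng.uniform(*XDOMAIN), rng.uniform(*YDOMAIN)

first_run = list(gen_random_sample("my test suite", 100))
second_run = list(gen_random_sample("my test suite", 100))
other_seed = list(gen_random_sample("another suite", 100))

# same seed, same sequence; a different seed gives different points
print(first_run == second_run, first_run == other_seed)
```

A failing point can therefore be regenerated at will from the seed and its index in the sequence.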

TestCase factories

As you have written tests before, by now you should be asking yourself how to use this large sequence of (random) objects which you have been advised to generate.

The obvious approach would be to write a single test method that iterates over the sample sequence and applies the desired assertions to func_to_test results. We advise against doing so, as your test method would need to apply a very large number of assertions and would exit at the first problem encountered, without checking the remaining samples.

Instead you can use a factory function which will take care of generating your TestCase like so :

"Your test module"

import unittest
from random import Random

from somewhere import func_to_test

XDOMAIN = (0.0,8.0) # example (xmin,xmax)
YDOMAIN = (2.0,6.0) # example (ymin,ymax)

def gen_func_to_test_random_sample(seed,m):
    "yield random point over func_to_test domain..."

    random = Random((seed,m))

    for i in xrange(m):

        yield random.uniform(*XDOMAIN), random.uniform(*YDOMAIN)

def build_TestFuncTestCase(seed,m):
    "return TestCase class for func_to_test..."

    # test method factory
    def make_test_method(test_point):
        "return func_to_test test..."

        def a_test(self):

            result = func_to_test(*test_point)

            # all your asserts here, see unittest.TestCase documentation...
            self.assertSomethingOn(result)
            ...

        return a_test

    # fill TestCase dict
    count = 0
    dico = {}
    for pt in gen_func_to_test_random_sample(seed,m):

        testname = "test_func_to_test_%i" % count
        dico[testname] = make_test_method(pt)
        count += 1

    return type("TestFuncTestCase",(unittest.TestCase,),dico)

# this TestCase class will be picked up by Test Runner
# it will contain 1024 tests...
TestFuncTestCase = build_TestFuncTestCase("my test suite",1024)

It is our experience that randomization, when applicable, provides an efficient way forward to unit test your module. This approach can be summarized like this:

Write code that generates a repeatable (pseudo random) sequence of objects over your problem domain.
Use a factory function to generate TestCase subclasses with one test method for each object in your test sequence.
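To see both steps working together, here is a small self-contained sketch for Python 3. It assumes math.hypot plays the role of func_to_test and checks two properties that hold for every point; the names build_hypot_testcase and run_generated_tests are ours.

```python
import io
import math
import unittest
from random import Random

def build_hypot_testcase(seed, m):
    "return a TestCase class holding m generated tests for math.hypot"

    def make_test_method(x, y):
        def a_test(self):
            result = math.hypot(x, y)
            # properties that must hold for every point of the domain
            self.assertGreaterEqual(result, max(abs(x), abs(y)))
            self.assertLessEqual(result, abs(x) + abs(y))
        return a_test

    rng = Random(seed)
    methods = {}
    for count in range(m):
        x, y = rng.uniform(0.0, 8.0), rng.uniform(2.0, 6.0)
        methods["test_hypot_%04i" % count] = make_test_method(x, y)

    return type("TestHypot", (unittest.TestCase,), methods)

def run_generated_tests(seed, m):
    "run the generated tests and return the unittest result object"
    case = build_hypot_testcase(seed, m)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Because each point gets its own test method, a failure is reported under a stable name such as test_hypot_0042 and the remaining points are still checked.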

Accessing multiple postgres schemas from Django

2014-06-13T00:00:00+03:00

One of the postgres features we have been missing the most when working with the Django ORM is direct support for postgres schemas. In the past we tried several ways to explicitly target schemas other than public when creating or accessing the database structures required by our django applications, but those code-based approaches were difficult to maintain.

It appears that this problem can be solved quite elegantly by leveraging postgres search_path parameter.

A simple example

Assume we wish all the tables of our django project to be created in a schema called django, and that our project also requires mapping/accessing a few tables in a schema called legacy. This may be achieved very easily by fine tuning the DATABASES setting.

Let’s see 2 different ways to configure this:

Approach 1, setting search_path at connection time :

We assume that the django and legacy schemas already exist in the target database and that the user we use to access it has the necessary permissions on these schemas.

On django side, we will use a search_path connection option so that we land in the correct schema. Two databases will be configured even though we are connecting to the same database.

# your project settings file

DATABASES = {

    'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'OPTIONS': {
                'options': '-c search_path=django,public'
            },
            'NAME': 'multi_schema_db',
            'USER': 'appuser',
            'PASSWORD': 'secret',
    },

    'legacy': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'OPTIONS': {
                'options': '-c search_path=legacy,public'
            },
            'NAME': 'multi_schema_db',
            'USER': 'appuser',
            'PASSWORD': 'secret',
    },
}

This is a good approach for development as it requires minimum configuration.

If you run syncdb against the default database, all tables for the managed models will get created in the django schema…

Approach 2, configuring various database users :

One drawback of the first approach is that the set search_path command will be sent from client to server each time a new database connection is established.

To save some milliseconds on connection time, one can preassign the desired search_path to the user used for connection…

Preassigning search_path to database user :

As postgres user in psql shell…

-- user accessing django schema...
CREATE USER django_user LOGIN PASSWORD 'secret';
GRANT appuser TO django_user;
ALTER ROLE django_user SET search_path TO django, public;

-- user accessing legacy schema...
CREATE USER legacy_user LOGIN PASSWORD 'secret';
GRANT appuser TO legacy_user;
ALTER ROLE legacy_user SET search_path TO legacy, public;

Defining DATABASES setting :

# your production project settings file

DATABASES = {

    'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'multi_schema_db',
            'USER': 'django_user',
            'PASSWORD': 'secret',
    },

    'legacy': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'multi_schema_db',
            'USER': 'legacy_user',
            'PASSWORD': 'secret',
    },
}

That’s all there is to supporting multiple postgres schemas from Django.

A custom Django database router may also be defined to automatically select the correct schema to use…
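As an illustration, such a router could look like the sketch below. It assumes that the models mapped onto the legacy schema live in an application labelled legacy_app (a hypothetical name) and uses the router method signatures of Django 1.8 and later; older releases differ, so check the documentation of the version you run.

```python
class SchemaRouter(object):
    "route models of the (hypothetical) legacy_app to the 'legacy' alias"

    legacy_apps = {"legacy_app"}  # assumption: adapt to your project

    def _alias_for(self, app_label):
        return "legacy" if app_label in self.legacy_apps else "default"

    def db_for_read(self, model, **hints):
        return self._alias_for(model._meta.app_label)

    def db_for_write(self, model, **hints):
        return self._alias_for(model._meta.app_label)

    def allow_relation(self, obj1, obj2, **hints):
        # only relate objects reached through the same alias (same search_path)
        return (self._alias_for(obj1._meta.app_label)
                == self._alias_for(obj2._meta.app_label))

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # create tables only through the alias that owns the application
        return db == self._alias_for(app_label)
```

The router is activated by listing its dotted path in the DATABASE_ROUTERS setting, for example 'myproject.routers.SchemaRouter' (path assumed).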

Improving EventSource browser support

2014-05-20T00:00:00+03:00

We are happy to open source a polyfill that we are using extensively, which will let you use EventSource in all the browsers that matter today.

In case you have not heard about it, EventSource (aka Server-Sent Events) is a JavaScript API, part of the HTML5 suite, which lets you efficiently and asynchronously stream a large number of event messages across a single HTTP connection.
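On the wire, the event stream is plain text served with the text/event-stream content type. As a sketch of what a server sends, the hypothetical helper below (Python 3, not part of the polyfill) formats one message:

```python
def format_sse(data, event=None, event_id=None):
    "return one text/event-stream message as a string"
    lines = []
    if event_id is not None:
        lines.append("id: %s" % event_id)
    if event is not None:
        lines.append("event: %s" % event)
    # a payload spanning several lines needs one data field per line
    for chunk in str(data).split("\n"):
        lines.append("data: %s" % chunk)
    # an empty line terminates the message
    return "\n".join(lines) + "\n\n"
```

The browser side EventSource object reassembles the data lines and dispatches the message to the listener registered for the given event name.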

We started to consider using EventSource in the context of a large scale realtime field monitoring system, where users may access web pages that let them visualize the evolution of data coming from a large number of sensors. With EventSource we can very cleanly have web browsers updated in real time with events distributed by a publish/subscribe system like the one provided by Redis or RabbitMQ.

As we started experimenting with EventSource, we realized that it could not be used currently in Internet Explorer 8, 9, 10, 11 and most Android browsers. See this report for details.

We tested various polyfills aiming at widening the support of EventSource and, after observing that they did not support some of the browsers we had to target, we decided to build our own.

This project is now available on GitHub and we hope it will help raise awareness about this technology.

Making use of twisted coiterate

2014-05-12T00:00:00+03:00

Twisted provides various ways to integrate CPU bound operations or blocking libraries with the reactor. It provides a very clean integration path for threading or external processes.

In this post, we describe an under-documented alternative, where the long running task is implemented as an iterator that will be consumed directly in the reactor event loop after passing it to coiterate.

Where usable, coiteration makes it possible to avoid threading completely, bypassing the well known python GIL bottleneck…

Basic idea

Coiteration requires developers to use a divide and conquer strategy to plan their task execution. In python, we will code the task using a generator function or alternatively a class implementing the iterator protocol.

The task will be executed step by step in the reactor event loop after passing the iterator that represents it to coiterate.
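The following toy loop conveys the idea without Twisted. It is only an illustration for Python 3 and not how coiterate is implemented: run_cooperatively and count_task are hypothetical names, and the real reactor interleaves the steps with network events rather than with other tasks only.

```python
from collections import deque

def run_cooperatively(iterators):
    "consume all iterators one step at a time, in turn"
    pending = deque(iterators)
    while pending:
        it = pending.popleft()
        try:
            next(it)            # run a single step of this task
        except StopIteration:
            continue            # task is over, drop it
        pending.append(it)      # not finished, schedule its next step

def count_task(name, n, log):
    "a task split into n small steps"
    for i in range(n):
        log.append((name, i))
        yield None              # hand control back to the loop

log = []
run_cooperatively([count_task("a", 2, log), count_task("b", 3, log)])
print(log)
```

The steps of the two tasks end up interleaved in log, which is what keeps a long task from monopolizing the loop.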

Summing integers

Let’s consider a python function which sums the first N integers, N being arbitrarily large.

def sum_all_integers_until(N):

    s = 0
    for i in xrange(N):
        s += i
    return s

For very large values of N, calling such a function from the thread in which the reactor is running is not a good idea, as the event loop will be blocked until the function returns…

Summing using coiteration

A first approach

from twisted.internet import reactor
from twisted.internet.task import coiterate

def make_iterator_to_sum_all_integers_until(N):

    s = 0
    for i in xrange(N):
        s += i
        print "Adding %i to result" % i
        yield None # event loop can look after other things...

    print "result is %i " % s

def sum_all_integers_being_nice_to_reactor(N):

    all_sum_steps = make_iterator_to_sum_all_integers_until(N)
    coiterate(all_sum_steps)

reactor.callLater(0, sum_all_integers_being_nice_to_reactor, 8)
reactor.run()

See gist coiterate01.py

At line 4 a generator function is defined, which returns an iterator that will calculate the sum of all integers up to a certain value N. At line 17, this iterator is passed to coiterate, which results in the iterator being consumed in the reactor event loop in an optimal way.

Note that this does not make your iterator magically non blocking: as everywhere else in Twisted, the developer must ensure that each iteration step is non blocking.

Obtaining the result

If you took the time to run the above sum_all_integer …, you have probably been delighted to see the result being printed in the console.

Retrieving this result to make use of it requires some additional effort, which will be detailed now.

Let’s first observe that coiterate is a well behaved Twisted citizen. As it is starting an operation (consumption of the argument iterator…) that will take some time to complete, it returns a Deferred. As you may expect, this Deferred will fire when iteration is over.

If we attach a callback function to this Deferred we will not receive our result, but the same iterator that coiterate has consumed. Let’s see a possible solution to obtain a result from the iterator.

from twisted.internet import reactor
from twisted.internet.task import coiterate

def make_iterator_to_sum_all_integers_until(N, context):

    s = 0
    for i in xrange(N):
        s += i
        print "Adding %i to result" % i
        yield None # event loop can look after other things...

    context['result'] = s

def sum_all_integers_being_nice_to_reactor(N):
    "return Deferred firing calculated sum..."

    def extract_result_cb(ignored, context):
        "return context['result']"

        rv = context['result']
        print "Got result = %s" % rv
        return rv

    context = {}
    all_sum_steps = make_iterator_to_sum_all_integers_until(N, context)

    deferred = coiterate(all_sum_steps)
    deferred.addCallback(extract_result_cb, context)

    return deferred

reactor.callLater(0, sum_all_integers_being_nice_to_reactor, 8)
reactor.run()

See gist coiterate_02.py

Let’s summarize how this code proceeds :

At line 4, we define a generator function which returns an iterator that lets us execute our task step by step. As we also need a result from this iterator, we pass it an additional context object which provides a way to “return” any result obtained during iteration.

At line 14, we construct a well behaved python function that returns a Deferred that will fire with the result we are awaiting. Internally this function takes care of all the gory details of constructing the iterator that will be passed to coiterate and extracting the result we need.

Waiting for Deferred…

While executing a long running task, it is quite common to have to wait some time until some external operations complete. Twisted lets our coiterable tasks indicate that they should be paused until a certain Deferred fires. To achieve this, the only thing to do is to yield the Deferred of interest out of the task iterator.

Let’s see how we could have our sum_all_integer… wait 1 second in between each iteration step.

from twisted.internet import reactor
from twisted.internet.task import coiterate, deferLater

def make_iterator_to_sum_all_integers_until(N, context):

    def wait_some_time(t):
        "return Deferred firing after t seconds"

        return deferLater(reactor,t,lambda :"I was paused %.02f seconds"%t)

    def print_pause_cb(msg):
        "callback printing result message..."

        print msg

    s = 0
    for i in xrange(N):

        s += i
        print "Adding %i to result" % i

        d = wait_some_time(1.0)
        d.addCallback(print_pause_cb)
        yield d    # we will be paused until d fires...

    context['result'] = s

See gist coiterate_03.py

At line 4 is the modified generator function that will pause some time in between each step. The wait_some_time function at line 6 could be anything that returns a Deferred.

It is our experience that the yield to wait approach which coiterate allows greatly simplifies coding complex tasks with Twisted.

Cancelling coiteration

When requiring clients to wait a long time for the result of a long running operation, we should expect situations where the client gives up. In such situations, we normally want to clean up as soon as possible any resources allocated to serve that client.

Before showing how this can be achieved in the context of this example, let’s mention that if you need to control your task from the outside to pause it or stop it, you should consider using cooperate instead of coiterate. Like coiterate, cooperate must be called with an iterator which will be consumed in the reactor event loop. Unlike coiterate, which returns a Deferred that fires when iteration is completed, cooperate returns a CooperativeTask object that can be used to pause or stop the ongoing task…

from twisted.internet import reactor
from twisted.internet.defer import Deferred, CancelledError
from twisted.internet.task import coiterate, deferLater

def make_iterator_to_sum_all_integers_until(N, context):

    def wait_some_time(t):
        "return Deferred firing after t seconds"

        return deferLater(reactor,t,lambda :"I was paused %.02f seconds"%t)

    def print_pause_cb(msg):
        "callback printing result message..."

        print msg

    d = None
    s = 0

    try:

        for i in xrange(N):

            s += i
            print "Adding %i to result" % i

            d = wait_some_time(1.0)
            d.addCallback(print_pause_cb)
            yield d    # we will be paused until d fires...

        context['result'] = s

    except GeneratorExit:

        print "---"
        print "Early termination..."

        # cancel pending Deferred
        if d and not d.called:
            d.cancel()

def sum_all_integers_being_nice_to_reactor(N):
    "return Deferred firing calculated sum..."

    def extract_result_cb(ignored, context):
        "return context['result']"

        rv = context['result']
        return rv

    def suppress_cancel_log_eb(error):
        "trap CancelledError"

        # this suppresses the UnhandledError warning...
        error.trap(CancelledError)

    context = {}
    all_sum_steps = make_iterator_to_sum_all_integers_until(N, context)

    deferred = Deferred(lambda _:all_sum_steps.close())
    coiterate(all_sum_steps).chainDeferred(deferred)
    deferred.addCallback(extract_result_cb, context)
    deferred.addErrback(suppress_cancel_log_eb)
    return deferred

def main():
    "start summing integers and stop after 3 seconds..."

    def print_result_cb(res):
        "print result if any..."

        if res is not None:
            print "Got result = %s" % res

    # start sum calculation using coiteration...
    d = sum_all_integers_being_nice_to_reactor(8)
    d.addCallback(print_result_cb)

    # schedule cancellation after 3.00 seconds
    reactor.callLater(3.0, d.cancel)


reactor.callLater(0, main)
reactor.run()

See gist coiterate_04.py

At line 5 our generator function was again modified. At line 33, a handler block for the GeneratorExit exception was added. This block is reached when close is called on an iterator returned by our generator function. In this block, we cancel any pending Deferred that the task may be waiting for.

One would expect that cancelling the Deferred returned by coiterate would automatically close the related iterator, but this is not the case. Let’s modify the sum_all_integer… function for this to happen. At line 60, we construct the Deferred that sum_all_integer… will return, providing it with a cancellation function. This function simply closes the iterator returned by the generator function. This helper Deferred is chained to the Deferred that coiterate returns, so that when no cancellation occurs, we get our result…
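The GeneratorExit behaviour the listing relies on can be observed with plain Python, without Twisted. In this Python 3 sketch, make_task is a hypothetical stand-in for our generator function:

```python
def make_task(log):
    "a three step task that records how it ended"
    try:
        for i in range(3):
            log.append("step %i" % i)
            yield None
        log.append("completed")
    except GeneratorExit:
        # reached when close() is called on the suspended generator
        log.append("early termination")

log = []
task = make_task(log)
next(task)    # runs step 0
next(task)    # runs step 1
task.close()  # raises GeneratorExit inside the generator
print(log)
```

After close returns, the generator is finished and any further call to next raises StopIteration, so no later step can run by accident.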