Comparing Python performance of Protobuf/Thrift serialization…

When two software processes need to exchange data, some sort of protocol is required to define how that data is encoded and decoded for transport. Many serialization formats (JSON, XML, ASN.1…) address this cross-process, cross-language encoding problem, and Python provides a large number of libraries to leverage them.

What distinguishes the solutions provided by Protocol Buffers and Thrift is the need to describe the data to be exchanged in a central schema file written in an easy-to-read IDL (interface definition language). This schema file is then compiled to generate the data representation for a given programming language. Not all problems benefit from this approach, but when developing services that need to be accessed by clients written in different programming languages (e.g. Objective-C, Java…), we have found that relying on a well-defined schema file saves a lot of time.

The need for benchmarking

We write a large part of our server-side code in Python, and some of the projects we support are accessed mainly by native clients over TCP or UDP. Over the years we gradually moved from our home-grown serialization solution, built on top of Python's struct module, to Protocol Buffers.
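To give an idea of what a hand-rolled format on top of the struct module can look like, here is a minimal sketch. The wire layout, the field names and the `encode`/`decode` helpers are hypothetical, chosen for illustration only; they are not the format we actually used.

```python
import struct

# Hypothetical wire layout (for illustration only):
#   unsigned short  message type
#   unsigned int    sequence number
#   unsigned short  payload length, followed by the UTF-8 payload
# "!" selects network byte order with no padding between fields.
HEADER = struct.Struct("!HIH")


def encode(msg_type, seq, text):
    """Pack a fixed-size header followed by a variable-length payload."""
    payload = text.encode("utf-8")
    return HEADER.pack(msg_type, seq, len(payload)) + payload


def decode(data):
    """Read the header, then slice out the payload it announces."""
    msg_type, seq, length = HEADER.unpack_from(data, 0)
    start = HEADER.size
    payload = data[start:start + length]
    return msg_type, seq, payload.decode("utf-8")


wire = encode(1, 42, "hello")
print(len(wire), decode(wire))
```

Code like this is fast, but every client language needs its own matching encoder and decoder, which is exactly the maintenance cost a shared schema file removes.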

The move to Protocol Buffers allowed us to cut the development time required to support new types of clients to a minimum. We also realized how valuable it is to rely on a central schema: everybody can see what the data looks like.

One thing we miss from the previous home-grown solution, however, is its performance. On the server, performance matters tremendously, even more so if you are using an asynchronous networking framework like Twisted.

For a long while we reassured ourselves with the observation that a Google-supported extension module was available, and that deploying it would accelerate serialization/deserialization by a factor of 10 at least. We delayed deploying this extension module until the staging phase, because doing so is quite cumbersome: you need to build it from source and manage some environment variables in your server processes to force the use of the implementation it provides.

Once we activated the protobuf extension module on our staging server, we started to observe random crashes of the server processes. It took us time to understand that those crashes were related to the use of the extension module. We should not have underestimated the fact that Google labelled this extension module as experimental, but we assumed that, with Google playing in a different league than the rest of us, the label probably referred to some pretty advanced use cases :(

After all those hurdles, we realized that selecting the proper serialization technology for a project is a decision that should not be taken lightly. Thrift provides an obvious alternative to Google Protocol Buffers, but how does its Python implementation perform? Extensive performance benchmarks exist for Java serialization frameworks, but we found nothing similar for Python.
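The environment-variable part can be sketched as follows. This assumes the variable in question is `PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION` and that `cpp` and `python` are accepted values; both should be checked against the protobuf version in use. The `request_protobuf_backend` helper is hypothetical.

```python
import os

# Assumption: this is the variable the protobuf Python package reads when it
# is first imported; "cpp" requests the C++ extension, "python" the
# pure-Python implementation.
ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"


def request_protobuf_backend(name="cpp"):
    """Record the desired backend; call this before importing google.protobuf."""
    if name not in ("cpp", "python"):
        raise ValueError("unknown protobuf backend: %r" % (name,))
    os.environ[ENV_VAR] = name
    return os.environ[ENV_VAR]


request_protobuf_backend("cpp")
# Only after this point should the server code run:
#   import google.protobuf
```

Setting the variable from inside the process only works if it happens before the first protobuf import, which is why it is often easier to export it in the script that launches the server.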

The benchmark

We have published on GitHub what we consider to be a good basis for comparing the various serialization frameworks one may want to leverage. The repository for the project can be reached here.

For now, the benchmark compares the performance of protobuf and thrift serialization for messages defined in the StuffTotest schema. We welcome suggestions to extend this reference schema so as to explore performance variations in more detail, as well as external help to cover more serialization frameworks…

Preliminary results

We have published on GitHub the results of a run obtained on a low-end development machine. If we take performance to be the average of the serialization and deserialization times for a given message of the schema, Thrift:

outperforms protocol buffers in 75% of the cases.
is stable.
is much easier to deploy (pip install thrift and you are done…)
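The metric used above (the average of serialization and deserialization time for a message) and the share of messages won by one framework can be sketched as follows. This is not the code of the published benchmark: the helper names are hypothetical, and `json` is used only as a stand-in codec so that the sketch runs without protobuf or thrift installed.

```python
import json
import timeit


def average_roundtrip_cost(serialize, deserialize, message, number=1000):
    """Mean of the per-call serialization and deserialization times, in seconds."""
    wire = serialize(message)
    ser = timeit.timeit(lambda: serialize(message), number=number) / number
    de = timeit.timeit(lambda: deserialize(wire), number=number) / number
    return (ser + de) / 2.0


def share_won(costs_a, costs_b):
    """Fraction of messages for which framework A is cheaper than framework B.

    Both arguments map a message name to its average cost.
    """
    wins = sum(1 for name in costs_a if costs_a[name] < costs_b[name])
    return wins / float(len(costs_a))


# Stand-in message and codec, for illustration only.
sample = {"id": 7, "name": "stuff", "values": list(range(50))}
cost = average_roundtrip_cost(json.dumps, json.loads, sample)
print("average cost per call: %.2e s" % cost)
```

With one such cost per message and per framework, `share_won` yields the kind of percentage quoted in the list above.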

So there is currently a clear winner of this benchmark. We will be happy to rerun it to check whether things have changed.

AmvTek blog
