Thespian

MC706.io

Thespian is an actor model framework written in Python by GoDaddy. Before I get deep into how it works and its features, I’m going to go over a bit of my history with the Actor Model, to give some background and context on why I love this framework.

I was introduced to the actor model in full when working on my first project at Jornaya. I was tasked with building an Actor System in Scala on Akka. The system was to consume a Kafka stream, where a coordinator would route related events in the stream to event-specific actors. Each actor could then hold all of its related events in memory, significantly reducing round trips to the database. Actors would live for approximately twice the median related-event duration, and when they shut down, they would save their in-memory state out to the database. It was quite a genius solution to the problem of collating related data when you have a highly mixed stream. That project eventually got scrapped in favor of some more AWS-centric technologies, but I still loved the architecture.

A few months later I was working on a polling app, and we ran into a scaling problem with the analytics of poll results: collating the demographics of the people who answered the polls against their answers. There could be hundreds of polls running simultaneously, and we wanted users who answered a poll to be able to see trends based on location, race, age, gender, etc. The setup was serverless, with the API basically taking the response, serializing it, and adding it to an SQS queue so it could be processed and stored. However, the delay in the queue was too long for the preliminary analytics, so we had the API also dump the question, response, and demographic data to an actor system, which could keep an analytics representation of each question in memory and increment internal counters for each demographic.

Since most of the team was Python, I was looking for a Python actor system. The first I came across was pykka, which by its name I assumed was the Akka for Python. Pykka worked pretty well, was super easy to set up, was easy to read, and defining each Actor and what it did was super simple. Our problem with pykka arose when we tried benchmarking it. We simulated a heavy stream of responses, and it froze up. Turns out all of pykka’s actors are owned by the same Python process, which means you run straight up against the GIL. (The GIL in Python is the Global Interpreter Lock, which basically means only one thread can execute Python bytecode at any given time, so threads in a single process cannot spread CPU-bound work across cores.) So our search continued.

I then found Thespian.py, a project made by GoDaddy to manage their long-running processes in the background. It is built entirely in Python, but uses some other techniques to get around the GIL. For instance, each Actor in a Thespian Actor System is booted as a separate process, and the actors use TCP/IP to communicate with each other, allowing them to live in different processes on different host machines.

Thespian’s Actor System was a bit harder to set up. It required setting up a base system, choosing what communication protocol it used and what ports things communicated on, and calling into the system was a bit confusing. The documentation, while extensive, was very cryptic and hard to read and navigate. GoDaddy decided to use their own Actor puns for almost all of the assets of the system, and borrowed terms from other systems such as queues.

After setting up the actor system, the setup of the individual actors begins. The basic setup starts with creating a class that inherits from thespian.actors.Actor and implements a receiveMessage method with arguments message and sender. This receiveMessage method gets called whenever a message is sent to that actor, and gives you the message and the address of whoever sent it. The base Actor class has a send method which takes an actor address and a message. This way you can map out all of your actors, what they do when they receive messages, and who they forward a message to or respond to. To enter this system from the outside, at the Actor System level, you can ask or tell a specific actor a message. ask blocks and waits for a response, while tell fires the message into the system and continues.

The advanced usage of Thespian involves typed messages. Instead of extending thespian.actors.Actor, extend thespian.actors.ActorTypeDispatcher. Then for each type of message that actor can receive, implement a receiveMsg_<MessageType> method handling that specific case. Define each of the messages as a class, with an __init__ method specifying what is contained in that message. This way, the system is defined as a set of Actors and the message types that can be passed between them. Each Actor defines its reactions to each message type.

Some other advanced features we made use of were globalName and actor troupes. globalName is a parameter you can pass to createActor which tells the system to get or create that actor, instead of always creating a new instance of it. This is how we routed all questions of the same id to the same actor, no matter which server they came in on. Actor troupes, on the other hand, allow the system to horizontally distribute the work of an actor up to a certain size. Specified by decorating an Actor class with @troupe, a troupe creates up to a configured number of copies of that actor and distributes messages sent to it amongst them. We used this for translating API requests into routes throughout the actor system.

For the poll analytics system, we fronted the API with Falcon, a super fast and lightweight Python API framework. Falcon reuses the same resource object across requests, which allowed us to hold on to the entry actor addresses into the system, reducing noise in the logs and on the network. Once we got the system up and running, it was easily able to handle thousands of poll entries per second, while keeping analytics response times under 20ms.

The system was a very clever counting system that was able to handle everything in memory, requiring no lookups or database connections. For the initial system, the Falcon API had 2 endpoints. A /response endpoint just took POST data, converted the JSON to a dict, passed it into the troupe of translator actors, and responded 201. The other endpoint, /analytics/{questionid}, does a system.ask which routes through a set of actors to get to the question actor handling that questionId, which then dumps its internal memory state to JSON and sends that back to the asker.

Below the translator troupe, there was a single “cast” actor, which contained the actor addresses of all of the active question actors. It kept a counter for each of the question actors, which got incremented whenever a question was routed to it. It also scheduled a WakeupMessage for itself (via wakeupAfter), which is a special type of message that is delivered after a delay. This allowed me to set a heartbeat that applied exponential decay to all of the counters. Once a counter fell below a threshold, an ActorExitRequest was dispatched to that question actor, which would dump its internal memory state to a database and exit. This prevented the actor processes from running too long when they were inactive.
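The decay bookkeeping itself is plain Python. Here is a sketch of one heartbeat, with the function name, decay factor, and threshold all invented for illustration. In the actor, as far as I understand Thespian's API, this would run each time the wakeup scheduled with wakeupAfter arrives, and each expired id would get an ActorExitRequest.

```python
def decay_counters(counters, factor=0.5, threshold=0.1):
    """Apply one heartbeat of exponential decay to a dict of counters.

    Every counter is multiplied by `factor`. Ids whose counter drops below
    `threshold` are removed from the dict and returned, so the caller can
    tell those question actors to save their state and exit.
    """
    expired = []
    for question_id in list(counters):  # copy the keys: we delete while looping
        counters[question_id] *= factor
        if counters[question_id] < threshold:
            expired.append(question_id)
            del counters[question_id]
    return expired


if __name__ == "__main__":
    activity = {"q1": 1.0, "q2": 0.15}
    activity["q1"] += 1  # a response was just routed to q1
    print(decay_counters(activity), activity)
```

Because every routed message bumps a counter by one and every heartbeat scales it down, a busy question stays alive indefinitely while an idle one falls under the threshold after a few heartbeats.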

One of the features of Thespian that I have not yet used is the hot swapping of modules. Since all of the code is loaded once an actor system is booted, and the system can span multiple machines, normally you would have to shut down the whole system to deploy new code. It is not ideal to have downtime for deploys, especially if there are separate processes running inside of the same actor system. So Thespian offers an actor hot-swapping system, which allows you to send a message to an actor along with a location of a zip of the new code. Once the Actor is ready, it will exit and reload itself with the new code. This allows you to update code modules without having to reboot the entire actor system.

If you are using Python 3.5+, I highly suggest using type hints everywhere. They make it very easy to document what your actors take and what they send. They also make it super easy to make sure that you are accessing the correct data, and editor autocomplete on messages makes the whole process super easy.
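For example, a type-hinted message and a helper that consumes it might look like the sketch below. The PollResponse fields and the counter_keys helper are hypothetical; the point is only that the annotations document the message's shape and let the editor complete response.answer and friends inside a handler.

```python
from typing import Dict, List


class PollResponse:
    """Hypothetical typed message carrying one poll answer."""

    def __init__(self, question_id: str, answer: str,
                 demographics: Dict[str, str]) -> None:
        self.question_id = question_id
        self.answer = answer
        self.demographics = demographics


def counter_keys(response: PollResponse) -> List[str]:
    """Build one counter key per demographic field for this answer."""
    return [
        "{}:{}:{}".format(field, value, response.answer)
        for field, value in sorted(response.demographics.items())
    ]


if __name__ == "__main__":
    example = PollResponse("q1", "yes", {"age": "25-34", "gender": "f"})
    print(counter_keys(example))
```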

Overall, for systems that actors work well in, I would highly recommend Thespian. The Grigori project, which will be releasing soon, uses Thespian as its way to do concurrent and async work instead of multithreading, multiprocessing, or tornado. Compared to tornado, I find reasoning about the flow of an app super easy. Grigori’s time to scrape a site of 960 URLs went from ~30 minutes to under a minute.