Python Data Records

In a codebase, whenever I find myself using a pattern two or more times, I abstract the concept into a module that can be shared; this is DRY (Don't Repeat Yourself) 101. Sharing code instead of copy-pasting it shrinks the footprint where bugs can be introduced, eliminates a lot of copy-paste bugs, reduces the chance of weird behavior from code that missed an update, and makes changing your application safer and less complicated.

But what happens when that shared module becomes useful enough that you end up copying the whole module from project to project? With every subsequent project you use it in, the code gets refined, refactored, and extended, leaving the implementations in previous projects behind. At that point, it is time to build a distributable package.

For a while now at Jornaya, I have been pushing the use of internal Python packages, distributed via our internal PyPI hosted on Artifactory. It has greatly cleaned up a lot of our workflows. Many of our testing and load-testing libraries shared a similar set of code that had been copied and pasted from repo to repo until we built a common testing package. After that, most of the testing repos were simple declarative test suites, which called common functions in order. If we want to change a component that affects all of the others, we update the core testing library to change how that component is called, and all of our test suites can be updated just by unpinning a dependency. We have used a similar pattern to abstract data calls away from directly calling boto3 into a package which is now in almost every Python repo. As a result, when we updated the common data package to lazily handle pagination using generators, all of our Python projects saw the benefit with little change.

But there are a few bits of code that get reused between both my personal work and my work at Jornaya. These code bases have nothing to do with work per se, except that they make a task at hand simpler or allow the complex parts to be hidden away. One of these libraries was functional-pipeline, which grew out of a fascination with pipes in other functional languages. We started using it at work because of the ease with which it can process large chains of functions while maintaining readability.

The next in this line of open source packages is data-records. Shortly after the release of Python 3.7, and seeing that __annotations__ now returns more helpful information, we built a base dataclass that could convert a dictionary to a dataclass, ensuring that all of the annotated fields are populated appropriately.
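A minimal sketch of that idea (my reconstruction, not the actual data-records code): read the class's __annotations__, check that every annotated field is present in the dictionary, and build the instance from it.

```python
class DictRecord:
    """Hypothetical base class: build an instance from a dict
    using the subclass's type annotations as the field list."""

    @classmethod
    def from_dict(cls, data: dict) -> "DictRecord":
        # __annotations__ maps annotated field names to their hints.
        missing = [f for f in cls.__annotations__ if f not in data]
        if missing:
            raise TypeError(f"missing fields: {missing}")
        obj = cls.__new__(cls)
        for field in cls.__annotations__:
            setattr(obj, field, data[field])
        return obj


class User(DictRecord):
    name: str
    age: int


u = User.from_dict({"name": "Ada", "age": 36, "extra": "dropped"})
print(u.name, u.age)  # Ada 36
```

Fields the class does not annotate are simply never copied over, which is the same "ignore extras" behavior described later for the generated init.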

A few iterations later, we had a base class that could coerce types using safe conversions. This saved us a lot of headaches when reading a particular schema both from DynamoDB and from our Athena backups of old DynamoDB tables. Athena will only ever return string types because all of the results are passed through CSVs first; no matter the types going in, you come out with strings. Having the base class look at the type hint, look at the string, and attempt to convert the string to that type hint allowed us to load all of the data, regardless of source, into the same dataclasses and use them throughout the rest of the application safely without incurring runtime type errors.
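The coercion step can be sketched like this (a toy version under my own assumptions, not the library's actual conversion rules):

```python
def coerce_type(value, hint):
    """Attempt a safe conversion of value to the annotated type."""
    if isinstance(value, hint):
        return value
    # All Athena results arrive as strings, so most coercions are
    # simple constructor calls: int("42"), float("0.87"), etc.
    return hint(value)


# Everything comes out of the CSV layer stringly typed:
row = {"user_id": "1001", "score": "0.87"}
hints = {"user_id": int, "score": float}

coerced = {k: coerce_type(v, hints[k]) for k, v in row.items()}
print(coerced)  # {'user_id': 1001, 'score': 0.87}
```

Data that already arrives typed (for example from DynamoDB directly) passes the isinstance check and is returned untouched, so both sources land in the same shape.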

Once we realized we could get that level of type safety by putting all of the defensive type handling into the base class, that base class quickly got moved from project to project. While Python supports multiple inheritance, the pattern of sharing a base class purely for the behaviors it offers felt a little weird. What I wanted was a way to use our type-coerced data wrappers similarly to how we use the @dataclass decorator.

So I started a repo and started writing documentation. I dislike that dataclasses are mutable by default. It feels like dataclasses were trying to be everything for everybody, meaning solving any one problem requires a lot of configuration. That is fine for something in the standard library, but I wanted my solution to be simpler and more specialized. I looked at how Elm handles data in Records: completely immutable and typed, any change results in a new copy, and parts of a record can be pattern matched to pull fields out. I documented all of the examples of how I wanted this library to behave in markdown using the doctest syntax.
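Writing the docs in doctest syntax means the examples are executable from day one. Here is an illustrative fragment (not the actual data-records docs) run through the stdlib doctest machinery, which treats every >>> line as a test:

```python
import doctest

# A markdown-style snippet with doctest examples embedded in it.
md_snippet = """
Updating a record never mutates the original:

>>> point = {"x": 1, "y": 2}
>>> {**point, "y": 5}
{'x': 1, 'y': 5}
>>> point
{'x': 1, 'y': 2}
"""

# Parse the examples out of the text and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(md_snippet, {}, "md_snippet", None, 0)
runner = doctest.DocTestRunner()
runner.run(test)
print(runner.failures, runner.tries)  # 0 3
```

This is the same mechanism behind running doctest against a markdown file directly, so the documented behavior and the tested behavior cannot drift apart.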

Next I started digging into @dataclass. I knew from talks and general reading that it was doing a sort of “code generation” or “macro programming”, but I didn't know how that was being done in Python. After some digging, I found out the whole thing is powered by exec(), which immediately scared me. When learning Python or other languages, one of the first things you learn is to avoid eval and exec; yet here it was, being used in the stdlib. After getting over the shock, I realized that the scope of the exec was being defined explicitly, preventing arbitrary code from being executed, and that the templates being passed into it were tightly controlled.
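The containment is easy to demo: the source is a fixed template with nothing user-supplied interpolated in, and exec writes its result into an explicit namespace dict rather than the surrounding module scope. This is a simplified illustration of the pattern, not the dataclass source itself:

```python
# A fixed, fully controlled template.
template = (
    "def greet(name):\n"
    "    return 'hello ' + name\n"
)

namespace = {}
# Both the globals and locals for exec are explicit dicts, so the
# generated function cannot read or clobber our module's scope.
exec(template, {}, namespace)

greet = namespace["greet"]
print(greet("world"))  # hello world
```

The generated function is then plucked out of the namespace dict and attached wherever it is needed, which is essentially how the dataclass machinery installs __init__, __repr__, and friends onto the decorated class.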

I started up PyCharm’s automatic test execution and began combining elements of my coerced-type base class and the code generation of @dataclass. I generated the init the same way dataclass does with frozen enabled, except that I wrapped each assignment in a helper coerce_type function, which takes a value and a type hint and attempts to make the value conform to the hint. I added my from_dict and an additional from_iter to allow data records to be built in a map. I added an ignored **kwargs to the __init__ function so extra fields can be safely passed in and ignored.
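Roughly what that generation looks like (a simplified sketch under my own assumptions, not the real implementation; here the hints travel through the exec scope, while the post goes on to thread them through private default arguments instead):

```python
def coerce_type(value, hint):
    """Toy coercion helper: pass through, converting when needed."""
    return value if isinstance(value, hint) else hint(value)


def make_init(hints):
    # hints maps field name -> annotated type, e.g. {"x": int}
    args = ", ".join(hints)
    body = "\n".join(
        f"    object.__setattr__(self, {name!r}, "
        f"coerce_type({name}, _hints[{name!r}]))"
        for name in hints
    )
    # **_ignored swallows extra fields so they are safely discarded.
    src = f"def __init__(self, {args}, **_ignored):\n{body}"
    namespace = {}
    exec(src, {"coerce_type": coerce_type, "_hints": hints,
               "object": object}, namespace)
    return namespace["__init__"]


class Point:
    x: int
    y: int


Point.__init__ = make_init(Point.__annotations__)
p = Point(x="3", y=4, extra="dropped")
print(p.x, p.y)  # 3 4
```

Using object.__setattr__ in the generated body is the same move a frozen dataclass makes: the assignment bypasses whatever __setattr__ restrictions the class itself imposes.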

Slowly the documentation examples began to pass. The most difficult part of the whole endeavor was getting the type hints to work properly at initialization time. When reading __annotations__ in the decorator, I got a Python type back, so if I had foo: str, I would get <class 'str'>. However, when I tried interpolating that into the generated __init__, it came through as the text "<class 'str'>" and coerce_type would fail. I ended up getting around this by adding references to the types in the arguments themselves (argument defaults are evaluated as part of the function declaration, not when the function is called) and hiding them in private arguments with default values.

Private arguments are arguments that start with __. They do not show up in the signature, which keeps the interface to the class clean. However, since they default to the type hint itself, at execution time of __init__ I still have a reference to the original type hint of the class.
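A sketch of the trick with assumed names (the real generated code differs): the type object lives in the exec scope only long enough to be captured as the default of a __-prefixed, keyword-only argument, so the generated source never has to spell the type out as text.

```python
def make_init(field, hint):
    # Interpolating the hint into the source would produce the text
    # "<class 'str'>", which is not valid Python. Instead, bind the
    # real type object as the default of a "private" __-prefixed
    # argument: defaults are evaluated when the function is defined,
    # so __init__ keeps a live reference to the type afterwards.
    src = (
        f"def __init__(self, {field}, *, __{field}_type=_hint, "
        f"**_ignored):\n"
        f"    self.{field} = ({field} "
        f"if isinstance({field}, __{field}_type) "
        f"else __{field}_type({field}))\n"
    )
    namespace = {}
    exec(src, {"_hint": hint}, namespace)
    return namespace["__init__"]


class Tag:
    label: str


Tag.__init__ = make_init("label", str)
t = Tag(42)           # 42 is coerced to "42" via the captured hint
print(repr(t.label))  # '42'
```

Making the private argument keyword-only means a caller cannot accidentally override the captured type by passing one positional argument too many.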

Once I got the private arguments working to persist the type hint past the declaration of the function, the rest came easily. I added .replace and .extract methods that mimicked the behavior of Elm records. I continued adding documentation, doctests, and unit tests until I reached 100% coverage. I then followed the deploy setup I developed for functional-pipeline to publish the package to PyPI and Read the Docs.
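The semantics those two methods aim for can be mimicked with a frozen dataclass (a hypothetical stand-in with guessed signatures, not the published data-records API): .replace returns a fresh copy with some fields changed, and .extract pulls a subset of fields out, like pattern matching a record in Elm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: int
    y: int

    def replace(self, **changes):
        # Build a new instance; the original is never mutated.
        fields = {f: getattr(self, f) for f in self.__annotations__}
        fields.update(changes)
        return type(self)(**fields)

    def extract(self, *names):
        # Pull selected fields out as a tuple for unpacking.
        return tuple(getattr(self, n) for n in names)


p = Point(1, 2)
q = p.replace(y=5)        # new record, p is unchanged
x, y = q.extract("x", "y")
print(p, q, x, y)  # Point(x=1, y=2) Point(x=1, y=5) 1 5
```

Because the class is frozen, attempting to assign to q.y directly raises, which is exactly the Elm-style guarantee: the only way to "change" a record is to make a new one.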

Overall the project was a lot of fun. I learned about private arguments in Python, and more about macro programming in Python than I ever cared to. If you want to keep feeling safe about the standard library, I would suggest not looking under the covers.