Rebecca's Journal
Human Readable Archive Format

I follow a few Google engineers on Twitter and one posted the other day about a spec for a human-readable archive format they designed.

The spec in question is google/hrx. It defines an archive format that is human-readable, which is something I had almost designed myself, so having an existing specification helped me a whole bunch.

First Impressions

It looked like a pretty straightforward format, and I decided it would work perfectly for what I had in mind: a backing store for static files on my web servers.

The Python Implementation

First Iteration

I started in Python 3, which I kept using for two versions of this implementation. The first iteration was hacky: it relied less on clever parsing and more on brute force. It’s not elegant, but at the time I was very tired. I spent about 6 hours on it. It wasn’t anything special and didn’t handle weird edge cases particularly well. I started thinking about how I would fix these little issues, but then realized there were more of them than I had originally thought. I decided to build an entirely new version from the ground up, this time relying on more consistent parsing tools and on more common Python idioms, such as list comprehensions.

A comment about Regular Expressions

I used regular expressions to parse the input since, in my experience, regular expressions are significantly faster for parsing bulk text. A normal parser could potentially start trying to comprehend an archive within an archive, then crash out with an unexplainable error, so I decided to take a more holistic view of the document. This also let me focus far more on the application logic rather than fussing with the oddities of whatever parsing engine I would have had to use. I also felt that a parser I wrote from scratch would inevitably be slower than a regular expression engine that was implemented and tested for use in production applications. The only major downside was that I had to use regular expressions, but I felt strong enough at those to go for it.

Second Iteration

I walked away from most of my brute-force parsing and moved to a parser that leaned on regular expressions much more than the original did. A couple of expressions did most of the work, and they made my life both easier and a little trickier. I wrote an expression to match any valid block in the source file. I never thought about also writing an analogous matcher for invalid blocks, so I didn’t, which bit me: I have to split the original input on the valid blocks, then treat whatever is left over as invalid blocks. It works, but it’s incredibly hacky. It’s certainly better than the first iteration, though. I didn’t benchmark either version, but this one relies on built-ins more than on my custom code, which lends itself to portability.
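To give a feel for what regex-driven parsing of this format can look like, here is a minimal sketch. It is not the expressions I actually used, and it is written in JavaScript rather than Python. The function name parseHrx and the error messages are made up for illustration, and it only covers a simplified reading of the spec: it finds header lines, skips comments and directories, and doesn’t validate paths.

```javascript
// Minimal sketch of regex-based HRX parsing (simplified, not spec-complete).
// `parseHrx` is a hypothetical name used only for this illustration.
function parseHrx(text) {
  // The first boundary (for example "<===>") fixes the boundary for the whole archive.
  const first = /^<=+>/.exec(text);
  if (first === null) {
    throw new Error('not an HRX archive: no leading boundary');
  }
  const boundary = first[0]; // only "<", "=" and ">", so it is safe inside a RegExp

  // A line that starts with the boundary but is followed by neither a space
  // nor the end of the line is treated as a malformed header.
  const malformed = new RegExp('^' + boundary + '(?! |$)', 'm');
  if (malformed.test(text)) {
    throw new Error('malformed header line');
  }

  // Valid header lines: "<===> path" (file or directory) or "<===>" alone (comment).
  const header = new RegExp('^' + boundary + '(?: (.+))?$', 'gm');
  const headers = [...text.matchAll(header)];

  const files = new Map();
  headers.forEach((match, i) => {
    const path = match[1];
    // Skip comments (no path) and directories (path ends with "/").
    if (path === undefined || path.endsWith('/')) return;

    // The body starts after the newline that ends the header line and stops
    // before the newline that precedes the next header (or at end of input).
    const bodyStart = match.index + match[0].length + 1;
    const bodyEnd = i + 1 < headers.length ? headers[i + 1].index - 1 : text.length;
    files.set(path, text.slice(bodyStart, Math.max(bodyStart, bodyEnd)));
  });
  return files;
}

const sample = [
  '<===> a.txt',
  'hello',
  '<===> dir/',
  '<===>',
  'a comment',
  '<===> dir/b.txt',
  'line1',
  'line2',
].join('\n');

const parsed = parseHrx(sample);
console.log([...parsed.keys()]);
```

The result is a plain Map from path to contents, which is about the simplest shape that an in-memory file system could later be built from.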

I realized that the design of the Python parser was an exercise in futility, since the Python ecosystem didn’t really have a way for me to create an in-memory filesystem that I could just use. I’d have to implement large swaths of the application myself anyway, so what was the point? I then looked at where it would work best. I started on Go, rethought it, considered Scala, and lastly considered, and eventually settled on, JavaScript, particularly NodeJS.

The NodeJS Implementation

Full disclosure: I had never really used NodeJS before. I knew JavaScript, but very little of what it takes to use it on the server side. So a lot of this may sound obvious, but these were the things I encountered, or rather, the things I expected and didn’t find. I used to manage .NET applications, and wrote a whole host of applications to patch these services both online and offline. I was used to being able to change the way the language handled importing other files and operating on them, and I had become accustomed to designing shims for hot-patching applications. NodeJS didn’t have a great facility to intercept all calls to a specific library and redirect them to another place. I eventually stumbled across a few packages that would let me do this, but more on that later.

The NodeJS implementation was more or less a direct port of the second Python iteration, using the same heavy regular expressions. It worked about the same as the Python one, so there were no surprises there. I used the memfs package to provide an fs-compatible result for the output of the parser. At this point, many of the tasks I wanted to do could be accomplished, and the code I had already written would need little, if any, modification to support this new format. Success! Or so I thought.

The Big Issue

I’m calling this “The Big Issue” because it was the one thing sitting between me and what I wanted to achieve that I didn’t know how to solve.

I needed to patch require('fs') statements selectively, so that only certain ones were patched. I focused on combing through the sources of express.js, the static file serving features in particular, which led me to the send library. I found that this was the central module I needed to modify, at least for my case. I tried to think of how to patch it, but never could think of a great way to do it. I then realized that I could grab the source of send and tweak it a teeny bit. I added the below code to it, and now it works wonderfully.

module.exports.fsSwap = fsSwap

// Replace the module-level fs binding that send uses for file access.
function fsSwap (toFS) {
  fs = toFS;
}

Easy. This let me call fsSwap on the module immediately after it was imported. I thought “Success! It works!” but then realized I had achieved little, as a result of tunnel vision. Express was importing send, not me. Oh no. I eventually found a module called intercept-require, which let me register an intercept handler for require statements.

Perfect.

I designed one that would intercept any require('send') statements and replace them with my slightly patched version, then load it with the toFS object I wanted. As a result, I achieved what I was going for, and I’m extraordinarily happy I managed to do so. This is actually demonstrated in the express_inject_demo.js file in my hrx.js repository.
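The overall shape of that trick can be sketched with nothing but Node’s built-in module system. This is not the intercept-require API and not the code from my repository. It swaps in a different, well-known technique (wrapping Module.prototype.require in a CommonJS script), and makePatchedSend, makeMemoryFs and interceptRequire are hypothetical names invented for the illustration. The stand-in for send is just an object with a read function, whereas the real send exports something quite different.

```javascript
const Module = require('module');

// Stand-in for the patched copy of `send` (hypothetical, heavily simplified):
// it keeps its file system in a closure variable so fsSwap can replace it.
function makePatchedSend() {
  let fs = require('fs');
  function fsSwap(toFS) {
    fs = toFS;
  }
  function read(path) {
    return fs.readFileSync(path, 'utf8');
  }
  return { fsSwap, read };
}

// Tiny fs-like object backed by a Map. It stands in for a real in-memory
// file system and only implements the single call used above.
function makeMemoryFs(files) {
  return {
    readFileSync(path) {
      if (!files.has(path)) throw new Error('ENOENT: ' + path);
      return files.get(path);
    },
  };
}

// Wrap Module.prototype.require so that require(targetId) anywhere in the
// process returns `replacement`. Returns a function that undoes the wrap.
function interceptRequire(targetId, replacement) {
  const originalRequire = Module.prototype.require;
  Module.prototype.require = function (id) {
    if (id === targetId) return replacement;
    return originalRequire.apply(this, arguments);
  };
  return function restore() {
    Module.prototype.require = originalRequire;
  };
}

// Build the patched module, point it at the in-memory files, then hook require.
const patchedSend = makePatchedSend();
patchedSend.fsSwap(makeMemoryFs(new Map([['/index.html', '<h1>hi</h1>']])));
const restore = interceptRequire('send', patchedSend);

// Any later require('send'), including one made by another package,
// now receives the patched object.
console.log(require('send').read('/index.html'));
```

Calling restore() puts the original require back, which is worth doing in tests so the hook doesn’t leak into unrelated code.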

My Takeaways

I tend to gravitate towards implementing a tool how I envision it, generally doing things that are uncommon to get there. I often don’t reach for the usual tool for the job, but I get things working in the end. I feel like so much of my own programming experience is rooted in embedded systems that I immediately think of how I’d implement something on those devices, making the code as compact as possible, as opposed to realizing that computers have performance nowadays and I can be a bit more verbose.

This is also part of my design personality of wanting to prove something works to myself, even in a very limited capacity, before I even propose the idea to a large group of contributors. I don’t want to just design some spec without an implementation. I like to implement it first. This gives me a good way to ensure that the spec works for me, and that there’s a reference implementation to test any ports or rewrites against. I’d quickly change features in the implementation if I felt it was difficult to adopt.

And finally, I learned NodeJS! Hooray! I had been meaning to for a while, but hadn’t really worked with it since the pre-1.0.0 versions. A bunch has changed, so it felt like a new language, which really helped me. Maybe I’ll learn other new languages and their coding styles the same way: port a project I’ve already written, then ask people who are familiar with the language to help me review it. It’d help me learn much faster, since I learn best when I’m just handed documentation, resources, and an objective.