Last year, as part of our work with OceanMind, we found ourselves in need of an AIS parser. AIS (Automatic Identification System) is the system by which commercial ships report their information. Its what enables sites such as https://www.marinetraffic.com to show, for example, where all the ships in the English Channel are right now. AIS messages use a binary format that is very efficient, but which is somewhat tricky to work with (especially since AIS receivers almost invariably convert these messages with an unusual ASCII binary encoding, so even if you have language or library features designed for the kind of bit twiddling these sorts of protocols typically require, you can’t use those directly on the messages in the format in which you’re likely to receive them).
Thanks to international agreements, AIS is effectively mandatory for a high proportion of commercial ships, so it is an important source of information for OceanMind. It’s also fairly high-volume. MarineTraffic reports receiving data from around 175,000 distinct vessels within the last week. Even when a vessel is stationary, its AIS transponder will provide reports at least once every 3 minutes. While vessels are moving their updates are more frequent—one every 10 seconds being normal, rising as high as every 2 seconds when changing course at speed. This means you’re looking at thousands of messages per second, or the order of one hundred million messages per day.
On the one hand, this is not extreme performance territory. People working on high-speed trading systems would probably question my “high-volume” assertion. In the past I’ve worked on systems distributing messages to hundreds of millions of receivers with real-time tolerances measured in the low single-digit nanosecond realm. (Exercise for the reader: can you guess what? There aren’t many systems like that, and it’s one you’ve almost certainly used. If you’ve lived in the UK in the last 20 years you may even have used the system I worked on.) However, it’s enough that the per-message processing costs can have a noticeable financial impact, particularly with consumption-based pricing models for cloud computing.
We went looking for existing .NET AIS libraries, and were unable to find any that could handle all of the message types we needed to be able to process. (AIS defines 27 message types, including varying levels of detail for location reporting, and data about the vessel and its current voyage. We didn’t need to handle every single type, but there were 8 types we did need.) AIS data is typically transmitted over relatively low-bandwidth channels, so it is a very efficient binary protocol but it’s rather idiosyncratic as a result, so it’s not surprising that most libraries don’t cover every single possibility. (Apart from anything else, it can be quite hard to find real examples of some of the more obscure message types.) There were libraries written in other languages that covered everything we needed, so it would have been possible to add preprocessing stages that used these to convert it into some more readily processed format such as JSON, but to keep operational costs low, we decided to write a .NET parser that had the features we needed.
Since we were building our own parser with an eye on low overheads, we decided that this was a good application for the memory efficiency techniques made possible by language additions in C# 7, corresponding runtime optimizations in .NET Core 2, and related specialized library types including
Span<T>. We had already implemented another parser earlier in the year using these techniques and knew that it was possible for a parser using these techniques to process millions of messages per second on a single core. (My recently-published book, Programming C# 8.0, from O’Reilly, has a chapter on these techniques, by the way.)
Performance and copying data
When processing messages in computing, minimizing copies is usually vital to performance. Early in my career, I wrote device drivers for network cards, and the way to achieve the best possible performance was always to make no more than a single copy. You wanted the network card to transfer data arriving over the network directly into the bit of memory from which the application will be processing the message; and when sending data, you wanted the network card to be able to read data directly from whatever memory the message was already in. There are challenges to doing this (especially with incoming messages where you can’t even know which application the message is destined for until you’ve inspected the header at the start of the message), but minimizing copying was always the single most important factor.
There are a couple of reasons why additional copying tanks your performance. One is simply that there’s a limited amount of bandwidth inside the computer for reading and writing data into and out of memory, and if you copy the data from one location to another, you’re burning memory bandwidth that might have been used more productively for something else. More subtly, if the CPU gets involved in any of the copying, this can wreak havoc with its caching mechanisms. The large disparity between the speed at which modern CPU cores can execute instructions and the time it takes to fetch information from a computer’s main memory—typically a difference of 3 orders of magnitude—means that caching is critically important to performance, so anything that reduces cache efficiency is bad for performance.
With garbage collecting runtimes such as .NET, there’s another potential problem with copying: if copies end up in newly allocated heap blocks, you’re creating additional work for the garbage collector. This can be particularly unhelpful if you are hoping to take advantage of parallel processing to improve throughput. I’ve seen systems which had excellent peak throughput thanks to parallelism, but which unfortunately rarely hit those peaks because they spent large amounts of time with all CPU cores stalled while they sat and waited for GCs to finish. Modern GCs are much better at running concurrently with other work than they were a couple of decades ago but even on modern runtimes, workloads that allocate very large numbers of objects tend to limit the extent to which increasing the number of cores will increase your performance.
All of this means that the most popular strategy for processing incoming data is not very performance friendly: the usual practice is to convert the incoming data stream into objects, often allocating large numbers of strings along the way. The results are often highly convenient to work with, but this approach involves large amounts of copying and heap allocation—exactly what we don’t want.
Process data in situ with Span<T> and friends
If you want to wring as much performance as possible from your CPUs, and if you want to maximise opportunities for parallelism, you’ll want to avoid copying where possible. This means working with the data in situ. Ideally, data will arrive in main memory from some peripheral via DMA (Direct Memory Access, i.e., a network card or disk controller dumps the data directly into memory with minimal intervention from the CPU), and then the CPU will process it directly. Instead of deserializing the data into some .NET-friendly object model, we’ll just look at the data where it is. But this is easier said than done.
Span<T> type introduced in .NET Core 2.1, is designed for exactly this sort of in-situ work. It is a
struct (meaning you don’t need to create a heap block to hold one) that refers to data that sits inside something else. Its special power is that it can refer not just to data inside an existing heap block (e.g., some subsection of an array or string), it can also refer to data on the stack, or even in memory entirely outside of the .NET runtime’s control. This makes it ideal for avoiding copies: it is designed to work with data wherever that data already happens to be.
The price of its special powers is that there are restrictions on how you can use a
Span<T>. In cases where it is pointing into the middle of some object on the GC heap, it needs to be able to keep that object alive, and the .NET GC is mostly built around the idea that references point to the start of an object. When something points into the middle of an object, that’s tricker. In fact the GC has always had to cope with such things because array access involves exactly this kind of pointer, but it only supports that sort of ‘interior pointer’ living on the stack. It is not able to process interior pointers that live inside objects that are on the GC heap, and so the big restriction on
Span<T> is that it is a type that can only ever live on the stack. So you can’t declare
class that has a field of type
Span<T>. More subtly, you can’t put one inside most
struct types either because those can end up on the heap either as a result of being boxed, or just because they appear as fields in a class.
.NET defines a special kind of value type for this scenario called a
Span<T> is just such a type. And the thing about a
ref struct is that it can only be used either as a local variable, or as a field of another
ref struct type. The upshot of these restrictions is that a
ref struct will always live on the stack. (There’s a widely-held but mistake belief that all value types live on the stack. This is demonstrably untrue, but it really is true for
ref struct types—that’s their entire purpose.)
Ais.Net and the Hollywood principle
Span<T>‘s in-site nature makes it ideal for high performance processing, it tends to make for somewhat peculiar APIs. If you define a type that represents an AIS message, and which parses the message using
Span<T>, that type itself is going to need to be a
ref struct, meaning it can only live on the stack. The rules that the C# compiler defines to ensure that any
ref struct does not outlive any other
ref struct that it refers to tend to mean you often can’t simply to return one of these things as the result of a function: the compiler might complain that it cannot guarantee that whatever memory a
Span<T> refers to will continue to be available after a method returns. (E.g., what if the
Span<T> referred to something on the stack frame of the function that’s now returning?)
It turns out that the simplest way to avoid running into these kinds of problems is to use what’s sometimes called the Hollywood principal, named for the instructions often given to aspiring actors after auditions: don’t call us, we’ll call you.
For example, the most straightforward way to process messages in Ais.Net is to implement the
INmeaAisMessageStreamProcessor interface. You can then pass this to
NmeaStreamParser.ParseFileAsync, which will read every single Ais.Net message (encoded in NMEA format, which is the ASCII-encoded version of the binary; our code works directly with that format because it’s the form these messages are most commonly available in, and we wanted to avoid the extra copy that decoding the entire message from ASCII to binary up front would entail) from a file, and pass each message in turn to your processor. This method uses .NET’s
System.IO.Pipelines features to read the file efficiently and process its contents in situ.
As you can see from the source for NmeaStreamParser, using
System.IO.Pipelines is a little involved, particularly when dealing with messages that happen to get split across buffer boundaries, as will tend to happen. And the nature of this code is that none of the spans created by this stream parsing could be passed out as return values, because of the restrictive lifetime rules around
ref struct types. But by requiring consumers to pass in an interface that this code can call, we avoid all such problems, because the parser is only ever passing
ref structs into the methods that will use them.
As the InspectMessageType code from our benchmark project shows, it’s pretty straightforward. You get passed in an
ref struct that provides basic information about the message, along with a
NmeaLineParser<byte> providing access to the message payload. If it’s a message type that interests you, you can then wrap it in one of the more specialized decoders such as NmeaAisPositionReportClassAParser The ReadAllPositions class, also from the benchmarking project, shows how to do this with various message types that include location information.
So ultimately, we’re just constructing, say, an
NmeaAisPositionReportClassAParser on top of a span containing the message, and then retrieving data from properties such as
Latitude10000thMins. (There will necessarily be a bit of copying at this point, because we need to extract the data from the message’s somewhat idiosyncratic representation and turn it into a binary number in a form that .NET can work with. But the key is that we only need to read the information once, and you were going to need to do that at least once in order to process it.)
Where things get more tricky is with text properties. Some messages include text, such as the vessel name. The obvious way to represent this in .NET would be a property of type
string, but the problem with that is that it would need to allocate a heap block—strings always live on the heap in .NET. So instead, fields of this kind are representing with properties of type NmeaAisTextFieldParser. As with everything else, this inspects the data where it lies. (This gets a bit weird because the AIS binary format defines its own rather unusual 6-bit text encoding, and then NMEA presents the AIS binary with its own ASCII-based encoding. But our code handles all that for you.) It deliberately doesn’t provide an easy way to get hold of the information in
string form because that would lead people down an inefficient route.
If you really want a .NET string it’s not that hard: you just need to fetch all the ASCII into an array and then pass that to .NET’s
So where does all this use of
ref struct, minimal copying, and going out of our way to avoid strings get us? Well on my desktop, the simplest of the benchmarks is able to inspect every single message in a file with 1 million entries in under 300ms. That only looks at the message types. The one that actually looks at some of the fields is a little slower, at 380ms, but that still corresponds to about 2.6 million messages per second on a single core, and, as the benchmarking shows, although there are obviously some up-front allocations, we have no recurring per-message GC overhead. Not too shabby.