Wednesday, December 08, 2010

An Erlang Flow File Parser

You may recall that, a long time ago, I wrote that FlowScan is looking a little long in the tooth. One of the major problems with the package is that it uses a single thread to process flow files serially; on modern servers this means that you have multiple cores sitting idle during processing. When I was thinking about this problem it occurred to me that writing a FlowScan replacement in Erlang might solve at least part of this problem, since the Erlang runtime automagically spreads concurrent processes across multiple cores.

I'm nowhere near having a full-blown replacement for FlowScan yet, but I've finally got around to implementing a set of Erlang modules which solves part of the problem: parsing flow files. Given the interest that my posts on Erlang and flow file processing have generated (relative to the rest of the crap I write) I figured I'd commit the code to the web for posterity.

One of the first challenges I had to deal with in putting these modules together was how to actually structure the code. Parsing any particular file generated by flow-capture turns out to be pretty easy (especially in Erlang), but this is complicated by the fact that there are two different flow file formats, each of which can encapsulate any one of several NetFlow export formats. Were I programming in an object-oriented language, this problem could be easily solved by applying the factory pattern in conjunction with appropriate base and derived classes. Alas, Erlang doesn't have classes or inheritance, so that idea was a non-starter.

As I've noted before there's a relative dearth of information related to Erlang best practices, so I ended up having to do a fair amount of experimentation before I arrived at an approach that was satisfactory. I started with a single, monolithic file, but it got very large very quickly and made it difficult to come up with concise, meaningful names by which to distinguish functions for different format/export versions. This led me to break the code up into format-agnostic and format-specific modules using Erlang's package mechanism [1], and then repeat that trick again when it came time to handle different NetFlow export formats. The result is a layout that should look familiar to Perl programmers: a module "X.erl" combined with a subdirectory "X" containing all of X.erl's sub-modules.
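To make the idea of version-based delegation concrete, here is a minimal sketch of how pattern matching can stand in for a factory. It is not the code from flow_file.erl: the module name flow_dispatch and the functions decoder/2, decode/3 and decode_v3_v5/1 are all hypothetical, and everything is squeezed into one flat module so that it runs on its own. In a layout like the one described above, each clause would point at a separate format-specific module instead.

```erlang
-module(flow_dispatch).
-export([decoder/2, decode/3]).

%% Pick a decoder for a {file format, NetFlow export version} pair.
%% Each clause stands in for what would be a separate module in a
%% multi-module layout. All names here are made up for illustration.
decoder(3, 5) -> {ok, fun decode_v3_v5/1};
decoder(Format, Export) -> {error, {unsupported, Format, Export}}.

%% Look up the decoder and apply it, passing lookup errors through.
decode(Format, Export, Bin) ->
    case decoder(Format, Export) of
        {ok, Fun} -> Fun(Bin);
        {error, _} = Error -> Error
    end.

%% Placeholder decoder: reports the payload size instead of parsing it.
decode_v3_v5(Bin) when is_binary(Bin) ->
    {ok, {v5_payload_bytes, byte_size(Bin)}}.
```

Supporting another combination then means adding one decoder/2 clause plus the code it points to; nothing that calls decode/3 needs to change.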

The other code structure problem that I encountered was in deciding where to put the v3 I/O routines. I initially placed them in v3.erl, but that led to an awkward circular dependency between v3.erl and v5.erl: the former would delegate record parsing to the latter, which would then call the former to read the data off of disk. That was eventually resolved by moving I/O functions into their own module, io.erl, eliminating the dependency.

So, that's the brief story of my struggle with code structure. I ended up with the following:

  • src
    • buffer.erl: An improved version of the buffer module I originally wrote about here that contains the additional function length/1.
    • flow_file.erl: Format-agnostic functions. This does processing which is common to all format versions, delegating format-specific operations to the appropriate submodule.
    • flow_file.hrl: Header file containing public declarations (records and such) needed to work with the flow_file module. This recursively includes all of the header files for submodules as appropriate.
    • flow_file
      • flow_file.erl: Defines a data type for passing around information about a flow file in a generic fashion. Sorry about the name, couldn't think of anything more appropriate.
      • utils.erl: Generic utility functions that aren't strictly related to parsing flow files.
      • v3.erl: Handles the specifics of parsing v3 format flow files. Delegates to the appropriate submodule when it comes time to read records in a specific NetFlow export format.
      • v3.hrl: Public definitions associated w/ v3 format files.
      • v3
        • io.erl: Handles all filesystem I/O specific to v3 format files.
        • v5.erl: Parses NetFlow export v5 records.
        • v5.hrl: Public definitions supporting processing of NetFlow export v5 records.
  • test

Note that this code only handles format v3 and NetFlow export v5; those are the most widely used versions and what I was interested in working with. It'd be trivial to expand to support other NetFlow versions; the only real work that the v5 code does is defining an appropriate record type for the flow and then populating that record by decomposing a chunk of binary data read from the file.
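For a feel of what "define a record, then decompose a binary" looks like, here is a rough sketch. It is not the code from v5.erl. It assumes the 48-byte flow record layout from Cisco's NetFlow v5 export (wire) format in network byte order, which is not necessarily the layout or byte order that flow-capture writes to disk. The module name nf5_sketch, the record nf5_flow and its field names are all my own inventions for this example.

```erlang
-module(nf5_sketch).
-export([parse_record/1]).

%% Hypothetical record; fields follow the NetFlow v5 wire-format flow record.
-record(nf5_flow, {src_addr, dst_addr, next_hop, input, output,
                   packets, octets, first, last,
                   src_port, dst_port, tcp_flags, protocol, tos,
                   src_as, dst_as, src_mask, dst_mask}).

%% Decompose exactly 48 bytes (big-endian, the bit syntax default).
parse_record(<<Src:4/binary, Dst:4/binary, Hop:4/binary,
               In:16, Out:16, Pkts:32, Octets:32,
               First:32, Last:32,          % sysUpTime (ms) at flow start/end
               SrcPort:16, DstPort:16,
               _Pad1:8, Flags:8, Proto:8, Tos:8,
               SrcAs:16, DstAs:16, SrcMask:8, DstMask:8, _Pad2:16>>) ->
    {ok, #nf5_flow{src_addr = ip(Src), dst_addr = ip(Dst), next_hop = ip(Hop),
                   input = In, output = Out,
                   packets = Pkts, octets = Octets,
                   first = First, last = Last,
                   src_port = SrcPort, dst_port = DstPort,
                   tcp_flags = Flags, protocol = Proto, tos = Tos,
                   src_as = SrcAs, dst_as = DstAs,
                   src_mask = SrcMask, dst_mask = DstMask}};
%% Anything that is not a 48-byte binary is rejected.
parse_record(_) ->
    {error, bad_record}.

%% Turn a 4-byte binary into the {A,B,C,D} tuple used by inet.
ip(<<A, B, C, D>>) -> {A, B, C, D}.
```

The whole parse is a single function head; supporting another export version would mostly be a matter of writing a different record and a different binary pattern.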

None of this is rocket science from a code-complexity standpoint, especially since Erlang's bit syntax makes the processing of binary data relatively painless. Parsing the dynamic header of v3 format files is somewhat involved, but is simplicity itself in comparison with the C implementation of the same process from flow-tools' ftio.c. The only other moderately tricky bit is handling compressed flow files. Erlang provides a zlib wrapper, but doesn't provide a mechanism for directly streaming compressed data off of disk. So I had to do a little work in io.erl (which is where the buffer module gets used) to read and decompress compressed flow files in chunks. The alternative would be to slurp in and decompress the entire file at once, a non-starter since flow files are often hundreds of megabytes in size.
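The chunked approach can be sketched roughly as follows. This is not the code from io.erl and it does not use the buffer module; it only shows the general shape using the standard file and zlib modules. It assumes the zlib stream starts at a known byte offset (for example, just past an uncompressed header) and runs to the end of the file. The module name inflate_sketch, the function fold_file/4 and the 64 KB chunk size are my own choices for the example.

```erlang
-module(inflate_sketch).
-export([fold_file/4]).

-define(CHUNK, 65536).  % compressed bytes read per step (arbitrary choice)

%% Fold Fun(PlainChunk, Acc) over the decompressed contents of the zlib
%% stream that starts at byte Offset in Path. Crashes on I/O errors, and
%% zlib:inflateEnd/1 raises if the stream turns out to be truncated.
fold_file(Path, Offset, Fun, Acc0) ->
    {ok, Fd} = file:open(Path, [read, binary, raw]),
    Z = zlib:open(),
    try
        {ok, Offset} = file:position(Fd, Offset),
        ok = zlib:inflateInit(Z),
        Acc = loop(Fd, Z, Fun, Acc0),
        ok = zlib:inflateEnd(Z),
        Acc
    after
        zlib:close(Z),
        file:close(Fd)
    end.

%% Read one compressed chunk, inflate it, hand the result to Fun, repeat.
loop(Fd, Z, Fun, Acc) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Compressed} ->
            Plain = iolist_to_binary(zlib:inflate(Z, Compressed)),
            loop(Fd, Z, Fun, Fun(Plain, Acc));
        eof ->
            Acc
    end.
```

Only one compressed chunk and its inflated output are in memory at a time, as long as Fun doesn't accumulate everything it sees. The inflated chunks won't line up with record boundaries, so a real reader still needs something to carry leftover bytes from one chunk to the next.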

And there you have it; I've copiously documented the source files, so the rest should be easy enough to figure out. For my next trick I'm probably going to focus on profiling the code to make the process of reading and processing the files more efficient.


[1] Packages are an experimental feature, but seem to work pretty well for the most part. I had some problems with the use of nested includes and edoc doesn't do a great job with resolving references for (sub)packages, but those are minor issues which might very well be attributable to me doing it wrong.
