Shiny Ideas: The Format of a Flow-Capture Flow File

A while ago I had cause to try writing a replacement for FlowScan, a tool that we use at my company to monitor traffic within our datacenters. FlowScan, while very useful, is a little long in the tooth. The problem we were running into is that the application is single-threaded; it would peg one CPU on our processing system but leave the others basically idle. That's not a good place to be in when you're trying to process traffic in real time.

The first big problem I ran into in this experiment was figuring out the format of the files which FlowScan uses as input. FlowScan consumes the files generated by flow-capture, part of the flow-tools package of NetFlow processing utilities. There's not a whole lot in the way of readily-accessible documentation on flow-tools' native file format; the best answer I'd found to date is this post which essentially says "read ftlib.h". Absent any better option I read ftlib.h; the results are presented here that others might benefit.

This analysis is based on flow-tools v 0.66; your mileage may vary with later versions. The bulk of the code for manipulating flow files is housed in two files in the source tree, ./lib/ftlib.h and ./lib/ftio.c. The files created by flow-capture have two parts: a header at the beginning of the file and then a stream of records representing the actual flows. Unsurprisingly, reading a flow file is a two-step process: you first call ftio_init (ftio.c:66) to parse the header information and set up an appropriate ftio structure, then you call ftio_read (ftio.c:851) to read the actual flow records.

So let's start with the header. If you poke around ftio_init you see that all the actual header processing is done via a call to ftiheader_read (ftio.c:2300). The first thing that ftiheader_read does is read an ftheader_gen (ftlib.h:465) called "head_gen" from the beginning of the file. This behavior is invariant; each flow file starts with 4 bytes corresponding to the following structure:

struct ftheader_gen {
  u_int8  magic1;                 /* 0xCF */
  u_int8  magic2;                 /* 0x10 (cisco flow) */
  u_int8  byte_order;             /* 1 for little endian (VAX) */
                                  /* 2 for big endian (Motorolla) */
  u_int8  s_version;              /* flow stream format version 1 or 3 */
};

No magic here. The first two bytes are always the same; they're magic numbers that can be used to verify that you're actually processing a flow file. The next byte tells you the "endian-ness" of the rest of the data in the flow file and the final byte tells you the stream format. As far as I've been able to determine v 0.66 of flow-tools produces v3 streams by default, so everything that follows assumes a v3 stream.
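As a rough sketch of that check (my own code, not anything from flow-tools), here is how you might validate those first four bytes. The helper name check_static_header and the FLOW_* macros are hypothetical; the values come straight from the comments in the struct above.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the comments in struct ftheader_gen. */
#define FLOW_MAGIC1        0xCF
#define FLOW_MAGIC2        0x10
#define FLOW_LITTLE_ENDIAN 1
#define FLOW_BIG_ENDIAN    2

/*
 * Hypothetical helper: returns 0 and fills in byte_order and s_version
 * if buf starts with a plausible static header, -1 otherwise.
 */
static int check_static_header(const uint8_t *buf, size_t len,
                               uint8_t *byte_order, uint8_t *s_version)
{
    if (len < 4)
        return -1;                      /* too short to hold the header */
    if (buf[0] != FLOW_MAGIC1 || buf[1] != FLOW_MAGIC2)
        return -1;                      /* not a flow file */
    if (buf[2] != FLOW_LITTLE_ENDIAN && buf[2] != FLOW_BIG_ENDIAN)
        return -1;                      /* unknown byte order */
    if (buf[3] != 1 && buf[3] != 3)
        return -1;                      /* unknown stream version */
    *byte_order = buf[2];
    *s_version  = buf[3];
    return 0;
}
```

Reading the bytes one at a time, rather than casting the buffer to the struct, sidesteps any question of struct padding.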

After the first four bytes (which I think of as the "static header") there's a u_int32 referred to as "head_off_d" in the source code. head_off_d contains the offset from the beginning of the file where the flow record information starts. This information can be used to compute the size of the v3 dynamic header (ftio.c:2387):

/* v3 dynamic header size */
len_read = head_off_d - sizeof head_gen - sizeof head_off_d;
len_buf = len_read + sizeof head_gen + sizeof head_off_d;

What this says is that the v3 dynamic header consists of (head_off_d - sizeof head_gen - sizeof head_off_d) bytes starting immediately after head_off_d.
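Here is a minimal sketch of the same arithmetic outside of flow-tools. It assumes head_off_d is stored in the byte order named by the byte_order field of the static header, and that both sizeof values are 4; decode_u32 and dynamic_header_len are names I made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 32-bit value; byte_order is 1 (little endian) or 2 (big endian). */
static uint32_t decode_u32(const uint8_t *p, uint8_t byte_order)
{
    if (byte_order == 1)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: given the first bytes of a flow file, store the
 * length of the v3 dynamic header in *dyn_len. Returns 0 on success and
 * -1 if the buffer is too short or the offset is impossibly small.
 */
static int dynamic_header_len(const uint8_t *file_start, size_t len,
                              uint32_t *dyn_len)
{
    if (len < 8)
        return -1;                       /* need static header + head_off_d */
    uint32_t head_off_d = decode_u32(file_start + 4, file_start[2]);
    if (head_off_d < 8)
        return -1;                       /* records cannot start inside the header */
    *dyn_len = head_off_d - 4 - 4;       /* minus head_gen, minus head_off_d */
    return 0;
}
```

For example, a head_off_d of 48 gives a 40-byte dynamic header occupying file offsets 8 through 47.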

Parsing the dynamic header is pretty straightforward; it's essentially a list of fttlv (ftlib.h:395):

struct fttlv {
  u_int16 t, l;         /* type, length */
  char *v;              /* value */
};

Each fttlv consists of a 2-byte integer indicating the type of the record (see ftlib.h:217 for constant definitions), a 2-byte integer indicating the overall length of the record, and then a variable number of bytes containing the actual record data. See ftio.c:2493 for the details of decoding and interpreting each type of record.

The dynamic header is aligned on a 4-byte boundary, so there may be up to 3 bytes of null padding at the end of the dynamic header after all the fttlv structures have been processed.
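To make the walk concrete, here is a sketch of scanning the dynamic header for one record type. Two assumptions to flag loudly: I'm treating the length field as counting only the value bytes (not the 4-byte type/length prefix), and I'm treating any tail shorter than 4 bytes as padding. If the length turns out to include the prefix, the offset arithmetic has to change. The names decode_u16 and find_tlv are mine, not flow-tools'.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode a 16-bit value; byte_order is 1 (little endian) or 2 (big endian). */
static uint16_t decode_u16(const uint8_t *p, uint8_t byte_order)
{
    if (byte_order == 1)
        return (uint16_t)(p[0] | (p[1] << 8));
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: scan the dynamic header for the first record of
 * type `wanted`. Returns 1 and sets *value / *length if found, 0 if not
 * found, -1 if a record claims more bytes than remain.
 * ASSUMPTION: the length field covers the value bytes only.
 */
static int find_tlv(const uint8_t *dyn, size_t dyn_len, uint8_t byte_order,
                    uint16_t wanted, const uint8_t **value, uint16_t *length)
{
    size_t off = 0;

    /* Fewer than 4 bytes left cannot hold a type/length pair: padding. */
    while (dyn_len - off >= 4) {
        uint16_t t = decode_u16(dyn + off, byte_order);
        uint16_t l = decode_u16(dyn + off + 2, byte_order);
        off += 4;
        if (l > dyn_len - off)
            return -1;                  /* truncated record */
        if (t == wanted) {
            *value  = dyn + off;
            *length = l;
            return 1;
        }
        off += l;                       /* skip to the next record */
    }
    return 0;
}
```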

That concludes the header portion of a flow file. Following the header is the majority of the file, a list of records containing the actual flow data. Parsing these is complicated by a couple of factors:

The data may be compressed via zlib.
The record size varies based on the type of device which originally exported the flow.

The presence/absence of compression can be determined by consulting the parsed dynamic header. Specifically, bit FT_HEADER_FLAG_COMPRESS of the fttlv of type FT_TLV_HEADER_FLAGS will be set if the flow records are compressed. If so, you will need to zlib-inflate all of the data following the dynamic header.
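A sketch of the flag test, under a few assumptions: that the value of the FT_TLV_HEADER_FLAGS record is a 4-byte integer in the file's byte order, and that you've already located that value. EXAMPLE_FLAG_COMPRESS is a placeholder of my own; I believe the real FT_HEADER_FLAG_COMPRESS is 0x2, but check it against your copy of ftlib.h before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* PLACEHOLDER: substitute FT_HEADER_FLAG_COMPRESS from ftlib.h. */
#define EXAMPLE_FLAG_COMPRESS 0x2u

/* Decode a 32-bit value; byte_order is 1 (little endian) or 2 (big endian). */
static uint32_t decode_u32(const uint8_t *p, uint8_t byte_order)
{
    if (byte_order == 1)
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper: `value` and `length` describe the payload of the
 * header-flags record. Returns 1 if the compress bit is set, 0 if it is
 * clear, -1 if the payload is not the 4 bytes this sketch assumes.
 */
static int records_are_compressed(const uint8_t *value, uint16_t length,
                                  uint8_t byte_order)
{
    if (length != 4)
        return -1;
    uint32_t flags = decode_u32(value, byte_order);
    return (flags & EXAMPLE_FLAG_COMPRESS) != 0;
}
```

If this returns 1, everything from head_off_d onward would go through zlib's inflate before you try to pick out records.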

The size of a record is calculated via ftio_rec_size (ftio.c:2229). Looking at this function you see that this calculation is based on 4 variables:

Stream version (s_version): As noted above I'm assuming v3 streams.
Data(?) version (d_version): This is set during parsing of the FT_TLV_EX_VER record in the dynamic header. This value corresponds to the NetFlow version exported by the original exporting device (usually a router).
Aggregation method (agg_method): This is set during parsing of the FT_TLV_AGG_METHOD record in the dynamic header.
Aggregation version (agg_version): This is set during parsing of the FT_TLV_AGG_VERSION record in the dynamic header. In a well-formed flow file it will always be 2.

You only have to worry about aggregation method if you're dealing with NetFlow v8. If (d_version != 8) then the size of the record is sizeof fts3rec_v<d_version>. If (d_version == 8) then the size of the record is sizeof fts3rec_v8_<agg_method>. Note that these structures also map directly to the underlying sequence of bytes in the file.
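The naming rule is easy to get backwards, so here is a tiny sketch that just spells out which structure name applies for a given d_version and agg_method. It deliberately returns the name rather than a size, since the sizes depend on the struct definitions in ftlib.h; record_struct_name is a made-up helper.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the name of the ftlib.h record structure
 * for the given versions into `out`. Returns 0 on success, -1 if the
 * buffer is too small.
 */
static int record_struct_name(int d_version, int agg_method,
                              char *out, size_t out_len)
{
    int n;

    if (d_version == 8)
        /* NetFlow v8: the aggregation method picks the structure. */
        n = snprintf(out, out_len, "fts3rec_v8_%d", agg_method);
    else
        /* Everything else: the export version alone is enough. */
        n = snprintf(out, out_len, "fts3rec_v%d", d_version);

    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

In a real parser you'd more likely write a switch that returns sizeof the matching struct, which is roughly the job ftio_rec_size does.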

Once you've determined the compression status and the format you're dealing with, it's simply a matter of slurping successive records out of the file.
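For what it's worth, the slurping step can be sketched like this, assuming the record data (already inflated, if it was compressed) is sitting in memory and rec_size is the record size worked out from the header. record_at is a hypothetical helper of mine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return a pointer to the index-th fixed-size
 * record in `data`, or NULL once index runs past the last complete
 * record. A trailing partial record is ignored.
 */
static const uint8_t *record_at(const uint8_t *data, size_t data_len,
                                size_t rec_size, size_t index)
{
    if (rec_size == 0 || index >= data_len / rec_size)
        return NULL;
    return data + index * rec_size;
}
```

A caller would loop with an increasing index until record_at returns NULL, decoding each record's fields in the file's byte order as it goes.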

Sunday, January 11, 2009
