1.xx noindex
2.xx meta.description="data stream scan"
3.xx meta.keywords="compression data scan query"
4.MT 4
5.TL
6DSS: Data Stream Scan
7.AF "AT&T Labs Research - Florham Park NJ"
8.AU "Glenn Fowler <gsf@research.att.com>"
9.AU "Balachander Krishnamurthy <bala@research.att.com>"
10
11.H 1 Abstract
12.I dss
13(data stream scan) is a framework for describing, transforming,
14reading, querying, and writing streams of record oriented data.
15It is implemented as a command and library API.
16The API is extended by DLLs (shared libraries) that define data domain
17specific I/O, type and query functions.
18The goal is to provide a best in class repository for data scanning,
19along with up to date documentation as a side effect of
20coding to the API.
21Numerous large scale network applications have used
22.IR dss ,
23significantly reducing the amount of time spent in coding.
24Typical applications run terabytes of data through
25.I dss
26and have constructed custom queries very quickly.
27The key reasons for the success of
28.I dss
29are its unique architecture,
30generic template, and speed.
31In speed,
32.I dss
33compares extremely favorably against
34.BR perl (1),
35the typical recourse in the networking community, and against customized
36C/C++ code written to deal with a single domain dataset as well.
37The range of applications for which
38.I dss
39has been
40successfully applied over the last two years is wide and has involved
41large volumes of
42.I netflow
43data,
44.I BGP
45data, HTTP proxy and server
46logs, and OSPF LSA (Link State Advertisements.)
47Often the use of
48.I dss
49has led to fearless exploration of ideas
50across very large volumes of data.
51
52.H 1 Introduction
53Parsing is a critical part of record oriented data analysis.
54It involves describing the record structure and splitting the
55record fields into a form suitable for the analysis tools.
56Each tool may have its own idiosyncratic record description and field
57identifier syntax and format,
58and some tools may be better at hiding the details than others.
59For example, tools like
60.BR grep (1) ,
61.BR awk (1)
62and
63.BR perl (1)
64intermix low and high level definition details within the same program.
65Although relational databases provide high level abstractions and GUI
66interfaces to the end user, a database expert is usually involved with
67loading the data in the first place.
68.P
69It is not uncommon to have many analysis tools, each with an independent
70parser,
71applied to the same data within a single group of analysts.
72This in itself is not a bad situation.
73A relational database may be overkill for an application that needs
74to visit each record only once;
75hand coded programs may be inappropriate for data that is too large to fit
76in memory, or that may benefit from indexed access, or that may undergo many
77record format changes.
78However, subtle parsing differences between independent analyses can make
79comparisons difficult.
80The absense of read or load errors does not necessarily mean that the data
81was read correctly.
82Certainly data access bugs can be fixed as they are discovered, but in the
83worst case this could mean translating the fix to every analysis tool.
84In some cases it may even be impossible to recover from bugs that affected
85intermediate legacy data.
86.P
87.I dss
88provides a data abstraction that can be used to mitigate parsing differences
89between different analysis tools.
90The key is that the abstraction provides efficient (in space and time) IO
91support
92.I and
93a uniform interface for all data.
94The approach has multiple benefits:
95.BL
96.LI
97allows data analysts to focus on analyzing data
98.LI
99provides a target API for data IO coders
100.LI
101third parties familiar with the data abstraction model are
102automatically proficient in new data domains
103.LE
104.P
105.I dss
106did not originate in a vacuum -- it arose out of experience with
107.B BGP
108data analysis in our lab.
109A few in-house
110.B BGP
111tools gained popularity with the analysts.
112Each analyst stepped up with new
113.B BGP
114feeds and file formats:
115Ciso router table dumps,
116.B MRT
117format data, and ad-hoc flat files.
118Each new format prompted a recoding effort, which by this time meant
119changing a handful of related but slightly different implementations
120that had evolved over a four year period.
121Merging the
122.B BGP
123IO and parsing into a unified API was a big time saver, and also uncovered
124several embarassing bugs that had remained hidden for years.
125In addition, by employing best of class coding practices
126(i.e., releasing the inner hackers)
127the merged interface produced an efficient
128.B MRT
129parser that was 10 times faster than the one published by the originators
130of the format.
131By the time the analysts requested help on
132.B NETFLOW
133data the lesson had been learned, and
134.I dss
135was designed and implemented.
136
137.H 1 "Data Abstraction Model"
138.I dss
139partitions data stream access into components that
140form a uniform model for all data.
141The model guides both the implementation and user interfaces.
142It supports independent data methods, each with one or more physical
143formats.
144For example, Cisco router dump and
145.B MRT
146formats for the
147.B BGP
148method,
149and all versions of the Cisco dump format for the
150.B NETFLOW
151method.
152.P
153Each user visible item, whether in the base API or in DLL extensions,
154is placed in a dictionary.
155A dictionary entry has a name and descriptive text suitable
156for inclusion in a
157.BR man (1)
158page.
159This information can be listed by the runtime interfaces, so up to date
160documentation is always available.
161.P
162Figure 1 illustrates the component relationships in the
163.I dss
164model.
165Some components are part of the default
166.I dss
167library, but most are implemented as independent DLLs that
168are selected and loaded at runtime.
169The
170.I dss
171components are described below.
172
173.H 2 "TRANSFORM Components"
174Transforms are generic data independent filters that are applied
175to the raw data.
176They are automatically detected and applied on input, and are selectively
177applied on output, as determined by user and/or data specific options.
178.I dss
179currently supplies compression transforms for
180.IR gzip ,
181.IR pzip ,
182.I bzip
183and
184.IR compress ;
185delta (\fIfdelta\fP) and encryption support is planned.
186A transform may be implemented within the main
187.I dss
188process
189(e.g., the
190.BR sfio (3)
191.BR gzip (1)
192discipline)
193or as a separate process (e.g.,
194.I gzip
195or
196.I gunzip
197connected by a pipe.)
198The current vs. separate process choice may even be delayed until runtime.
199For example, on a multi-cpu architecture with idle processors, running
200.I gunzip
201as a separate process may take less real time than a single process
202implementation.
203.P
204Transforms allow data to be stored in the most efficient (or secure) manner.
205The data need only be converted when accessed.
206Given the volumes of available network data, compression or deltas,
207along with sampling, is essential for archiving.
208
209.H 2 "METHOD Components"
210Methods describe the data records for a specific data domain.
211A method must be specified before
212.I dss
213accesses any data.
214The method data description includes:
215.BL
216.LI
217A dictionary of file storage formats with:
218.BL
219.LI
220An identification function that automatically determines input formats.
221A format may be a different version (\fInetflow\fP versions 1, 5 and 7) or
222a completely different structure (\fIBGP\fP \fIMRT\fP files vs. \fICisco\fP
223router dumps.)
224.LI
225A record read function.
226This function also handles format specific headers and trailers.
227.LI
228A record write function.
229This function also handles format specific headers and trailers.
230.LE
231.LI
232A dictionary of record field types and names that includes the union of
233fields available in all supported formats.
234Fields absent in a particular format will have empty values.
235For example, the
236.I dss
237.B netflow
238method supports all popular netflow formats.
239The same queries can be used on any of the
240.I netflow
241formats.
242The
243.B "source mask"
244field, not available in the version 1 format, is still available for
245version 1 queries, but will have a default value of 0.
246If a new
247.B foo
248field were to be introduced in the version 101 format then it would simply
249be added to the
250.B netflow
251method dictionary but would have an empty value for all but version 101
252(and newer) data.
253.LI
254An optional canonical record that provides access to a C structure
255representation of each record.
256A method that provides a canonical record can simplify ad-hoc C queries.
257The ad-hoc query can be implemented as a
258.I dss
259.B QUERY
260component function.
261This function is called for each record in the input data.
262.LE
263
264.H 2 "TYPE Components"
265A
266.I dss
267type provides functions that convert data between internal and external data
268representations.
269Some types may also have a base type that provides alternate access to the
270same data.
271Non-numeric constants are represented as \f5'...'\fP or \f5"..."\fP quoted
272strings.
273.P
274For example, the
275.B ipaddr_t
276type in the
277.I dss
278.B ip_t
279type library implements an IP address as a 32 bit unsigned integer
280(the base type), and converts between the integer representation and
281quoted dotted-quad and DNS names.
282.P
283A type may optionally provide a match function that overrides the default
284regular expression string match.
285For example, the
286.B ipaddr_t
287type match function does IP prefix matching and the
288.B aspath_t
289type match function does AS path regular expression matching that treats
290AS numbers as individual tokens (as opposed to the inefficient alternative
291of converting the AS path to a string and applying a string regular
292expression match.)
293
294.H 2 "QUERY Components"
295Queries are the user visible part of
296.IR dss .
297They are used to select, filter, and summarize data streams.
298There are two forms of queries.
299Interpreted queries provide named access to the data fields using
300C style operators.
301Dynamic queries provide
302.I UNIX
303style command access to the record stream.
304Queries may be composed in a manner similar to a
305.I UNIX
306pipeline.
307The main difference is that a pointer to the current record, as opposed to
308the actual record data, is passed between query components.
309Interpreted queries are generic and may be used with any method.
310Dynamic queries may be generic or method specific; method specific queries
311produce a diagnostic when used with the wrong input method.
312
313.H 1 "dss Library Interface"
314The library provides two main interfaces for handling data:
315.BL
316.LI
317C style expression interpreter:
318.BL
319.LI
320basic
321.BR number ,
322.B string
323and
324.B reference
325types
326.LI
327compile time callout functions for expression optimization and overloads
328.LI
329execution time callout functions for all operators
330.LE
331.LI
332domain specific methods:
333.BL
334.LI
335data IO functions
336.LI
337data types and conversion functions
338.LI
339specific data field types and names
340.LI
341method types for handling domain data in different formats
342.LE
343.LE
344The library applies
345.BR gzip (1)
346and
347.BR pzip (1)
348decompression as needed for all methods.
349.P
350Applications based on the library can:
351.BL
352.LI
353perform IO on the data stream
354.LI
355report information about the data stream
356.LI
357evaluate expressions to filter records from the stream
358.LI
359format output for reports or further processing
360.LE
361
362.H 1 "dss Command Interface"
363The
364.BR dss (1)
365command provides a command level interface to the
366.BR dss (3)
367library.
368.EX
369dss --man
370.EE
371lists the
372.BR man (1)
373page on the standard error.
374.EX
375dss --?method
376.EE
377lists the available methods; methods are implemented in DLLs
378found on sibling directories of
379.BR $PATH .
380.EX
381dss -x foo --man
382.EE
383provides details on the
384.I foo
385method.
386
387.H 1 "Performance"
388Real time for typical
389.I netflow
390data queries.
391.TS
392center allbox;
393b b b b b
394l l n n n.
395Command	Window	Records	Size	Time
396interpreted	10 min	1,715,850	-	0m8.27s
397library	10 min	1,715,850	-	0m4.27s
398interpreted	1 day	229,448,460	27Gb	17m34s
399library	1 day	229,448,460	27Gb	9m23s
400ascii pipe	1 day	229,448,460	27Gb	56m43s
401.TE
402
403.H 1 REFERENCES
404.VL 1i
405.LI
406.xx link="../publications/dss-2004.pdf	DSS: Data Stream Scan",
407with Balachander Kirshnamurthy.
408.LE
409