1.xx noindex 2.xx meta.description="data stream scan" 3.xx meta.keywords="compression data scan query" 4.MT 4 5.TL 6DSS: Data Stream Scan 7.AF "AT&T Labs Research - Florham Park NJ" 8.AU "Glenn Fowler <gsf@research.att.com>" 9.AU "Balachander Krishnamurthy <bala@research.att.com>" 10 11.H 1 Abstract 12.I dss 13(data stream scan) is a framework for describing, transforming, 14reading, querying, and writing streams of record oriented data. 15It is implemented as a command and library API. 16The API is extended by DLLs (shared libraries) that define data domain 17specific I/O, type and query functions. 18The goal is to provide a best in class repository for data scanning, 19along with up to date documentation as a side effect of 20coding to the API. 21Numerous large scale network applications have used 22.IR dss , 23significantly reducing the amount of time spent in coding. 24Typical applications run terabytes of data through 25.I dss 26and have constructed custom queries very quickly. 27The key reasons for the success of 28.I dss 29are its unique architecture, 30generic template, and speed. 31In speed, 32.I dss 33compares extremely favorably against 34.BR perl (1), 35the typical recourse in the networking community, and against customized 36C/C++ code written to deal with a single domain dataset as well. 37The range of applications for which 38.I dss 39has been 40successfully applied over the last two years is wide and has involved 41large volumes of 42.I netflow 43data, 44.I BGP 45data, HTTP proxy and server 46logs, and OSPF LSA (Link State Advertisements.) 47Often the use of 48.I dss 49has led to fearless exploration of ideas 50across very large volumes of data. 51 52.H 1 Introduction 53Parsing is a critical part of record oriented data analysis. 54It involves describing the record structure and splitting the 55record fields into a form suitable for the analysis tools. 56Each tool may have its own idiosyncratic record description and field 57identifier syntax and format, 58and some tools may be better at hiding the details than others. 59For example, tools like 60.BR grep (1) , 61.BR awk (1) 62and 63.BR perl (1) 64intermix low and high level definition details within the same program. 65Although relational databases provide high level abstractions and GUI 66interfaces to the end user, a database expert is usually involved with 67loading the data in the first place. 68.P 69It is not uncommon to have many analysis tools, each with an independent 70parser, 71applied to the same data within a single group of analysts. 72This in itself is not a bad situation. 73A relational database may be overkill for an application that needs 74to visit each record only once; 75hand coded programs may be inappropriate for data that is too large to fit 76in memory, or that may benefit from indexed access, or that may undergo many 77record format changes. 78However, subtle parsing differences between independent analyses can make 79comparisons difficult. 80The absense of read or load errors does not necessarily mean that the data 81was read correctly. 82Certainly data access bugs can be fixed as they are discovered, but in the 83worst case this could mean translating the fix to every analysis tool. 84In some cases it may even be impossible to recover from bugs that affected 85intermediate legacy data. 86.P 87.I dss 88provides a data abstraction that can be used to mitigate parsing differences 89between different analysis tools. 90The key is that the abstraction provides efficient (in space and time) IO 91support 92.I and 93a uniform interface for all data. 94The approach has multiple benefits: 95.BL 96.LI 97allows data analysts to focus on analyzing data 98.LI 99provides a target API for data IO coders 100.LI 101third parties familiar with the data abstraction model are 102automatically proficient in new data domains 103.LE 104.P 105.I dss 106did not originate in a vacuum -- it arose out of experience with 107.B BGP 108data analysis in our lab. 109A few in-house 110.B BGP 111tools gained popularity with the analysts. 112Each analyst stepped up with new 113.B BGP 114feeds and file formats: 115Ciso router table dumps, 116.B MRT 117format data, and ad-hoc flat files. 118Each new format prompted a recoding effort, which by this time meant 119changing a handful of related but slightly different implementations 120that had evolved over a four year period. 121Merging the 122.B BGP 123IO and parsing into a unified API was a big time saver, and also uncovered 124several embarassing bugs that had remained hidden for years. 125In addition, by employing best of class coding practices 126(i.e., releasing the inner hackers) 127the merged interface produced an efficient 128.B MRT 129parser that was 10 times faster than the one published by the originators 130of the format. 131By the time the analysts requested help on 132.B NETFLOW 133data the lesson had been learned, and 134.I dss 135was designed and implemented. 136 137.H 1 "Data Abstraction Model" 138.I dss 139partitions data stream access into components that 140form a uniform model for all data. 141The model guides both the implementation and user interfaces. 142It supports independent data methods, each with one or more physical 143formats. 144For example, Cisco router dump and 145.B MRT 146formats for the 147.B BGP 148method, 149and all versions of the Cisco dump format for the 150.B NETFLOW 151method. 152.P 153Each user visible item, whether in the base API or in DLL extensions, 154is placed in a dictionary. 155A dictionary entry has a name and descriptive text suitable 156for inclusion in a 157.BR man (1) 158page. 159This information can be listed by the runtime interfaces, so up to date 160documentation is always available. 161.P 162Figure 1 illustrates the component relationships in the 163.I dss 164model. 165Some components are part of the default 166.I dss 167library, but most are implemented as independent DLLs that 168are selected and loaded at runtime. 169The 170.I dss 171components are described below. 172 173.H 2 "TRANSFORM Components" 174Transforms are generic data independent filters that are applied 175to the raw data. 176They are automatically detected and applied on input, and are selectively 177applied on output, as determined by user and/or data specific options. 178.I dss 179currently supplies compression transforms for 180.IR gzip , 181.IR pzip , 182.I bzip 183and 184.IR compress ; 185delta (\fIfdelta\fP) and encryption support is planned. 186A transform may be implemented within the main 187.I dss 188process 189(e.g., the 190.BR sfio (3) 191.BR gzip (1) 192discipline) 193or as a separate process (e.g., 194.I gzip 195or 196.I gunzip 197connected by a pipe.) 198The current vs. separate process choice may even be delayed until runtime. 199For example, on a multi-cpu architecture with idle processors, running 200.I gunzip 201as a separate process may take less real time than a single process 202implementation. 203.P 204Transforms allow data to be stored in the most efficient (or secure) manner. 205The data need only be converted when accessed. 206Given the volumes of available network data, compression or deltas, 207along with sampling, is essential for archiving. 208 209.H 2 "METHOD Components" 210Methods describe the data records for a specific data domain. 211A method must be specified before 212.I dss 213accesses any data. 214The method data description includes: 215.BL 216.LI 217A dictionary of file storage formats with: 218.BL 219.LI 220An identification function that automatically determines input formats. 221A format may be a different version (\fInetflow\fP versions 1, 5 and 7) or 222a completely different structure (\fIBGP\fP \fIMRT\fP files vs. \fICisco\fP 223router dumps.) 224.LI 225A record read function. 226This function also handles format specific headers and trailers. 227.LI 228A record write function. 229This function also handles format specific headers and trailers. 230.LE 231.LI 232A dictionary of record field types and names that includes the union of 233fields available in all supported formats. 234Fields absent in a particular format will have empty values. 235For example, the 236.I dss 237.B netflow 238method supports all popular netflow formats. 239The same queries can be used on any of the 240.I netflow 241formats. 242The 243.B "source mask" 244field, not available in the version 1 format, is still available for 245version 1 queries, but will have a default value of 0. 246If a new 247.B foo 248field were to be introduced in the version 101 format then it would simply 249be added to the 250.B netflow 251method dictionary but would have an empty value for all but version 101 252(and newer) data. 253.LI 254An optional canonical record that provides access to a C structure 255representation of each record. 256A method that provides a canonical record can simplify ad-hoc C queries. 257The ad-hoc query can be implemented as a 258.I dss 259.B QUERY 260component function. 261This function is called for each record in the input data. 262.LE 263 264.H 2 "TYPE Components" 265A 266.I dss 267type provides functions that convert data between internal and external data 268representations. 269Some types may also have a base type that provides alternate access to the 270same data. 271Non-numeric constants are represented as \f5'...'\fP or \f5"..."\fP quoted 272strings. 273.P 274For example, the 275.B ipaddr_t 276type in the 277.I dss 278.B ip_t 279type library implements an IP address as a 32 bit unsigned integer 280(the base type), and converts between the integer representation and 281quoted dotted-quad and DNS names. 282.P 283A type may optionally provide a match function that overrides the default 284regular expression string match. 285For example, the 286.B ipaddr_t 287type match function does IP prefix matching and the 288.B aspath_t 289type match function does AS path regular expression matching that treats 290AS numbers as individual tokens (as opposed to the inefficient alternative 291of converting the AS path to a string and applying a string regular 292expression match.) 293 294.H 2 "QUERY Components" 295Queries are the user visible part of 296.IR dss . 297They are used to select, filter, and summarize data streams. 298There are two forms of queries. 299Interpreted queries provide named access to the data fields using 300C style operators. 301Dynamic queries provide 302.I UNIX 303style command access to the record stream. 304Queries may be composed in a manner similar to a 305.I UNIX 306pipeline. 307The main difference is that a pointer to the current record, as opposed to 308the actual record data, is passed between query components. 309Interpreted queries are generic and may be used with any method. 310Dynamic queries may be generic or method specific; method specific queries 311produce a diagnostic when used with the wrong input method. 312 313.H 1 "dss Library Interface" 314The library provides two main interfaces for handling data: 315.BL 316.LI 317C style expression interpreter: 318.BL 319.LI 320basic 321.BR number , 322.B string 323and 324.B reference 325types 326.LI 327compile time callout functions for expression optimization and overloads 328.LI 329execution time callout functions for all operators 330.LE 331.LI 332domain specific methods: 333.BL 334.LI 335data IO functions 336.LI 337data types and conversion functions 338.LI 339specific data field types and names 340.LI 341method types for handling domain data in different formats 342.LE 343.LE 344The library applies 345.BR gzip (1) 346and 347.BR pzip (1) 348decompression as needed for all methods. 349.P 350Applications based on the library can: 351.BL 352.LI 353perform IO on the data stream 354.LI 355report information about the data stream 356.LI 357evaluate expressions to filter records from the stream 358.LI 359format output for reports or further processing 360.LE 361 362.H 1 "dss Command Interface" 363The 364.BR dss (1) 365command provides a command level interface to the 366.BR dss (3) 367library. 368.EX 369dss --man 370.EE 371lists the 372.BR man (1) 373page on the standard error. 374.EX 375dss --?method 376.EE 377lists the available methods; methods are implemented in DLLs 378found on sibling directories of 379.BR $PATH . 380.EX 381dss -x foo --man 382.EE 383provides details on the 384.I foo 385method. 386 387.H 1 "Performance" 388Real time for typical 389.I netflow 390data queries. 391.TS 392center allbox; 393b b b b b 394l l n n n. 395Command Window Records Size Time 396interpreted 10 min 1,715,850 - 0m8.27s 397library 10 min 1,715,850 - 0m4.27s 398interpreted 1 day 229,448,460 27Gb 17m34s 399library 1 day 229,448,460 27Gb 9m23s 400ascii pipe 1 day 229,448,460 27Gb 56m43s 401.TE 402 403.H 1 REFERENCES 404.VL 1i 405.LI 406.xx link="../publications/dss-2004.pdf DSS: Data Stream Scan", 407with Balachander Kirshnamurthy. 408.LE 409