xsv is a command line program for indexing, slicing, analyzing, splitting
and joining CSV files. Commands should be simple, fast and composable:

1. Simple tasks should be easy.
2. Performance trade-offs should be exposed in the CLI interface.
3. Composition should not come at the expense of performance.

This README contains information on how to
[install `xsv`](https://github.com/BurntSushi/xsv#installation), in addition to
a quick tour of several commands.

[![Linux build status](https://api.travis-ci.org/BurntSushi/xsv.svg)](https://travis-ci.org/BurntSushi/xsv)
[![Windows build status](https://ci.appveyor.com/api/projects/status/github/BurntSushi/xsv?svg=true)](https://ci.appveyor.com/project/BurntSushi/xsv)
[![](http://meritbadge.herokuapp.com/xsv)](https://crates.io/crates/xsv)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).


### Available commands

* **cat** - Concatenate CSV files by row or by column.
* **count** - Count the rows in a CSV file. (Instantaneous with an index.)
* **fixlengths** - Force a CSV file to have same-length records by either
  padding or truncating them.
* **flatten** - A flattened view of CSV records. Useful for viewing one record
  at a time, e.g., `xsv slice -i 5 data.csv | xsv flatten`.
* **fmt** - Reformat CSV data with different delimiters, record terminators
  or quoting rules. (Supports ASCII delimited data.)
* **frequency** - Build frequency tables of each column in CSV data. (Uses
  parallelism to go faster if an index is present.)
* **headers** - Show the headers of CSV data. Or show the intersection of all
  headers between many CSV files.
* **index** - Create an index for a CSV file. This is very quick and provides
  constant time indexing into the CSV file.
* **input** - Read CSV data with exotic quoting/escaping rules.
* **join** - Inner, outer and cross joins. Uses a simple hash index to make it
  fast.
* **sample** - Randomly draw rows from CSV data using reservoir sampling (i.e.,
  use memory proportional to the size of the sample).
* **search** - Run a regex over CSV data. Applies the regex to each field
  individually and shows only matching rows.
* **select** - Select or re-order columns from CSV data.
* **slice** - Slice rows from any part of a CSV file. When an index is present,
  this only has to parse the rows in the slice (instead of all rows leading up
  to the start of the slice).
* **sort** - Sort CSV data.
* **split** - Split one CSV file into many CSV files of N records each.
* **stats** - Show basic types and statistics of each column in the CSV file.
  (i.e., mean, standard deviation, median, range, etc.)
* **table** - Show aligned output of any CSV data using
  [elastic tabstops](https://github.com/BurntSushi/tabwriter).

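Most of these commands read CSV from a file or stdin and write CSV to stdout,
so they compose over ordinary Unix pipes. Here is a minimal sketch of such a
pipeline; `cities.csv` is a placeholder file, and the `xsv sort` flags shown
(`-s` to choose the sort column, `-N` for a numeric sort) are assumptions to
be checked against `xsv sort --help`:

```bash
# Hypothetical pipeline: keep three columns, sort numerically by Population,
# and align the output for reading.
$ xsv select Country,AccentCity,Population cities.csv \
  | xsv sort -N -s Population \
  | xsv table
```
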

### A whirlwind tour

Let's say you're playing with some of the data from the
[Data Science Toolkit](https://github.com/petewarden/dstkdata), which contains
several CSV files. Maybe you're interested in the population counts of each
city in the world. So grab the data and start examining it:

```bash
$ curl -LO http://burntsushi.net/stuff/worldcitiespop.csv
$ xsv headers worldcitiespop.csv
1   Country
2   City
3   AccentCity
4   Region
5   Population
6   Latitude
7   Longitude
```

The next thing you might want to do is get an overview of the kind of data that
appears in each column. The `stats` command will do this for you:

```bash
$ xsv stats worldcitiespop.csv --everything | xsv table
field       type     min            max            min_length  max_length  mean          stddev         median     mode         cardinality
Country     Unicode  ad             zw             2           2                                                   cn           234
City        Unicode   bab el ahmar  Þykkvibaer     1           91                                                  san jose     2351892
AccentCity  Unicode   Bâb el Ahmar  ïn Bou Chella  1           91                                                  San Antonio  2375760
Region      Unicode  00             Z9             0           2                                        13         04           397
Population  Integer  7              31480498       0           8           47719.570634  302885.559204  10779                   28754
Latitude    Float    -54.933333     82.483333      1           12          27.188166     21.952614      32.497222  51.15        1038349
Longitude   Float    -179.983333    180            1           14          37.08886      63.22301       35.28      23.8         1167162
```

The `xsv table` command takes any CSV data and formats it into aligned columns
using [elastic tabstops](https://github.com/BurntSushi/tabwriter). You'll
notice that it even gets alignment right with respect to Unicode characters.

So, this command takes about 12 seconds to run on my machine, but we can speed
it up by creating an index and re-running the command:

```bash
$ xsv index worldcitiespop.csv
$ xsv stats worldcitiespop.csv --everything | xsv table
...
```

That cuts it down to about 8 seconds on my machine. (And creating the index
takes less than 2 seconds.)

Notably, the same type of "statistics" command in another
[CSV command line toolkit](https://csvkit.readthedocs.io/)
takes about 2 minutes to produce similar statistics on the same data set.

Creating an index gives us more than just faster statistics gathering. It also
makes slice operations extremely fast because *only the sliced portion* has to
be parsed. For example, let's say you wanted to grab the last 10 records:

```bash
$ xsv count worldcitiespop.csv
3173958
$ xsv slice worldcitiespop.csv -s 3173948 | xsv table
Country  City               AccentCity         Region  Population  Latitude     Longitude
zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
zw       ziga               Ziga               06                  -19.2166667  27.4833333
zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333
```

These commands are *instantaneous* because they run in time and memory
proportional to the size of the slice (which means they will scale to
arbitrarily large CSV data).
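
The slice above only used `--start`/`-s`. As a further sketch, a slice can
also be taken as a fixed-length window or as a single record: the `-i` flag
appears in the command list earlier, while `-l`/`--len` is an assumption to be
checked against `xsv slice --help`.

```bash
# A hypothetical five-record window starting at record 1,000,000 (assumes -l/--len).
$ xsv slice -s 1000000 -l 5 worldcitiespop.csv | xsv table
# A single record, shown one field per line (as in the command list above).
$ xsv slice -i 1000000 worldcitiespop.csv | xsv flatten
```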

Switching gears a little bit, you might not always want to see every column in
the CSV data. In this case, maybe we only care about the country, city and
population. So let's take a look at 10 random rows:

```bash
$ xsv select Country,AccentCity,Population worldcitiespop.csv \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
cn       Guankoushang
za       Klipdrift
ma       Ouled Hammou
fr       Les Gravues
la       Ban Phadèng
de       Lüdenscheid      80045
qa       Umm ash Shubrum
bd       Panditgoan
us       Appleton
ua       Lukashenkivske
```

Whoops! It seems some cities don't have population counts. How pervasive is
that?

```bash
$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459
```

(The `xsv frequency` command builds a frequency table for each column in the
CSV data. This one only took 5 seconds.)
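
If the full table is more than you need, the frequency counts can be narrowed
to particular columns. The sketch below reuses `--limit` from the command
above; the `-s`/`--select` flag is an assumption to be checked against
`xsv frequency --help`:

```bash
# A sketch: top three values of the Country column only (assumes -s/--select).
$ xsv frequency -s Country --limit 3 worldcitiespop.csv | xsv table
```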

So it seems that most cities do not have a population count associated with
them at all. No matter, we can adjust our previous command so that it only
shows rows with a population count:

```bash
$ xsv search -s Population '[0-9]' worldcitiespop.csv \
  | xsv select Country,AccentCity,Population \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
es       Barañáin         22264
es       Puerto Real      36946
at       Moosburg         4602
hu       Hejobaba         1949
ru       Polyarnyye Zori  15092
gr       Kandíla          1245
is       Ólafsvík         992
hu       Decs             4210
bg       Sliven           94252
gb       Leatherhead      43544
```

Erk. Which country is `at`? No clue, but the Data Science Toolkit has a CSV
file called `countrynames.csv`. Let's grab it and do a join so we can see which
countries these are (the 10 rows sampled above were saved to `sample.csv`):

```bash
$ curl -LO https://gist.githubusercontent.com/anonymous/063cb470e56e64e98cf1/raw/98e2589b801f6ca3ff900b01a87fbb7452eb35c7/countrynames.csv
$ xsv headers countrynames.csv
1   Abbrev
2   Country
$ xsv join --no-case Country sample.csv Abbrev countrynames.csv | xsv table
Country  AccentCity       Population  Abbrev  Country
es       Barañáin         22264       ES      Spain
es       Puerto Real      36946       ES      Spain
at       Moosburg         4602        AT      Austria
hu       Hejobaba         1949        HU      Hungary
ru       Polyarnyye Zori  15092       RU      Russian Federation | Russia
gr       Kandíla          1245        GR      Greece
is       Ólafsvík         992         IS      Iceland
hu       Decs             4210        HU      Hungary
bg       Sliven           94252       BG      Bulgaria
gb       Leatherhead      43544       GB      Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom
```

Whoops, now we have two columns called `Country` and an `Abbrev` column that we
no longer need. This is easy to fix by re-ordering columns with the `xsv
select` command:

```bash
$ xsv join --no-case Country sample.csv Abbrev countrynames.csv \
  | xsv select 'Country[1],AccentCity,Population' \
  | xsv table
Country                                                                              AccentCity       Population
Spain                                                                                Barañáin         22264
Spain                                                                                Puerto Real      36946
Austria                                                                              Moosburg         4602
Hungary                                                                              Hejobaba         1949
Russian Federation | Russia                                                          Polyarnyye Zori  15092
Greece                                                                               Kandíla          1245
Iceland                                                                              Ólafsvík         992
Hungary                                                                              Decs             4210
Bulgaria                                                                             Sliven           94252
Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom  Leatherhead      43544
```

Perhaps we can do this with the original CSV data? Indeed we can, because
joins in `xsv` are fast.

```bash
$ xsv join --no-case Abbrev countrynames.csv Country worldcitiespop.csv \
  | xsv select '!Abbrev,Country[1]' \
  > worldcitiespop_countrynames.csv
$ xsv sample 10 worldcitiespop_countrynames.csv | xsv table
Country                      City                   AccentCity             Region  Population  Latitude    Longitude
Sri Lanka                    miriswatte             Miriswatte             36                  7.2333333   79.9
Romania                      livezile               Livezile               26      1985        44.512222   22.863333
Indonesia                    tawainalu              Tawainalu              22                  -4.0225     121.9273
Russian Federation | Russia  otar                   Otar                   45                  56.975278   48.305278
France                       le breuil-bois robert  le Breuil-Bois Robert  A8                  48.945567   1.717026
France                       lissac                 Lissac                 B1                  45.103094   1.464927
Albania                      lumalasi               Lumalasi               46                  40.6586111  20.7363889
China                        motzushih              Motzushih              11                  27.65       111.966667
Russian Federation | Russia  svakino                Svakino                69                  55.60211    34.559785
Romania                      tirgu pancesti         Tirgu Pancesti         38                  46.216667   27.1
```

The `!Abbrev,Country[1]` syntax means, "remove the `Abbrev` column and remove
the second occurrence of the `Country` column." Since we joined with
`countrynames.csv` first, the first `Country` name (fully expanded) is now
included in the CSV data.
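
The selector accepts a few other forms besides names, `!` and `[i]`. The ones
below are assumptions (they are not used elsewhere in this README), so verify
them against `xsv select --help`:

```bash
# Columns by 1-based position (assumed form).
$ xsv select 1,3 worldcitiespop.csv
# An inclusive range of named columns (assumed form).
$ xsv select Country-AccentCity worldcitiespop.csv
# Everything except the listed columns, as with '!Abbrev,Country[1]' above.
$ xsv select '!Region,Latitude,Longitude' worldcitiespop.csv
```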
This `xsv join` command takes about 7 seconds on my machine. The performance
comes from constructing a very simple hash index of one of the CSV data files
given. The `join` command does an inner join by default; left, right and full
outer joins are also supported.
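
The outer joins are not shown in this tour. As a rough sketch, a left outer
join would keep every city row even when its country code has no match in
`countrynames.csv`; the `--left` flag is an assumption based on the join types
described above, so check `xsv join --help`:

```bash
# Hypothetical left outer join: unmatched city rows are kept, with empty
# country-name fields.
$ xsv join --no-case --left Country worldcitiespop.csv Abbrev countrynames.csv \
  > worldcitiespop_left.csv
```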


### Installation

Binaries for Windows, Linux and Mac are available [from Github](https://github.com/BurntSushi/xsv/releases/latest).

If you're a **Mac OS X Homebrew** user, then you can install xsv
from homebrew-core:

```bash
$ brew install xsv
```

If you're a **Nix/NixOS** user, you can install xsv from nixpkgs:

```bash
$ nix-env -i xsv
```

Alternatively, you can compile from source by
[installing Cargo](https://crates.io/install)
([Rust's](http://www.rust-lang.org/) package manager)
and installing `xsv` using Cargo:

```bash
$ cargo install xsv
```

Compiling from this repository also works similarly:

```bash
$ git clone git://github.com/BurntSushi/xsv
$ cd xsv
$ cargo build --release
```

Compilation will probably take a few minutes depending on your machine. The
binary will end up in `./target/release/xsv`.
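
Assuming the build succeeds, the binary can be run in place without any
install step (`--version` and `--help` are the usual docopt-style flags, but
treat them as assumptions for your build):

```bash
# Quick sanity check of the freshly built binary.
$ ./target/release/xsv --version
$ ./target/release/xsv --help
```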


### Benchmarks

I've compiled some [very rough
benchmarks](https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md) of
various `xsv` commands.


### Motivation

Here are several valid criticisms of this project:

1. You shouldn't be working with CSV data because CSV is a terrible format.
2. If your data is gigabytes in size, then CSV is the wrong storage type.
3. Various SQL databases provide all of the operations available in `xsv` with
   more sophisticated indexing support. And the performance is a zillion times
   better.

I'm sure there are more criticisms, but the impetus for this project was a 40GB
CSV file that was handed to me. I was tasked with figuring out the shape of the
data inside of it and coming up with a way to integrate it into our existing
system. It was then that I realized that every single CSV tool I knew about was
woefully inadequate. They were just too slow or didn't provide enough
flexibility. (Another project of mine comprised a few dozen CSV files. They
were smaller than 40GB, but they were each supposed to represent the same kind
of data. Yet they all had different and unintuitive column names. Useful CSV
inspection tools were critical here, and they had to be reasonably fast.)

The key ingredients for helping me with my task were indexing, random sampling,
searching, slicing and selecting columns. All of these things made dealing with
40GB of CSV data (or dozens of CSV files) a bit more manageable.

Getting handed a large CSV file *once* was enough to launch me on this quest.
From conversations I've had with others, CSV data files this large don't seem
to be a rare event. Therefore, I believe there is room for a tool that has a
hope of dealing with data that large.