This directory includes some useful tools:

1. Subset selection tools
2. Parameter selection tools
3. LIBSVM format checking tools

Part I: Subset selection tools

Introduction
============

Training on a large data set is time-consuming, so it is often useful to
work on a smaller subset first. The Python script subset.py randomly
selects a specified number of samples. For classification data, it
provides stratified selection so that the subset keeps the same class
distribution as the full data set.

Usage: subset.py [options] dataset number [output1] [output2]

This script selects a subset of the given data set.

options:
-s method : method of selection (default 0)
     0 -- stratified selection (classification only)
     1 -- random selection

output1 : the subset (optional)
output2 : the rest of data (optional)

If output1 is omitted, the subset will be printed on the screen.

Example
=======

> python subset.py heart_scale 100 file1 file2

With the default stratified method, 100 samples are randomly selected
from heart_scale and stored in file1. All remaining instances are
stored in file2.


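To illustrate what stratified selection means, here is a minimal sketch.
It is not the code of subset.py; the function name stratified_subset and
the rounding rule are assumptions made for this example. Each class
contributes a share of the subset proportional to its size, so the result
may differ from the requested number by a few samples when the shares do
not divide evenly.

```python
import random
from collections import defaultdict


def stratified_subset(labels, subset_size, seed=0):
    """Return sorted indices of a subset with roughly the original class ratio.

    Hypothetical helper for illustration only; subset.py may differ.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    total = len(labels)
    chosen = []
    for indices in by_class.values():
        # Proportional share of the subset for this class, rounded.
        share = round(len(indices) * subset_size / total)
        chosen.extend(rng.sample(indices, min(share, len(indices))))
    return sorted(chosen)


# 60 positive and 40 negative labels; ask for 10 samples.
labels = [1] * 60 + [-1] * 40
subset = stratified_subset(labels, 10)
positives = sum(1 for i in subset if labels[i] == 1)
print(len(subset), positives)
```

With a 60/40 class split and a subset of 10, the sketch picks 6 samples
from the larger class and 4 from the smaller one.
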
Part II: Parameter selection tools

Introduction
============

grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses cross validation (CV)
to estimate the accuracy of each parameter combination in the
specified range and helps you decide on the best parameters for your
problem.

grid.py directly executes the libsvm binaries (so no Python binding is
needed) for cross validation and then draws a contour plot of CV
accuracy using gnuplot. You must have libsvm and gnuplot installed
before using it. gnuplot is available at http://www.gnuplot.info/

On Mac OS X, the precompiled gnuplot binary needs the AquaTerm library,
which must therefore be installed as well. In addition, this version of
gnuplot does not support png, so you need to change "set term png
transparent small" to use another image format. For example, you may
use "set term pbm small color".

Usage: grid.py [grid_options] [svm_options] dataset

grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
    pathname -- set gnuplot executable path and name
    "null"   -- do not plot
-out {pathname | "null"} : (default dataset.out)
    pathname -- set output file path and name
    "null"   -- do not output file
-png pathname : set graphic output file path and name (default dataset.png)
-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
    Use this option only if some parameters have been checked for the SAME data.

svm_options : additional options for svm-train

The program conducts v-fold cross validation using parameter C (and gamma)
= 2^begin, 2^(begin+step), ..., 2^end.

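The begin,end,step notation can be made concrete with a small sketch.
The helper name expand_range is an assumption for this example and is not
taken from grid.py; it only shows how the exponents expand and how many
(C, gamma) pairs the default ranges produce.

```python
def expand_range(begin, end, step):
    """List begin, begin+step, ... without passing end (step may be negative).

    Hypothetical helper for illustration only.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    values = []
    current = begin
    while (step > 0 and current <= end) or (step < 0 and current >= end):
        values.append(current)
        current += step
    return values


# Default ranges from the usage text above.
log2c = expand_range(-5, 15, 2)    # -5, -3, ..., 15
log2g = expand_range(3, -15, -2)   # 3, 1, ..., -15

# Every (C, gamma) pair that would be cross-validated.
grid = [(2.0 ** c, 2.0 ** g) for c in log2c for g in log2g]
print(len(log2c), len(log2g), len(grid))  # prints: 11 10 110
```

With the defaults, the grid therefore contains 110 parameter
combinations, each of which requires one v-fold cross validation run.
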
You can specify where the libsvm executable and gnuplot are using the
-svmtrain and -gnuplot parameters.

For Windows users, please use pgnuplot.exe. If you are using gnuplot
3.7.1, please upgrade to version 3.7.3 or higher, because version
3.7.1 has a bug. If you use Cygwin on Windows, please use gnuplot-x11.

If the task is terminated accidentally or you would like to change the
range of parameters, you can apply '-resume' to save time by re-using
previous results. You may specify the output file of a previous run
or use the default (i.e., dataset.out) without giving a name. Please
note that the same conditions must be used in both runs. For example,
you cannot use '-v 10' earlier and resume the task with '-v 5'.

The value of some options can be "null". For example, `-log2c -1,0,1
-log2g "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma
value. That is, you do not conduct parameter selection on gamma.

Example
=======

> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale

Here -m 300 is an svm-train option (cache size in MB) that is passed
through as one of the svm_options.

Users (in particular MS Windows users) may need to specify the path of
executable files. You can either change the paths at the beginning of
grid.py or specify them on the command line. For example,

> grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale

Output: two files
dataset.png: the CV accuracy contour plot generated by gnuplot
dataset.out: the CV accuracy at each (log2(C),log2(gamma))

The following example saves running time by loading the output file of a previous run.

> python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale

Parallel grid search
====================

You can conduct a parallel grid search by dispatching jobs to a
cluster of computers that share the same file system. First, add the
machine names in grid.py:

ssh_workers = ["linux1", "linux5", "linux5"]

and then set up ssh so that authentication works without asking for a
password.

The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or more RAM. If the local machine is the best,
you can also increase nr_local_worker. For example:

nr_local_worker = 2

Example:

> python grid.py heart_scale
[local] -1 -1 78.8889  (best c=0.5, g=0.5, rate=78.8889)
[linux5] -1 -7 83.3333  (best c=0.5, g=0.0078125, rate=83.3333)
[linux5] 5 -1 77.037  (best c=0.5, g=0.0078125, rate=83.3333)
[linux1] 5 -7 83.3333  (best c=0.5, g=0.0078125, rate=83.3333)
.
.
.

If -log2c, -log2g, or -v is not specified, default values are used.

If your system uses telnet instead of ssh, list the computer names
in telnet_workers.

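The idea of listing a machine more than once can be illustrated with a
generic job-queue sketch. This is not how grid.py is implemented and it
does not use ssh: the names run_workers and fake_cv_rate are hypothetical,
and threads stand in for remote machines. Each worker name takes the next
(log2c, log2g) job as soon as it is free, so a name listed twice processes
two jobs at a time.

```python
import queue
import threading


def run_workers(jobs, worker_names, evaluate):
    """Let each named worker pull (log2c, log2g) jobs from a shared queue.

    Hypothetical sketch: threads stand in for machines reached over ssh.
    """
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                log2c, log2g = job_queue.get_nowait()
            except queue.Empty:
                return  # no jobs left for this worker
            rate = evaluate(log2c, log2g)
            with lock:
                results.append((name, log2c, log2g, rate))

    threads = [threading.Thread(target=worker, args=(name,))
               for name in worker_names]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_cv_rate(log2c, log2g):
    """Placeholder for a real cross validation run (made-up formula)."""
    return 80.0 - abs(log2c) - abs(log2g + 7)


jobs = [(c, g) for c in (-1, 5) for g in (-1, -7)]
workers = ["local", "linux1", "linux5", "linux5"]  # linux5 listed twice
results = run_workers(jobs, workers, fake_cv_rate)
best = max(results, key=lambda item: item[3])
print("best log2c=%s log2g=%s rate=%s" % best[1:])
```

The order in which workers pick up jobs is not deterministic, which is
also why the worker names in the grid.py output above appear in no fixed
order.
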
Calling grid in Python
======================

In addition to using grid.py as a command-line tool, you can use it as a
Python module.

>>> rate, param = find_parameters(dataset, options)

You need to specify `dataset'; `options' is optional (default ''). The
call returns the best CV rate and the corresponding parameters. See the
following example.

> python

>>> from grid import *
>>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
[local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
[local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
.
.
[local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
.
.
>>> rate
78.8889
>>> param
{'c': 0.5, 'g': 0.5}


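Once the best parameters are known, a natural next step is to train a
model with them. The sketch below builds an svm-train argument list from
a dictionary shaped like the `param' value shown above. The helper name
svm_train_args and the model file name are assumptions for this example;
-c and -g are the svm-train options for cost and gamma.

```python
def svm_train_args(param, dataset, model_file):
    """Build an svm-train argument list from a dict such as {'c': 0.5, 'g': 0.5}.

    Hypothetical helper; a key that is absent from param is simply skipped,
    so svm-train would fall back to its own default for that option.
    """
    args = ["svm-train"]
    for key in ("c", "g"):
        if key in param:
            args += ["-" + key, str(param[key])]
    return args + [dataset, model_file]


param = {'c': 0.5, 'g': 0.5}
print(" ".join(svm_train_args(param, "heart_scale", "heart_scale.model")))
# prints: svm-train -c 0.5 -g 0.5 heart_scale heart_scale.model
```

The resulting list could be passed to subprocess.run, assuming svm-train
is on the search path.
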
Part III: LIBSVM format checking tools

Introduction
============

`svm-train' conducts only a simple check of the input data. To do a
detailed check, we provide a Python script `checkdata.py'.

Usage: checkdata.py dataset

Exit status (returned value): 1 if there are errors, 0 otherwise.

This tool was written by Rong-En Fan at National Taiwan University.

Example
=======

> cat bad_data
1 3:1 2:4
> python checkdata.py bad_data
line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
Found 1 lines with error.

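The ascending-order rule reported above can be sketched as follows. This
is a simplified illustration, not the code of checkdata.py: the function
name check_ascending is hypothetical, only this one rule is checked, and
treating a repeated index as an error is an assumption.

```python
def check_ascending(line):
    """Return an error string if feature indices are not ascending, else None.

    Hypothetical helper; expects a line of the form 'label index:value ...'.
    """
    tokens = line.split()
    previous = None  # (index, token) of the preceding feature
    for token in tokens[1:]:  # tokens[0] is the label
        index = int(token.split(":", 1)[0])
        if previous is not None and index <= previous[0]:
            return "previous/current features %s %s" % (previous[1], token)
        previous = (index, token)
    return None


print(check_ascending("1 3:1 2:4"))  # prints: previous/current features 3:1 2:4
print(check_ascending("1 2:4 3:1"))  # prints: None
```
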