This directory includes some useful tools:

1. Subset selection tools
2. Parameter selection tools
3. LIBSVM format checking tools

Part I: Subset selection tools

Introduction
============

Training on large data sets is time-consuming. Sometimes one should work
on a smaller subset first. The python script subset.py randomly selects a
specified number of samples. For classification data, we provide a
stratified selection to ensure the same class distribution in the
subset.

Usage: subset.py [options] dataset number [output1] [output2]

This script selects a subset of the given data set.

options:
-s method : method of selection (default 0)
     0 -- stratified selection (classification only)
     1 -- random selection

output1 : the subset (optional)
output2 : the rest of data (optional)

If output1 is omitted, the subset will be printed on the screen.

Example
=======

> python subset.py heart_scale 100 file1 file2

This selects 100 samples from heart_scale (using the default stratified
selection) and stores them in file1. All remaining instances are stored
in file2.


Part II: Parameter Selection Tools

Introduction
============

grid.py is a parameter selection tool for C-SVM classification using
the RBF (radial basis function) kernel. It uses cross validation (CV)
to estimate the accuracy of each parameter combination in the specified
range and helps you decide the best parameters for your problem.

grid.py directly executes libsvm binaries (so no python binding is needed)
for cross validation and then draws a contour of the CV accuracy using
gnuplot. You must have libsvm and gnuplot installed before using it. The
package gnuplot is available at http://www.gnuplot.info/

On Mac OSX, the precompiled gnuplot file needs the library AquaTerm,
which thus must be installed as well.
In addition, this version of gnuplot does not support png, so you need
to change "set term png transparent small" and use another image format.
For example, you may have "set term pbm small color".

Usage: grid.py [grid_options] [svm_options] dataset

grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
    begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
    begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
    "null"         -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
    pathname -- set gnuplot executable path and name
    "null"   -- do not plot
-out {pathname | "null"} : (default dataset.out)
    pathname -- set output file path and name
    "null"   -- do not output file
-png pathname : set graphic output file path and name (default dataset.png)
-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
    Use this option only if some parameters have been checked for the SAME data.

svm_options : additional options for svm-train

The program conducts v-fold cross validation using parameter C (and gamma)
= 2^begin, 2^(begin+step), ..., 2^end.

You can specify where the libsvm executable and gnuplot are using the
-svmtrain and -gnuplot parameters.

Windows users should use pgnuplot.exe. If you are using gnuplot
3.7.1, please upgrade to version 3.7.3 or higher, because version 3.7.1
has a bug. If you use cygwin on Windows, please use gnuplot-x11.

If the task is terminated accidentally or you would like to change the
range of parameters, you can apply '-resume' to save time by re-using
previous results.
You may specify the output file of a previous run
or use the default (i.e., dataset.out) without giving a name. Please
note that the same conditions must be used in both runs. For example,
you cannot use '-v 10' earlier and resume the task with '-v 5'.

The value of some options can be "null". For example, `-log2c -1,0,1
-log2g "null"' means that C=2^-1,2^0,2^1 and g=LIBSVM's default gamma
value. That is, you do not conduct parameter selection on gamma.

Example
=======

> python grid.py -log2c -5,5,1 -log2g -4,0,1 -v 5 -m 300 heart_scale

Users (in particular MS Windows users) may need to specify the path of
executable files. You can either change the paths at the beginning of
grid.py or specify them on the command line. For example,

> grid.py -log2c -5,5,1 -svmtrain "c:\Program Files\libsvm\windows\svm-train.exe" -gnuplot c:\tmp\gnuplot\binary\pgnuplot.exe -v 10 heart_scale

Output: two files
dataset.png: the CV accuracy contour plot generated by gnuplot
dataset.out: the CV accuracy at each (log2(C),log2(gamma))

The following example saves running time by loading the output file of a previous run.

> python grid.py -log2c -7,7,1 -log2g -5,2,1 -v 5 -resume heart_scale.out heart_scale

Parallel grid search
====================

You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, add the
machine names in grid.py:

ssh_workers = ["linux1", "linux5", "linux5"]

and then set up ssh so that authentication works without asking for a
password.

The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or more RAM. If the local machine is the
best, you can also increase nr_local_worker.
For example:

nr_local_worker = 2

Example:

> python grid.py heart_scale
[local] -1 -1 78.8889 (best c=0.5, g=0.5, rate=78.8889)
[linux5] -1 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
[linux5] 5 -1 77.037 (best c=0.5, g=0.0078125, rate=83.3333)
[linux1] 5 -7 83.3333 (best c=0.5, g=0.0078125, rate=83.3333)
.
.
.

If -log2c, -log2g, or -v is not specified, default values are used.

If your system uses telnet instead of ssh, list the computer names
in telnet_workers.

Calling grid in Python
======================

In addition to using grid.py as a command-line tool, you can use it as a
Python module.

>>> rate, param = find_parameters(dataset, options)

You need to specify `dataset' and `options' (default ''). See the following example.

> python

>>> from grid import *
>>> rate, param = find_parameters('../heart_scale', '-log2c -1,1,1 -log2g -1,1,1')
[local] 0.0 0.0 rate=74.8148 (best c=1.0, g=1.0, rate=74.8148)
[local] 0.0 -1.0 rate=77.037 (best c=1.0, g=0.5, rate=77.037)
.
.
[local] -1.0 -1.0 rate=78.8889 (best c=0.5, g=0.5, rate=78.8889)
.
.
>>> rate
78.8889
>>> param
{'c': 0.5, 'g': 0.5}


Part III: LIBSVM format checking tools

Introduction
============

`svm-train' conducts only a simple check of the input data. To do a
detailed check, we provide a python script `checkdata.py'.

Usage: checkdata.py dataset

Exit status (returned value): 1 if there are errors, 0 otherwise.

This tool was written by Rong-En Fan at National Taiwan University.

Example
=======

> cat bad_data
1 3:1 2:4
> python checkdata.py bad_data
line 1: feature indices must be in an ascending order, previous/current features 3:1 2:4
Found 1 lines with error.
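
The ascending-order rule reported in the example above can be
illustrated with a short Python sketch. This is not the code of
checkdata.py; the function name find_unordered_lines is hypothetical,
and treating a repeated index as an error is an assumption. It only
shows the idea of comparing each feature index with the previous one
on the same line.

```python
def find_unordered_lines(lines):
    """Return the 1-based numbers of lines whose feature indices
    are not strictly ascending (hypothetical helper, not checkdata.py)."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        tokens = line.split()
        prev = None
        # tokens[0] is the label; the rest are index:value pairs.
        for token in tokens[1:]:
            index = int(token.split(":", 1)[0])
            # Assumption: a repeated index counts as an ordering error.
            if prev is not None and index <= prev:
                bad.append(lineno)
                break
            prev = index
    return bad


if __name__ == "__main__":
    # The first line mirrors bad_data above; the second is well ordered.
    print(find_unordered_lines(["1 3:1 2:4", "1 2:4 3:1"]))
```

The sketch assumes every feature token has the form index:value with
an integer index; a malformed token raises ValueError instead of being
reported, so checkdata.py remains the tool to use for a real check.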