Improved SVM Regression using Mixtures of Kernels
G.F. Smits, E.M. Jordaan
Dow Chemical Company, Terneuzen, The Netherlands
Department of Mathematics and Computer Science, Eindhoven University of Technology, The Netherlands
Abstract: Kernels are used in Support Vector Machines to map the learning data nonlinearly into a higher dimensional feature space where the computational power of the linear learning machine is increased. Every kernel has its advantages and disadvantages: a characteristic that is desirable for learning may not be desirable for generalization. Preferably, the good characteristics of two or more kernels should be combined. In this paper it is shown that using mixtures of kernels can result in having both good interpolation and extrapolation abilities. The performance of this method is illustrated with an artificial as well as an industrial data set.
I INTRODUCTION
Kernels are used in Support Vector Machines to map the learning data nonlinearly into a higher dimensional feature space where the computational power of the linear learning machine is increased [4]. The SVM for regression using a kernel function and the ε-insensitive loss function is formulated as

    min  (1/2) ||w||² + C Σ_{i=1}^{ℓ} (ξ_i + ξ_i*)                    (1)
    s.t.  (w · x_i) + b − y_i ≤ ε + ξ_i,     i = 1, …, ℓ,              (2)
          y_i − (w · x_i) − b ≤ ε + ξ_i*,    i = 1, …, ℓ,              (3)
          ξ_i, ξ_i* ≥ 0,                     i = 1, …, ℓ,

for a given learning data set x_i ∈ X ⊂ R^n, y_i ∈ R with ℓ observations.
A standard optimization technique for solving QP problems is to solve the dual problem through the introduction of Lagrange multipliers. For each constraint in (1) there is a corresponding Lagrange multiplier. For more information on the Wolfe dual formulation, see [4] or [2].
Let the Lagrange multipliers for the problem in (1), a_i and a_i*, correspond with the constraints in (2) and (3) defined by the data point (x_i, y_i), respectively. The combined weight of each data point x_i is determined by a_i − a_i*. The approximating function is given in terms of the weights, the kernel and the bias,

    f(x) = Σ_{i=1}^{ℓ} (a_i − a_i*) K(x_i, x) + b.                    (4)
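As an illustration of (1)-(4), the sketch below fits an ε-SVR with scikit-learn (not the implementation used in this paper) and checks that the fitted model can be written in the form of (4) using its support vectors, dual coefficients a_i − a_i* and bias. The toy data, the width σ and all variable names are illustrative assumptions.

```python
# Minimal sketch: an epsilon-SVR and the expansion (4) read off from the fit.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(100, 1))        # learning inputs x_i (illustrative)
y = np.sin(5 * X[:, 0]) + 0.1 * X[:, 0]          # target values y_i (illustrative)

sigma = 0.1                                      # assumed RBF width
gamma = 1.0 / (2 * sigma ** 2)                   # exp(-||x - x_i||^2 / (2 sigma^2))
model = SVR(kernel="rbf", gamma=gamma, C=1000, epsilon=0.01).fit(X, y)

# dual_coef_ holds (a_i - a_i*) for the support vectors, intercept_ is the bias b,
# so a prediction can be reassembled exactly as in (4).
x_new = np.array([[0.2]])
K = rbf_kernel(model.support_vectors_, x_new, gamma=gamma)
f_manual = model.dual_coef_ @ K + model.intercept_
assert np.allclose(f_manual.ravel(), model.predict(x_new))
```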
Many of the characteristics of the model in (4) are determined by the type of kernel function used. Every kernel has its advantages and disadvantages. In this paper, the influence of different types of kernels on the interpolation and extrapolation capabilities is investigated. In the next section two basic types of kernel functions, local and global, are discussed and examples of such kernels are given. In Section III the interpolation and extrapolation capabilities of models using the two types of kernels are shown. A new type of kernel, a mixture of local and global kernels, is introduced in Section IV and the advantages of using such a mixture are shown. Finally, in Section V, the mixed kernel approach is tested using an industrial data set.
II LOCAL AND GLOBAL KERNELS
Many of the characteristics of the model in (4) are determined by the type of kernel function being used. The level of nonlinearity is determined by the kernel function. One can use a simple inner product kernel, which produces a linear mapping, or more complicated kernels such as spline kernels and Fourier transform kernels. In order to guarantee that the resulting kernel admits the expansion in (4), the kernel function must satisfy Mercer's Theorem [1].
Numerous possible kernels exist and it is difficult to explain their individual characteristics. However, there are two main types of kernels, namely local and global kernels. In a local kernel, only the data that are close to or in the proximity of each other have an influence on the kernel values. In contrast, a global kernel allows data points that are far away from each other to have an influence on the kernel values as well. For more information on the characteristics of local and global kernels see [3].
An example of a typical local kernel is the Radial Basis Function (RBF) kernel,

    K(x, x_i) = exp( −||x − x_i||² / (2σ²) )                          (5)

where the kernel parameter is the width, σ, of the radial basis function. In Figure 1(a) the local effect of the RBF kernel is shown for a chosen test input, for different values of the width σ. A local kernel only has an effect on the data points in the neighborhood of the test point.

The Polynomial kernel, a typical example of a global kernel, is defined as

    K(x, x_i) = (x · x_i + 1)^q                                       (6)

where the kernel parameter q is the degree of the polynomial to be used. Note that in (6) every data point from the set x has an influence on the kernel value of the test point x_i, irrespective of its actual distance from x_i. In Figure 1(b), the global effect of the Polynomial kernel of various degrees is shown. Consider the same test data point as used in the case of the local kernel. For each degree of polynomial, all data points in the input domain have nonzero kernel values; the test data point has a global effect on the other data points.

Figure 1: Examples of (a) a local kernel (RBF) and (b) a global kernel (Polynomial).
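The following sketch, which is not part of the original analysis, evaluates the kernels (5) and (6) between a single test input and points spread over the domain, mirroring the local versus global behaviour of Figure 1. The grid, the test point and the parameter values are assumptions.

```python
# Local vs. global kernel values for one test input (illustrative parameters).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

x_grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)   # points covering the input domain
x_test = np.array([[0.4]])                            # chosen test input

# Local kernel (5): only points near x_test take noticeably non-zero values.
for sigma in (0.05, 0.1, 0.2):
    k_rbf = rbf_kernel(x_grid, x_test, gamma=1.0 / (2 * sigma ** 2))
    print(f"RBF sigma={sigma}: fraction of grid with K > 0.01 = {np.mean(k_rbf > 0.01):.2f}")

# Global kernel (6): every point in the domain contributes a non-zero value.
for q in (1, 2, 3):
    k_poly = polynomial_kernel(x_grid, x_test, degree=q, gamma=1.0, coef0=1.0)
    print(f"poly degree={q}: minimum kernel value over the grid = {k_poly.min():.3f}")
```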
III INTERPOLATION AND EXTRAPOLATION
The quality of a model is not only determined by its ability to learn from the data but also by its ability to predict unseen data. These two characteristics are often called learning capacity and generalization ability. Models are seldom good in both. A typical example is the interpolation and extrapolation ability of a model. In the case of the SVM, these characteristics are largely determined by the choice of kernel and kernel parameters. Using a specific type of kernel has its advantages and disadvantages. The Polynomial and RBF kernels are used again for the analysis of the interpolation and extrapolation abilities. The reason for analyzing these two types of kernels is twofold. Firstly, they can be used as representatives of a broader class of global and local kernels, respectively. Secondly, these kernels have computational advantages over other kernels, since it is easier and faster to compute the kernel values.
For the analysis, the following sinc function with
an added linear component, is used:
    y = sin(50x) / (50x) + x/5                                        (7)
where x denotes 250 data points drawn from a uniform random distribution between [−1, 1]. As the learning set, x ∈ [−0.5, 0.5] and the corresponding output y are used. For the SVM models the following parameters were used: ε = 0.01, C = 1000 and the linear ε-insensitive loss function. In Figure 2, Support Vector Machines using polynomial kernels of various degrees were determined. The SVMs were trained on the data between the vertical dashed lines. Beyond these lines, the SVM has no information available and any prediction is the result of extrapolation. In Figure 2(a) it is observed that for lower degrees of the polynomial kernel the extrapolation ability gets better. However, for good interpolation ability higher degree polynomials are required, as seen in Figure 2(b). No single choice of the kernel parameter, the degree of the polynomial, results in an SVM that provides both good interpolation and extrapolation properties.

Figure 2: SVM with various polynomial kernels.

In Figure 3 the Radial Basis Function kernel is analyzed in the same manner. The width of the kernels was varied between 0.01 and 0.25. The SVMs were again trained on the data between the vertical dashed lines. Beyond these lines, the SVM has no information available and needs to extrapolate. When one attempts to extrapolate beyond the input data range extended by the width of the kernel, there is no local information available and the predicted values level off. This is clearly seen in Figure 3(a). If one uses too large values of σ, as seen in Figure 3(b), the interpolation ability of the RBF kernel decreases. Therefore, no single value of the kernel parameter σ will provide a model with both good interpolation and extrapolation properties.

Figure 3: SVM with various RBF kernels.
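A rough reconstruction of this experiment is sketched below, using scikit-learn's SVR rather than the authors' implementation. The target function is an assumed sinc-plus-linear term (the exact form of (7) is not fully legible in the source), and the reported error measure is illustrative.

```python
# Train on the inner interval [-0.5, 0.5], then extrapolate to the rest of [-1, 1].
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x_all = rng.uniform(-1.0, 1.0, size=(250, 1))
y_all = np.sinc(50 * x_all[:, 0] / np.pi) + x_all[:, 0] / 5   # assumed target, cf. (7)

train = np.abs(x_all[:, 0]) <= 0.5                            # learning set
test = ~train                                                 # extrapolation region

settings = [("poly, q=1",       dict(kernel="poly", degree=1, gamma=1.0, coef0=1.0)),
            ("poly, q=5",       dict(kernel="poly", degree=5, gamma=1.0, coef0=1.0)),
            ("rbf, sigma=0.05", dict(kernel="rbf", gamma=1.0 / (2 * 0.05 ** 2))),
            ("rbf, sigma=0.25", dict(kernel="rbf", gamma=1.0 / (2 * 0.25 ** 2)))]

for name, params in settings:
    model = SVR(C=1000, epsilon=0.01, **params).fit(x_all[train], y_all[train])
    err = np.abs(model.predict(x_all[test]) - y_all[test]).mean()
    print(f"{name}: mean absolute extrapolation error = {err:.3f}")
```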
IV MIXTURES OF KERNELS
From the previous section, it is observed that a polynomial kernel (a global kernel) shows better extrapolation abilities at lower degrees, but requires higher degrees for good interpolation. On the other hand, the RBF kernel (a local kernel) has good interpolation abilities, but fails to provide longer range extrapolation. Preferably, one wants to combine the good characteristics of two kernels. Therefore, we investigate whether the advantages of polynomial and RBF kernels can be combined by using mixtures.
There are several ways of mixing kernels. What is important, though, is that the resulting kernel must be an admissible kernel. One way to guarantee that the mixed kernel is admissible is to use a convex combination of the two kernels K_poly and K_rbf, for example
    K_mix = p · K_poly + (1 − p) · K_rbf,
where the optimal mixing coefficient p has to be determined. The value of p is a constant scalar.
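A convex mixture of this kind can be implemented directly, for example as a callable kernel for scikit-learn's SVR. The sketch below is an illustration, not the authors' code; the helper name make_mixed_kernel and the parameter values in the usage comment are assumptions.

```python
# Convex mixture of a polynomial and an RBF kernel, usable as an SVR kernel.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def make_mixed_kernel(p, degree, sigma):
    """K_mix = p * K_poly + (1 - p) * K_rbf, admissible for 0 <= p <= 1."""
    gamma = 1.0 / (2 * sigma ** 2)
    def k_mix(X, Y):
        return (p * polynomial_kernel(X, Y, degree=degree, gamma=1.0, coef0=1.0)
                + (1 - p) * rbf_kernel(X, Y, gamma=gamma))
    return k_mix

# Example usage on some training data X_train, y_train (assumed to exist):
# model = SVR(kernel=make_mixed_kernel(p=0.95, degree=1, sigma=0.15),
#             C=1000, epsilon=0.01).fit(X_train, y_train)
```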
Figure 4: Example of a Mixed Kernel.
In Figure 4, the effect of mixing a Polynomial kernel with an RBF kernel is shown. The same example as in the cases of the Polynomial and RBF kernels was used to show the combined effect of using a mixture of a local and a global kernel. The degree of the polynomial and the width of the RBF were fixed to 1 and 0.15, respectively. The mixing coefficient was varied between 0.5 and 0.95 only; smaller mixing coefficients are not shown, because the effect of the global kernel is insignificant below that level. Figure 4 shows that the kernel not only has a local effect, but also a global effect. By increasing the mixing coefficient, the influence of the global kernel is also increased.
Another possibility for mixing kernels is to use different values of p for different regions of the input space; p is then a vector. Through this approach the relative contribution of both kernels to the model can be varied over the input space. In this paper, a uniform p over the entire input space is used.
The same data set used in the previous section will also be used to examine a mixture of polynomial and RBF kernels.
Figure 5: SVM using a Mixed Kernel and various values for p.
In Figure 5 the SVM predictions of the test set are given using mixing coefficients p in the range [0.5, 0.99]. Here, a value of p = 0.95, for example, means that the relative contribution of the Polynomial kernel to the mixed kernel is 95%, whilst the RBF kernel contributes the remaining 5%. Again, the SVMs were trained on only the data between the vertical dashed lines. Beyond these lines, the SVM has no information available and any prediction is the result of extrapolation. Note that the SVM models using the mixed kernel have far better extrapolation capabilities than the SVMs using either the Polynomial or the RBF kernel on their own.
What is remarkable is that only a pinch of the RBF kernel (1 − p = 0.01) needs to be added to the Polynomial kernel to obtain a combination of good interpolation and extrapolation abilities for the same kernel parameter values. It is striking that both these properties can be achieved with a single choice of parameters for a mixed kernel. Using higher degrees of polynomials or larger widths of RBF kernels did not produce better results. In Figure 6 the effect of increasing the mixing coefficient on the relative error is shown. The degree of the polynomial and the width of the RBF were kept fixed at q = 1 and σ = 0.05, respectively. The turning point, where the quality of the model decreases again, is very close to one. In practice, one can be satisfied with a near-optimum value that is a bit smaller than the exact optimum.
Figure 6: Effect of increasing p on Relative Error.
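A sweep of the mixing coefficient in the spirit of Figure 6 can be sketched as follows. The data generation, the relative-error definition and the parameter values follow the text where stated and are otherwise assumptions.

```python
# Relative error on the extrapolation region as a function of the mixing coefficient p.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(250, 1))
y = np.sinc(50 * x[:, 0] / np.pi) + x[:, 0] / 5          # assumed target, cf. (7)
train = np.abs(x[:, 0]) <= 0.5                           # learn on the inner interval only

sigma = 0.05
gamma = 1.0 / (2 * sigma ** 2)
for p in (0.5, 0.8, 0.95, 0.99, 1.0):
    kernel = lambda A, B, p=p: (p * polynomial_kernel(A, B, degree=1, gamma=1.0, coef0=1.0)
                                + (1 - p) * rbf_kernel(A, B, gamma=gamma))
    model = SVR(kernel=kernel, C=1000, epsilon=0.01).fit(x[train], y[train])
    pred = model.predict(x[~train])
    rel_err = np.linalg.norm(pred - y[~train]) / np.linalg.norm(y[~train])
    print(f"p = {p:.2f}: relative error on the extrapolation region = {rel_err:.3f}")
```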
V INDUSTRIAL EXAMPLE
An industrial data set was used to test whether the advantages of using mixed kernels also apply to noisy, real-life data. The data were obtained from a process of the Dow Chemical Company in Freeport, USA. The input variable, a ratio of two temperatures, was selected after an extensive feature selection process using Neural Networks. All observations with a temperature ratio outside the [0.9, 1.1] range, together with a number of randomly selected observations within that range, were used as test data. The rest were used as learning data. The learning set consisted of 627 observations and the test set had 561 data points. The learning input data were range-scaled to [0, 1] and the resulting scaling parameters were then used to scale the test data.
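A minimal sketch of this preprocessing step is given below (not the authors' code): the scaler is fitted on the learning inputs only and its parameters are reused for the test inputs. The placeholder arrays stand in for the proprietary data.

```python
# Range-scale the learning inputs to [0, 1]; reuse the same scaling for the test inputs.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_learn = rng.uniform(0.9, 1.1, size=(627, 1))   # placeholder for the 627 learning ratios
X_test = rng.uniform(0.8, 1.2, size=(561, 1))    # placeholder for the 561 test ratios

scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_learn)   # parameters from learning data only
X_learn_scaled = scaler.transform(X_learn)
X_test_scaled = scaler.transform(X_test)          # test data may fall outside [0, 1], as intended
```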
SVM models using a polynomial kernel and an RBF kernel individually were then determined. In all models an ε value of 5 was used, because that is an acceptable error level in the process. For the SVM models with polynomial kernels, degrees from 1 up to 5 were used. The predictions for the test set can be found in Figure 7. Figure 7 also indicates where a model interpolates and where it extrapolates. The model interpolates where data were available during the training phase, indicated by the area between the vertical dashed lines. Outside this area, the model extrapolates, since no information was available there during the learning process.

The number of support vectors used by each model ranges between 8 and 11 percent of the learning data. Note that the SVMs using the second and third degree polynomials are able to roughly follow the trend of the test data, but fail to predict it accurately. Increasing the polynomial order does not improve the generalization capabilities.

Figure 7: SVM predictions of the test set, using polynomial kernels of various degrees.

For the SVM models using RBF kernels, widths ranging between 0.1 and 0.5 were used. Again, in all models an ε value of 5 was used. Figure 8 displays the predictions of the test set as well as the known input space (the area between the vertical dashed lines). The number of support vectors used by each model ranges between 7 and 10 percent of the learning data. One inherent property of RBF kernels is clearly seen in Figure 8: although the RBF kernel interpolates very well within the known input space, it fails to extrapolate outside the range of its width. In both cases the SVM used more or less 10 percent of the learning data as support vectors. Both models predict the learning set fairly well, but fail to predict the test set outside the known range accurately.

Figure 8: SVM predictions of the test set, using RBF kernels of various widths.

In Figure 9, the mixed kernel approach is shown. From the analysis of the kernel parameters it is found that an appropriate choice of kernel parameters is a second degree polynomial, combined with an RBF of width 0.15 and a mixing coefficient of 0.98. The top graph shows the prediction of the learning set, together with the ε-insensitive tube and the support vectors (encircled). The bottom graph depicts the prediction of the test set. The number of support vectors used is also in the region of 8 percent of the learning data. The model with the mixed kernel is able to interpolate the sharp turning point of the learning data as well as extrapolate outside the known input space. The mixed kernel approach was also applied to higher dimensional problems and similar generalization performance improvements were found.

Figure 9: SVM with q = 2, σ = 0.15, p = 0.98.
VI CONCLUSIONS
It is shown that, where the RBF kernel fails to extrapolate and a very high degree Polynomial kernel is needed to interpolate well, the mixture of the two kernels is able to do both. Furthermore, a model that interpolates and extrapolates well can be built using a single choice for each of the kernel parameters.
Having the ability to both interpolate and extrapolate well now opens the door to making use of prior information. If, for example, the asymptotic behavior of a process is known from fundamental models, this information can be used as prior information. The SVM using a mixture of kernels will be able not only to learn from the data but also to take into account the behavior of the process in the limit. Future work includes the analysis of mixed kernels using prior information.
Further investigation is needed into why only a very small percentage of the RBF kernel is needed. There are of course other kernels that could be used for mixing. For example, mixing several RBF kernels may be useful for problems with a nonuniform data density in the input space [5].
References
[1] V. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory and Methods, John Wiley & Sons, 1998.
[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[3] A. J. Smola, Learning with Kernels, Ph.D. Thesis, TU Berlin, 1998.
[4] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[5] C. K. I. Williams and M. Seeger, The effect of the input density distribution on kernel-based classifiers, in: Proceedings of the Seventeenth International Conference on Machine Learning, 2000.