Helsinki University of Technology
Publications in Computer and Information Science Report E3
April 2006
COMPACT MODELING OF DATA USING INDEPENDENT
VARIABLE GROUP ANALYSIS
Esa Alhoniemi Antti Honkela Krista Lagus Jeremias Seppä
Paul Wagner Harri Valpola
Distribution:
Helsinki University of Technology
Department of Computer Science and Engineering
Laboratory of Computer and Information Science
P.O. Box 5400
FI-02015 TKK, Finland
Tel. +358-9-451 3267
Fax +358-9-451 3277
This report is downloadable at
http://www.cis.hut.fi/Publications/
ISBN 951-22-8166-X
ISSN 1796-2803
Compact Modeling of Data Using Independent
Variable Group Analysis
Esa Alhoniemi, Antti Honkela, Krista Lagus, Jeremias Seppä, Paul Wagner, and Harri Valpola
Abstract—We introduce a principle called independent variable group analysis (IVGA) which can be used for finding an efficient structural representation for a given data set. The basic idea is to determine such a grouping for the variables of the data set that mutually dependent variables are grouped together whereas mutually independent or weakly dependent variables end up in separate groups.

Computation of any model that follows the IVGA principle requires a combinatorial algorithm for grouping of the variables and a modeling algorithm for the groups. In order to be able to compare different groupings, a cost function which reflects the quality of a grouping is also required. Such a cost function can be derived, for example, using the variational Bayesian approach, which is employed in our study. This approach is also shown to be approximately equivalent to minimizing the mutual information between the groups.

The modeling task is computationally demanding. We describe an efficient heuristic grouping algorithm for the variables and derive a computationally light nonlinear mixture model for modeling the dependencies within the groups. Finally, we carry out a set of experiments which indicate that the IVGA principle can be beneficial in many different applications.

Index Terms—compact modeling, independent variable group analysis, mutual information, variable grouping, variational Bayesian learning

E. Alhoniemi is with the Department of Information Technology, University of Turku, Lemminkäisenkatu 14 A, FI-20520 Turku, Finland (e-mail: esa.alhoniemi@utu.fi).
A. Honkela, K. Lagus, J. Seppä, and P. Wagner are with the Adaptive Informatics Research Centre, Helsinki University of Technology, P.O. Box 5400, FI-02015 TKK, Finland (e-mail: antti.honkela@tkk.fi, krista.lagus@tkk.fi).
H. Valpola is with the Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9203, FI-02015 TKK, Finland (e-mail: harri.valpola@tkk.fi).

I. INTRODUCTION

The study of effective ways of finding compact representations from data is important for the automatic analysis and exploration of complex data sets and natural phenomena. Finding properties of the data that are not related can help in discovering compact representations, as it saves one from having to model the mutual interactions of unrelated properties.

It seems evident that humans group related properties as a means for understanding complex phenomena. An expert of a complicated industrial process such as a paper machine may describe the relations between different control parameters and measured variables by groups: A affects B and C, and so on. This grouping is of course not strictly valid, as all the variables eventually depend on each other, but it helps in describing the most important relations and thus makes it possible for the human to understand the system. Such groupings also significantly help the interaction with the process. Automatic discovery of such groupings would help in designing visualizations and control interfaces that reduce the cognitive load of the user by allowing her to concentrate on the essential details.

Analyzing and modeling intricate and possibly nonlinear dependencies between a very large number of real-valued variables (features) is a hard problem. Learning such models from data generally requires a great deal of computational power and memory. If one does not limit the problem by assuming only linear or other restricted dependencies between the variables, essentially the only way to do this is to actually try to model the data set using different model structures. One then needs a principled way to score the structures, such as a cost function that accounts for the model complexity as well as model accuracy.

The remainder of the article is organized as follows. In Section II we describe a computational principle called Independent Variable Group Analysis (IVGA) by which one can learn a structuring of the problem from data. In short, IVGA does this by finding a partition of the set of input variables that minimizes the mutual information between the groups, or equivalently the cost of the overall model, including the cost of the model structure and the representation accuracy of the model. Its connections to related methods are discussed in Section II-B.

The problem of modeling-based estimation of mutual information is discussed in Section III. The approximation turns out to be equivalent to variational Bayesian learning. Section III also describes one possible computational model for representing a group of variables as well as the cost function for that model. The algorithm that we use for finding a good grouping is outlined in Section IV along with a number of speedup techniques.

In Section V we examine how well the IVGA principle and the current method for solving it work both on an artificial toy problem and on two real data sets of printed circuit board assembly component database setting values and ionosphere radar measurements.

The IVGA principle and an initial computational method were introduced in [1], and some further experiments were presented in [2]. In the current article we derive the connection between mutual information and variational Bayesian learning and describe the current, improved computational method in more detail. The applied mixture model for mixed real and nominal data is presented along with a derivation of the cost function. Details of the grouping algorithm and necessary speedups are also presented. Completely new experiments include an application of IVGA to supervised learning.
Fig. 1. An illustration of the IVGA principle. The upper part of the figure shows the actual dependencies between the observed variables. The arrows that connect variables indicate causal dependencies. The lower part depicts the variable groups that IVGA might find here. One actual dependency is left unmodeled, namely the one between Z and E. Note that the IVGA does not reveal causalities, but dependencies between the variables only.

II. INDEPENDENT VARIABLE GROUP ANALYSIS (IVGA) PRINCIPLE

The ultimate goal of Independent Variable Group Analysis (IVGA) [1] is to partition a set of variables (also known as attributes or features) into separate groups so that the statistical dependencies of the variables within each group are strong. These dependencies are modeled, whereas the weaker dependencies between variables in different groups are disregarded. The IVGA principle is depicted in Fig. 1.

We wish to emphasize that IVGA should be seen as a principle, not an algorithm. However, in order to determine a grouping for observed data, a combinatorial grouping algorithm for the variables is required. Usually this algorithm is heuristic, since exhaustive search over all possible variable groupings is computationally infeasible.

The combinatorial optimization algorithm needs to be complemented by a method to score different groupings, or a cost function for the groups. Suitable cost functions can be derived in a number of ways, such as using the mutual information between different groups or as the cost of an associated model under a suitable framework such as minimum description length (MDL) or variational Bayes. All of these alternatives are actually approximately equivalent, as presented in Sec. III. It should be noted that the models used in the model-based approaches need not be of any particular type; as a matter of fact, the models within a particular modeling problem do not necessarily need to be of the same type, that is, each variable group could even be modeled using a different model type.

It is vital that the models for the groups are fast to compute and that the grouping algorithm is efficient, too. In Section IV-A, such a heuristic grouping algorithm is presented. Each variable group is modeled by using a computationally relatively light mixture model which is able to model nonlinear dependencies between both nominal and real-valued variables at the same time. Variational Bayesian modeling is considered in Section III, which also contains the derivation of the mixture model.

A. Motivation for Using IVGA

The computational usefulness of IVGA relies on the fact that if two variables are dependent on each other, representing them together is efficient, since redundant information needs to be stored only once. Conversely, joint representation of variables that do not depend on each other is inefficient. Mathematically speaking, this means that the representation of a joint probability distribution that can be factorized is more compact than the representation of a full joint distribution. In terms of a problem expressed using association rules of the form (A=0.3, B=0.9 ⇒ F=0.5, G=0.1): the shorter the rules that represent the regularities within a phenomenon, the more compact the representation is and the fewer association rules are needed. IVGA can also be given a biologically inspired motivation: with regard to the structure of the cortex, the difference between a large monolithic model and a set of models produced by the IVGA roughly corresponds to the contrast between full connectivity (all cortical areas receive inputs from all other areas) and more limited, structured connectivity.

The IVGA principle has been shown to be sound: a very simple initial method described in [1] found appropriate variable groups from data where the features were various real-valued properties of natural images. Recently, we have extended the model to handle also nominal (categorical) variables, improved the variable grouping algorithm, and carried out experiments on various different data sets.

The IVGA can be viewed in many different ways. First, it can be seen as a method for finding a compact representation of data using multiple independent models. Secondly, IVGA can be seen as a method of clustering variables. Note that it is not equivalent to taking the transpose of the data matrix and performing ordinary clustering, since dependent variables need not be close to each other in the Euclidean or any other common metric. Thirdly, IVGA can also be used as a dimensionality reduction or feature selection method. The review of related methods in Section II-B will concentrate mainly on the first two of these topics.
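To make the compactness argument of this subsection concrete, here is a simple parameter-counting illustration (ours, not from the original text): for N binary variables, a full joint distribution needs 2^N - 1 free probabilities, whereas a model that treats the groups G_1, ..., G_M as mutually independent only needs the sum of the group-wise counts,

\sum_{i=1}^{M} \bigl( 2^{|G_i|} - 1 \bigr) \;\le\; 2^{N} - 1 .

For example, with N = 8 variables split into groups of sizes 4, 2 and 2, the factorized representation requires 15 + 3 + 3 = 21 numbers instead of 255.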
B. Related Work

One of the basic goals of unsupervised learning is to obtain compact representations for observed data. The methods reviewed in this section are related to IVGA in the sense that they aim at finding a compact representation for a data set using multiple independent models. Such methods include multidimensional independent component analysis (MICA, also known as independent subspace analysis, ISA) [3] and factorial vector quantization (FVQ) [4], [5].

In MICA, the goal is to find independent linear feature subspaces that can be used to reconstruct the data efficiently. Thus each subspace is able to model the linear dependencies in terms of the latent directions defining the subspace. FVQ can be seen as a nonlinear version of MICA, where the component models are vector quantizers over all the variables. The main difference between these and IVGA is that in IVGA, only one model affects a given observed variable; in the others, all models affect every observed variable. This difference, visualized in Fig. 2, makes the computation of IVGA significantly more efficient.

Fig. 2. Schematic illustrations of IVGA and related algorithms, namely MICA/ISA (Cardoso, 1998) and FVQ (Hinton & Zemel, 1994), that each look for multi-dimensional feature subspaces in effect by maximizing a statistical independence criterion. The input x is here 9-dimensional. The numbers of squares in FVQ and IVGA denote the numbers of variables modeled in each sub-model, and the numbers of black arrows in MICA the dimensionality of the subspaces. Note that with IVGA the arrows depict all the required connections, whereas with FVQ and MICA only a subset of the actual connections have been drawn (6 out of 27).

There are also a few other methods for grouping the variables based on different criteria. A graph-theoretic partitioning of the graph induced by a thresholded association matrix between variables was used for variable grouping in [6]. The method requires choosing an arbitrary threshold for the associations, but the groupings could nevertheless be used to produce smaller decision trees with equal or better predictive performance than using the full dataset.

A framework for grouping variables of a multivariate time series based on possibly lagged correlations was presented in [7]. The correlations are evaluated using Spearman's rank correlation, which can find both linear and monotonic nonlinear dependencies. The grouping method is based on a genetic algorithm, although other possibilities are presented as well. The method seems to be able to find reasonable groupings, but it is restricted to time series data and certain types of dependencies only.

Module networks [8] are a very specific class of models that is based on grouping similar variables together. They are used only for discrete data and all the variables in a group are restricted to have exactly the same distribution. The dependencies between different groups are modeled as a Bayesian network. Sharing the same model within a group makes the model easier to learn from scarce data, but severely restricts its possible uses.

For certain applications, it may be beneficial to view IVGA as a method for clustering variables. In this respect it is related to methods such as double clustering, co-clustering and biclustering, which also form a clustering not only for the samples but for the variables, too [9], [10]. The differences between these clustering methods are illustrated in Fig. 3.
III. A MODELING-BASED APPROACH TO ESTIMATING MUTUAL INFORMATION

Estimating mutual information of high dimensional data is very difficult as it requires an estimate of the probability density. We propose solving the problem by using a model-based density estimate. With some additional approximations the problem of minimizing the mutual information reduces to a problem of maximizing the marginal likelihood p(D|H) of the model. Thus minimization of mutual information is equivalent to finding the best model for the data. This model comparison task can be performed efficiently using variational Bayesian techniques.

Fig. 3. Schematic illustrations of the IVGA together with regular clustering and biclustering. In biclustering, homogeneous regions of the data matrix are sought for. The regions usually consist of a part of the variables and a part of the samples only. In IVGA, the variables are clustered based on their mutual dependencies. If the individual groups are modeled using mixture models, a secondary clustering of each group is also obtained, as marked by the dashed lines in the rightmost subfigure.

A. Approximating the Mutual Information

Let us assume that the data set D consists of vectors x(t), t = 1, ..., T. The vectors are N-dimensional with the individual components denoted by x_j, j = 1, ..., N. Our aim is to find a partition of {1, ..., N} into M disjoint sets G = {G_i | i = 1, ..., M} such that the mutual information

I_G(x) = \sum_i H(\{ x_j \mid j \in G_i \}) - H(x)    (1)

between the sets is minimized. In case M > 2, this is actually a generalization of mutual information commonly known as multi-information [11]. As the entropy H(x) is constant, this can be achieved by minimizing the first sum. The entropies of that sum can be approximated through

H(x) = -\int p(x) \log p(x) \, dx \approx -\frac{1}{T} \sum_{t=1}^{T} \log p(x(t))
\approx -\frac{1}{T} \sum_{t=1}^{T} \log p(x(t) \mid x(1), \ldots, x(t-1), H) = -\frac{1}{T} \log p(D \mid H).    (2)

Two approximations were made in this derivation. First, the expectation over the data distribution was replaced by a discrete sum using the data set as a sample of points from the distribution. Next, the data distribution was replaced by the posterior predictive distribution of the data sample given the past observations. The sequential approximation is necessary to avoid the bias caused by using the same data twice, both for sampling and for fitting the model for the same point. A somewhat similar approximation based on using the probability density estimate implied by a model has been applied for evaluating mutual information also in [12].

Using the result of Eq. (2), minimizing the criterion of Eq. (1) is equivalent to maximizing

\mathcal{L} = \sum_i \log p(\{ D_j \mid j \in G_i \} \mid H_i).    (3)

This reduces the problem to a standard Bayesian model selection problem. The two problems are, however, not exactly equivalent. The mutual information cost (1) is always minimized when all the variables are in a single group, or multiple statistically independent groups. In case of the Bayesian formulation (3), the global minimum may actually be reached for a nontrivial grouping even if the variables are not exactly independent. This allows determining a suitable number of groups even in realistic situations when there are weak residual dependencies between the groups.
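As a concrete illustration of how Eq. (3) turns grouping into model comparison, the following sketch (ours, not the authors' code) scores candidate groupings by summing per-group log marginal likelihoods. The function group_log_evidence is a placeholder for whatever model-based estimate is available, for example the variational lower bound introduced in Section III-B below.

# Illustrative sketch: scoring variable groupings with the criterion of Eq. (3).
# group_log_evidence is a stand-in for an estimate of log p({D_j | j in G_i} | H_i).
from typing import Callable, Sequence
import numpy as np

def grouping_score(data: np.ndarray,
                   grouping: Sequence[Sequence[int]],
                   group_log_evidence: Callable[[np.ndarray], float]) -> float:
    """Return L = sum_i log p({D_j | j in G_i} | H_i) for one candidate grouping.

    data: T x N matrix of observations.
    grouping: list of groups, each a list of column indices.
    """
    return sum(group_log_evidence(data[:, list(g)]) for g in grouping)

def best_grouping(data, candidate_groupings, group_log_evidence):
    """Pick the candidate grouping with the highest total score (Eq. (3))."""
    return max(candidate_groupings,
               key=lambda g: grouping_score(data, g, group_log_evidence))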
B. Variational Bayesian Learning

Unfortunately, evaluating the exact marginal likelihood is intractable for most practical models as it requires evaluating an integral over a potentially high dimensional space of all the model parameters θ. This can be avoided by using a variational method to derive a lower bound of the marginal log-likelihood using Jensen's inequality

\log p(D \mid H) = \log \int_\theta p(D, \theta \mid H) \, d\theta = \log \int_\theta q(\theta) \frac{p(D, \theta \mid H)}{q(\theta)} \, d\theta \ge \int_\theta q(\theta) \log \frac{p(D, \theta \mid H)}{q(\theta)} \, d\theta,    (4)

where q(θ) is an arbitrary distribution over the parameters. If q(θ) is chosen to be of a suitable simple factorial form, the bound can be rather easily evaluated exactly.

Closer inspection of the right hand side of Eq. (4) shows that it is of the form

B = \int_\theta q(\theta) \log \frac{p(D, \theta \mid H)}{q(\theta)} \, d\theta = \log p(D \mid H) - D_{\mathrm{KL}}(q(\theta) \,\|\, p(\theta \mid H, D)),    (5)

where D_KL(q||p) is the Kullback–Leibler divergence between distributions q and p. The Kullback–Leibler divergence D_KL(q||p) is non-negative and zero only when q = p. Thus it is commonly used as a distance measure between probability distributions although it is not a proper metric [13]. For a more thorough introduction to variational methods, see for example [14].

In addition to the interpretation as a lower bound of the marginal log-likelihood, the quantity −B may also be interpreted as a code length required for describing the data using a suitable code [15]. The code lengths can then be used to compare different models, as suggested by the minimum description length (MDL) principle [16]. This provides an alternative justification for the variational method. Additionally, the alternative interpretation can provide more intuitive explanations on why some models provide higher marginal likelihoods than others [17]. For the remainder of the paper, the optimization criterion will be the cost function

C = -B = \int_\theta q(\theta) \log \frac{q(\theta)}{p(D, \theta \mid H)} \, d\theta = D_{\mathrm{KL}}(q(\theta) \,\|\, p(\theta \mid H, D)) - \log p(D \mid H)    (6)

that is to be minimized.

C. Mixture Model for the Groups

In order to apply the variational Bayesian method described above to solve the IVGA problem, a class of models that benefits from modeling independent variables independently is needed for the groups. In this work mixture models have been used for the purpose. Mixture models are a good choice because they are simple while being able to model also nonlinear dependencies. Our IVGA model is illustrated as a graphical model in Fig. 4.

Fig. 4. Our IVGA model as a graphical model. The nodes represent variables of the model with the shaded ones being observed. The left-hand side shows the overall structure of the model with independent groups. The right-hand side shows a more detailed representation of the mixture model of a single group of three variables. Variable c indicates the generating mixture component for each data point. The boxes in the detailed representation indicate that there are T data points and in the rightmost model there are C mixture components representing the data distribution. Rectangular and circular nodes denote discrete and continuous variables, respectively.

As shown in Fig. 4, different variables are assumed to be independent within a mixture component and the dependencies only arise from the mixture. For continuous variables, the mixture components are Gaussian and the assumed independence implies a diagonal covariance matrix. Different mixture components can still have different covariances [18]. The applied mixture model closely resembles other well-known models such as soft c-means clustering and soft vector quantization [19].

For nominal variables, the mixture components are multinomial distributions. All parameters of the model have standard conjugate priors. The exact definition of the model and the approximation used for the variational Bayesian approach are presented in Appendix I and the derivation of the cost function in Appendix II.

IV. A VARIABLE GROUPING ALGORITHM FOR IVGA

The number of possible groupings of n variables is called the nth Bell number B_n. The values of B_n grow with n faster than exponentially, making an exhaustive search of all groupings infeasible. For example, B_100 ≈ 4.8 · 10^115. Hence, some computationally feasible heuristic, which can naturally be any standard combinatorial optimization algorithm, for finding a good grouping has to be deployed.

In this section, we describe an adaptive heuristic grouping algorithm for determination of the best grouping for the variables which is currently used in our IVGA implementation. After that, we also present three special techniques which are used to speed up the computation.

A. The Algorithm

The goal of the algorithm is to find such a variable grouping and such models for the groups that the total cost over all the models is minimized. The algorithm has an initialization phase and a main loop during which five different operations are consecutively applied to the current models of the variable groups and/or to the grouping until the end condition is met. A flow-chart illustration of the algorithm is shown in Fig. 5 and the phases of the algorithm are explained in more detail below.

Fig. 5. An illustration of the variable grouping algorithm for IVGA. The solid line describes control flow, the dashed lines denote low-level subroutines and their calls so that the arrow points to the called routine. The dotted line indicates adaptation of the probabilities of the five operations. Function rand() produces a random number on the interval [0,1].

Initialization. Each variable is assigned into a group of its own and a model for each group is computed.

Main loop. The following five operations are consecutively used to alter the current grouping and to improve the models of the groups. Each operation of the algorithm is assigned a probability which is adaptively tuned during the main loop: if an operation is efficient in minimizing the total cost of the model, its probability is increased, and vice versa.

Model recomputation. The purpose of this operation is twofold. (1) It tries to find an appropriate complexity for the model of a group of variables, which is the number of mixture components in the mixture model. (2) It tests different model initializations in order to avoid local minima of the cost function of the model. As the operation is performed multiple times for a group, an appropriate complexity and a good initialization are found for the model of the group.
A mixture model for a group is recomputed so that the number of mixture components may decrease, remain the same, or increase. It is slightly more probable that the number of components grows, that is, a more complex model is computed. Next, the components are initialized, for instance in the case of a Gaussian mixture by randomly selecting the centroids among the training data, and the model is roughly trained for some iterations. If a model for the group had been computed earlier, the new model is compared to the old model. The model with the smaller cost is selected as the current model for the group.

Model fine-tuning. When a good model for a group of variables has been found, it is sensible to fine-tune it further so that its cost approaches a local minimum of the cost function. During training, the model cost is never increased due to characteristics of the training algorithm. However, tuning a model of a group takes many iterations of the learning algorithm and it is not sensible to do that for all the models that are used.

Moving a variable. This operation improves an existing grouping so that a single variable which is in a wrong group is moved to a more appropriate group. First, one variable is randomly selected among all the variables of all groups. The variable is removed from its original group and moved to every other group (also to a group of its own) at a time. For each new group candidate, the cost of the model is roughly estimated. If the move reduces the total cost compared to the original one, the variable is moved to the group which yields the highest decrease in the total cost.

Merge. The goal of the merge operation is to combine two groups in which the variables are mutually dependent. In the operation, two groups are selected randomly among the current groups. A model for the variables of their union is computed. If the cost of the model of the joint group is smaller than the sum of the costs of the two original groups, the two groups are merged. Otherwise, the two original groups are retained.

Split. The split operation breaks down one or two existing groups. The group(s) are chosen so that two variables are randomly selected among all the variables. The group(s) corresponding to the variables are then taken for the operation. Hence, the probability of a group to be selected is proportional to the size of the group. As a result, large groups, which are more likely to be heterogeneous, are chosen more frequently than smaller ones. The operation recursively calls the algorithm for the union of the selected groups. If the total cost of the resulting models is less than the sum of the costs of the original group(s), the original group(s) are replaced by the new grouping. Otherwise, the original group(s) are retained.

End condition. Iteration is stopped if the decrease of the total cost is very small in several successive iterations.
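The following condensed sketch (ours, not the authors' MATLAB implementation) shows the overall shape of the main loop described above. The helpers passed in through the operations argument stand for the five operations; each is assumed to modify the grouping and the group models in place and to return the change in total cost it achieved. The tolerance and patience values are assumptions, since the paper only states that iteration stops when the cost decrease stays very small over several successive iterations.

# Condensed sketch of the grouping main loop of Section IV-A (illustrative only).
import random

def ivga_grouping(data, cost_of, operations, max_iter=1000, tol=1e-3, patience=5):
    """operations: dict mapping an operation name to (probability, function)."""
    groups = [[j] for j in range(data.shape[1])]   # Initialization: one group per variable
    models = {tuple(g): None for g in groups}      # one model per group, computed lazily
    stalled = 0
    for _ in range(max_iter):
        total_delta = 0.0
        for name, (prob, op) in operations.items():
            if random.random() < prob:             # operation chosen with its current probability
                total_delta += op(groups, models, data, cost_of)
        # End condition: stop when the total cost has decreased only very
        # little in several successive iterations.
        stalled = stalled + 1 if -total_delta < tol else 0
        if stalled >= patience:
            break
    return groups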
B. Speedup Techniques Used in Computation of the Models

Computation of an IVGA model for a large set of variables requires computation of a huge number of models (say, thousands), because in order to determine the cost of an arbitrary variable group, a unique model for it needs to be computed (or, at least, an approximation of the cost of the model). Therefore, fast and efficient computation of models is crucial. The following three special techniques are used in order to speed up the computation of the models.

1) Adaptive Tuning of Operation Probabilities: During the main loop of the algorithm described above, five operations are used to improve the grouping and the models. Each operation has a probability which dictates how often the corresponding operation is performed (see Fig. 5). As the grouping algorithm is run for many iterations, the probabilities are slowly adapted instead of keeping them fixed because

• it is difficult to determine probabilities which are appropriate for an arbitrary data set; and
• during a run of the algorithm, the efficiency of different operations varies; for example, the split operation is seldom beneficial in the beginning of the iteration (when the groups are small), but it becomes more useful when the sizes of the groups tend to grow.

The adaptation is carried out by measuring the efficiency (in terms of reduction of the total cost of all the models) of each operation. The probabilities of the operations are gradually adapted so that the probability of an efficient operation is increased and the probability of an inefficient operation decreased. The adaptation is based on low-pass filtered efficiency, which is defined by

\text{efficiency} = -\frac{\Delta C}{\Delta t},    (7)

where ΔC is the change in the total cost and Δt is the amount of CPU time used for the operation.

Based on multiple tests (not shown here) using various data sets, it has turned out that adaptation of the operation probabilities instead of keeping them fixed significantly speeds up the convergence of the algorithm to a final grouping.
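A minimal sketch of the adaptation rule of Eq. (7) is given below, assuming an exponential low-pass filter with smoothing constant ALPHA and a simple renormalization of the probabilities; both of these details are our assumptions, as the paper does not specify them.

# Sketch of the operation-probability adaptation of Sec. IV-B.1 (illustrative only).
ALPHA = 0.1  # weight of the newest efficiency measurement

def update_operation_probabilities(probs, filtered_eff, name, delta_cost, cpu_time):
    """probs, filtered_eff: dicts keyed by operation name; modified in place."""
    efficiency = -delta_cost / max(cpu_time, 1e-12)           # Eq. (7): -dC / dt
    filtered_eff[name] = (1 - ALPHA) * filtered_eff[name] + ALPHA * efficiency
    # Operations with a larger (positive) filtered efficiency get a larger share.
    positive = {k: max(v, 0.0) for k, v in filtered_eff.items()}
    total = sum(positive.values())
    if total > 0:
        for k in probs:
            probs[k] = positive[k] / total
    return probs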
2) "Compression" of the Models: Once a model for a variable group has been computed, it is sensible to store it, because a previously computed good model for a certain variable group may be needed later.

Computation of many models (for example, a mixture model) is stochastic, because often a model is initialized randomly and trained for a number of iterations. However, computation of such a model is actually deterministic provided that the state of the (deterministic) pseudorandom number generator when the model was initialized is known. Thus, in order to reconstruct a model after it has been computed once, we need to store (i) the random seed, (ii) the number of iterations that were used to train the model, and (iii) the model structure. Additionally, it is also sensible to store (iv) the cost of the model. So, a mixture model can be compressed into two floating point numbers (the random seed and the cost of the model) and two integers (the number of training iterations and the number of mixture components).

Note that this model compression principle is completely general: it can be applied in any algorithm in which compression of multiple models is required.

3) Fast Estimation of Model Costs When Moving a Variable: When the move of a variable from one group to all the other groups is attempted, computationally expensive evaluation of the costs of multiple models is required. We use a specialized speedup technique for fast approximation of the costs of the groups: before moving a variable to another group for real, a quick pessimistic estimate of the total cost change of the move is calculated, and only those new models that look appealing are tested further.

When calculating the quick estimate for the cost change when a variable is moved from one group to another, the posterior probabilities of the mixture components are fixed and only the parameters of the components related to the moved variable are changed. The cost of these two groups is then calculated for comparison with their previous cost. The approximation can be justified by the fact that if a variable is highly dependent on the variables in a group, then the same mixture model should fit it as well.
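The model "compression" of Sec. IV-B.2 can be sketched as follows (our illustration); train_mixture stands for the training routine, which becomes deterministic once the pseudorandom generator is seeded, and the stored quadruple is exactly the information listed above.

# Sketch of the model compression idea of Sec. IV-B.2 (illustrative only).
import random

def compress_model(seed, n_iterations, n_components, cost):
    """Store only what is needed to reproduce the trained model."""
    return (seed, n_iterations, n_components, cost)

def decompress_model(compressed, data, train_mixture):
    """Rebuild the full model by re-running the identical training."""
    seed, n_iterations, n_components, _cost = compressed
    rng = random.Random(seed)          # restore the pseudorandom state used originally
    return train_mixture(data, n_components, n_iterations, rng)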
V. APPLICATIONS, EXPERIMENTS

Problems in which IVGA can be useful can be divided into the following categories. First, IVGA can be used for confirmatory purposes in order to verify human intuition of an existing grouping of variables. The first synthetic problem presented in Section V-A can be seen as an example of this type. Second, IVGA can be used to explore observed data, that is, to make hypotheses or learn the structure of the data. The discovered structure can then be used to divide a complex modeling problem into a set of simpler ones, as illustrated in Section V-B. Third, if we are dealing with a classification problem, we can use IVGA to reveal the variables that are dependent on the class variable. In other words, we can use IVGA also for variable selection or dimension reduction in supervised learning problems. This is illustrated in Section V-C.

A. Toy Example

In order to illustrate our IVGA algorithm using a simple and easily understandable example, a data set consisting of one thousand points in a four-dimensional space was synthesized. The dimensions of the data are called education, income, height, and weight. All the variables are real and the units are arbitrary. The data was generated from a distribution in which both education and income are statistically independent of height and weight.

Fig. 6 shows plots of education versus income, height vs. weight, and for comparison a plot of education vs. height. One may observe that in the subspaces of the first two plots of Fig. 6, the data points lie in a few, more concentrated clusters and thus can generally be described (modeled) with a lower cost in comparison to the third plot. As expected, when the data was given to our IVGA model, the resulting grouping was

{{education, income}, {height, weight}}.

Table I compares the costs of some possible groupings.

Fig. 6. Comparison of different two-dimensional subspaces of the data. Due to the dependencies between the variables shown in the first two pictures it is useful to model those variables together. In contrast, in the last picture no such dependency is observed and therefore no benefit is obtained from modeling the variables together.

TABLE I
A comparison of the total costs of some variable groupings of the synthetic data. The variables education, income, height, and weight are denoted here by their initial letters. Also shown is the number of real numbers required to parameterize the learned optimal Gaussian mixture component distributions. The total costs are for mixture models optimized carefully using our IVGA algorithm. The model search of our IVGA algorithm was able to discover the best grouping, that is, the one with the smallest cost.

Grouping          Total Cost   Parameters
{e,i,h,w}         12233.4      288
{e,i}{h,w}        12081.0      80
{e}{i}{h}{w}      12736.7      24
{e,h}{i}{w}       12739.9      24
{e,i}{h}{w}       12523.9      40
{e}{i}{h,w}       12304.0      56
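One way to generate data with the structure of the toy example is sketched below. The particular cluster locations and noise levels are our assumptions for illustration; the paper only states that the data are clustered in the {education, income} and {height, weight} subspaces and that the two pairs are statistically independent of each other.

# Illustrative generator for toy data with two independent, internally dependent pairs.
import numpy as np

rng = np.random.default_rng(0)
T = 1000

def sample_2d_mixture(means, scale):
    """Draw T points from an equally weighted mixture of 2-D Gaussian clusters."""
    comps = rng.integers(len(means), size=T)
    return np.asarray(means)[comps] + rng.normal(0, scale, size=(T, 2))

# {education, income}: clustered and mutually dependent ...
edu_inc = sample_2d_mixture([(12, 18), (17, 30), (22, 42)], scale=1.5)
# ... and {height, weight}: clustered and mutually dependent, but drawn
# independently of the pair above.
hei_wei = sample_2d_mixture([(160, 55), (175, 75), (195, 95)], scale=4.0)

data = np.column_stack([edu_inc, hei_wei])   # T x 4: education, income, height, weight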
B. Printed Circuit Board Assembly

In the second experiment, we constructed predictive models to support and speed up user input of component data of a printed circuit board assembly robot. When a robot is used in the assembly of a new product which contains components that have not been previously used by the robot, the data of the new components need to be manually determined and added to the existing component database of the robot by a human operator. The component data can be seen as a matrix. Each row of the matrix contains attribute values of one component and the columns of the matrix depict component attributes, which are not mutually independent. Building an input support system by modeling the dependencies of the existing data using association rules has been considered in [20]. A major problem of the approach is that extraction of the rules is computationally heavy, and memory consumption of the predictive model which contains the rules (in our case, a trie) is very high.

We divided the component data of an operational assembly robot (5 016 components, 22 nominal attributes) into a training set (80 % of the whole data) and a testing set (the remaining 20 %). The IVGA algorithm was run 200 times for the training set. In the first 100 runs (avg. cost 113 003), all the attributes were always assigned into one group. During the last 100 runs (avg. cost 113 138) we disabled the adaptation of the probabilities (see Section IV-A) to see if this would have an effect on the resulting groupings. In these runs, we obtained 75 groupings with 1 group and 25 groupings with 2–4 groups. Because we were looking for a good grouping with more than one group, we chose a grouping with 2 groups (7 and 15 attributes). The cost of this grouping was 112 387, which was not the best among all the results over 200 runs (111 791), but not very far from it.

Next, the dependencies of (1) the whole data and (2) the 2 variable groups were modeled using association rules. The large sets required for computation of the rules were computed using a freely available software implementation¹ of the Eclat algorithm [21]. Computation of the rules requires two parameters: minimum support ("generality" of the large sets that the rules are based on) and minimum confidence ("accuracy" of the rule). The minimum support dictates the number of large sets, which is in our case equal to the size of the model. For the whole data set, the minimum support was 5 %, which was the smallest computationally feasible value in terms of memory consumption. For the models of the two groups it was set to 0.1 %, which was the smallest possible value such that the combined size of the two models did not exceed the size of the model for the whole data. The minimum confidence was set to 90 %, which is a typical value for the parameter in many applications.

¹ See http://www.adrem.ua.ac.be/~goethals/software/index.html

The rules were used for one-step prediction of the attribute values of the testing data. The data consisted of values selected and verified by human operators, but it is possible that these are not the only valid values. Nevertheless, predictions were ruled incorrect if they differed from these values. Computation times, memory consumption, and prediction accuracy for the whole data and the grouped data are shown in Table II. Grouping of the data both accelerated computation of the rules and improved the prediction accuracy. Also note that the combined size of the models of the two groups is only about 1/4 of the corresponding model for the whole data.

TABLE II
Summary of the results of the component data experiment. All the quantities for the grouped data are sums over the two groups. Also note that the size of the trie is in this particular application the same as the number of association rules.

                            Whole data   Grouped data
Computation time (s)        48           9.1
Size of trie (nodes)        9 863 698    2 707 168
Correct predictions (%)     57.5         63.8
Incorrect predictions (%)   3.7          2.9
Missing predictions (%)     38.8         33.3

The potential benefits of the IVGA in an application of this type are as follows. (1) It is possible to compute rules which yield better prediction results, because the rules are based on small amounts of data, i.e., it is possible to use a smaller minimum support for the grouped data. (2) Discretization of continuous variables, which is often a problem in applications of association rules, is automatically carried out by the mixture model. (3) Computation of the association rules may even be avoided altogether by using the mixture models of the groups as a basis for the predictions. Of these, (1) was demonstrated in the experiment whereas (2) and (3) remain a topic for future research.

C. Feature Selection for Supervised Learning: Ionosphere Data

In this experiment, we investigated whether the variable grouping ability could be used for feature selection for classification. One way to apply our IVGA model in this manner is to see which variables IVGA groups together with the class variable, and to use only these in the actual classifier.

We ran our IVGA algorithm 10 times for the Ionosphere data set [22], which contains 351 instances of radar measurements consisting of 34 attributes and a binary class variable. From the three groupings (runs) with the lowest cost, each variable that was grouped with the class variable at least once was included in the classification experiment. As a result, the following three features were chosen: {1, 5, 7}.

The classification was carried out using the k-nearest-neighbor (k-NN) classifier. Out of the 351 samples, 51 were used for testing and the rest for training. In each experiment, the testing and the training data sets were randomly drawn from the entire data set and normalized prior to classification. The averaged results of 1 000 different runs are shown in Fig. 7 for various (odd) values of k. For comparison, the same experiment was carried out using all the 34 variables. As can be seen, the set of three features chosen using IVGA produces clearly better classification accuracy than the complete set of features whenever k > 1. For example, for k = 5 the accuracy using IVGA was 89.6 % while for the complete set of features it was 84.8 %.

Fig. 7. Classification accuracies for the Ionosphere data using the k-NN classifier with all the variables (white markers) and with only the variables selected using IVGA (black markers).

Extensive benchmarking experiments using the Ionosphere data set that compare PCA and Random Projection for dimensionality reduction with a number of classifiers are reported in [23]. They also report accuracy in the original input space for each method. For 1-NN this value is 86.7 %, with 5-NN 84.5 %, and with a linear SVM classifier 87.8 %. The best result obtained using dimension reduction was 88.7 %. We used an identical test setting in our experiments with the difference that feature selection was performed using IVGA. Using the k-NN classifier we obtained better accuracies than any of the classifiers used in [23], including linear SVM, when they were applied in the original input space. Moreover, IVGA was able to improve somewhat even upon the best results that they obtained in the reduced-dimensional spaces. We also tested a nonlinear SVM with a Gaussian kernel using the same software² with the default settings that were used in [23]. For the entire data set the prediction accuracy was weak, only 66.1 %, whereas using the three variables selected by IVGA it was the best among all the results in the experiment, 90.7 %.

² See [24] and http://svmlight.joachims.org/.
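A sketch of the classification setup of this experiment using scikit-learn is given below (our choice of tools; the paper does not name an implementation). The selected feature indices correspond to the features {1, 5, 7} mentioned above, converted to 0-based indexing.

# Illustrative k-NN experiment with IVGA-selected features (not the authors' code).
# X is the 351 x 34 Ionosphere data matrix and y the binary class labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

SELECTED = [0, 4, 6]          # features {1, 5, 7} of the text, as 0-based indices

def knn_accuracy(X, y, k, feature_idx, test_size=51, rng=None):
    rng = rng or np.random.default_rng()
    perm = rng.permutation(len(y))
    test, train = perm[:test_size], perm[test_size:]
    scaler = StandardScaler().fit(X[train][:, feature_idx])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(scaler.transform(X[train][:, feature_idx]), y[train])
    return clf.score(scaler.transform(X[test][:, feature_idx]), y[test])

# Average over many random splits, as in the paper's setup:
# np.mean([knn_accuracy(X, y, k=5, feature_idx=SELECTED) for _ in range(1000)])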
A number of heuristic approaches to feature selection like forward, backward, and floating search methods (see e.g. [25]) exist and could have been used here as well. However, the goal of the experiment was not to find the best set of features but to demonstrate that the IVGA can reveal useful structure of the data.

VI. DISCUSSION

Many real-world problems and data sets can be divided into smaller, relatively independent subproblems. Automatic discovery of such divisions can significantly help in applying different machine learning techniques to the data by reducing the computational and memory requirements of processing. The IVGA principle calls for finding the divisions by partitioning the observed variables into separate groups so that the mutual dependencies between variables within a group are strong whereas mutual dependencies between variables in different groups are weaker.

In this paper, the IVGA principle has been implemented by a method that groups the input variables only. In the end, there may also exist interesting dependencies between the individual variable groups. One avenue for future research is to extend the grouping model into a hierarchical IVGA that is able to model the residual dependencies between the groups of variables.

From the perspective of using the method it would be useful to implement many different model types, including also linear models. This would allow the modeling of each variable group with the best model type for that particular sub-problem, depending on the types of dependencies within the problem. Such extensions naturally require the derivation of a cost function for each additional model family, but there are simple tools for automating this process [26], [27].

The stochastic nature of the grouping algorithm makes its computational complexity difficult to analyze. Empirically, the complexity of convergence to a neighborhood of a locally optimal grouping seems to be roughly quadratic with respect to both the number of variables and the number of data samples. In the case of the number of samples this is because the data does not exactly follow the mixture model and thus more mixture components are used when there are more samples. Convergence to the exact local optimum typically takes significantly longer, but it is usually not necessary as even nearly optimal results are often good enough in practice.

Although the presented IVGA model appears quite simple, several computational speedup techniques are needed for it to work efficiently enough. Some of these may be of interest in themselves, irrespective of the IVGA principle. In particular worth mentioning are the adaptive tuning of operation probabilities in the grouping algorithm (Sec. IV-B.1) as well as the model compression principle (Sec. IV-B.2).

By providing the source code of the method for public use we invite others both to use the method and to contribute to extending it. A MATLAB package of our IVGA implementation is available at http://www.cis.hut.fi/projects/ivga/.

VII. CONCLUSION

In this paper, we have presented the independent variable group analysis (IVGA) principle and a method for modeling data through mutually independent groups of variables. The approach has been shown to be useful in real-world problems: it decreases the computational burden of other machine learning methods and also increases their accuracy by letting them concentrate on the essential dependencies of the data.

The general nature of the IVGA principle allows many potential applications. The method can be viewed as a tool for compact modeling of data, an algorithm for clustering variables, or as a tool for dimensionality reduction and feature selection. All these interpretations allow for several practical applications.

Biclustering, the clustering of both variables and samples, is very popular in bioinformatics. In such applications it could be useful to ease the strict grouping of the variables of IVGA. This could be accomplished by allowing different partitions in different parts of the data set using, for instance, a mixture-of-IVGAs type of model. Hierarchical modeling of residual dependencies between the groups would be another interesting extension.

APPENDIX I
SPECIFICATION OF THE MIXTURE MODEL

A mixture model for the random variable x(t) can be written with the help of an auxiliary variable c(t) denoting the index of the active mixture component, as illustrated in the right part of Fig. 4. In our IVGA model, the mixture model for the variable groups is chosen to be as simple as possible for computational reasons. This is done by restricting the components p(x(t)|θ_i, H) of the mixture to be such that different variables are assumed independent. This yields

p(x(t) \mid H) = \sum_i p(x(t) \mid \theta_i, H) \, p(c(t) = i) = \sum_i p(c(t) = i) \prod_j p(x_j(t) \mid \theta_{i,j}, H),    (8)

where θ_{i,j} are the parameters of the ith mixture component for the jth variable. Dependencies between the variables are modeled only through the mixture. The variable c has a multinomial distribution with parameters π_c that have a Dirichlet prior with parameters u_c:

p(c(t) \mid \pi_c, H) = \mathrm{Multinom}(c(t); \pi_c)    (9)
p(\pi_c \mid u_c, H) = \mathrm{Dirichlet}(\pi_c; u_c).    (10)

The use of a mixture model allows for both categorical and continuous variables. For continuous variables the mixture is a heteroscedastic Gaussian mixture, that is, all mixture components have their own precisions. Thus

p(x_j(t) \mid \theta_{i,j}, H) = \mathcal{N}(x_j(t); \mu_{i,j}, \rho_{i,j}),    (11)

where µ_{i,j} is the mean and ρ_{i,j} is the precision of the Gaussian. The parameters µ_{i,j} and ρ_{i,j} have hierarchical priors

p(\mu_{i,j} \mid \mu_{\mu_j}, \rho_{\mu_j}, H) = \mathcal{N}(\mu_{i,j}; \mu_{\mu_j}, \rho_{\mu_j})    (12)
p(\rho_{i,j} \mid \alpha_{\rho_j}, \beta_{\rho_j}, H) = \mathrm{Gamma}(\rho_{i,j}; \alpha_{\rho_j}, \beta_{\rho_j}).    (13)

For categorical variables, the mixture is a simple mixture of multinomial distributions, so that

p(x_j(t) \mid \theta_{i,j}, H) = \mathrm{Multinom}(x_j(t); \pi_{i,j}).    (14)

The probabilities π_{i,j} have a Dirichlet prior

p(\pi_{i,j} \mid u_j, H) = \mathrm{Dirichlet}(\pi_{i,j}; u_j).    (15)
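For illustration, the mixture density of Eq. (8) can be evaluated as follows when point estimates of the parameters are plugged in (our sketch; the actual method of Appendix II maintains full posterior distributions over these parameters rather than point estimates).

# Illustrative evaluation of the single-group mixture density of Eq. (8).
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def log_mixture_density(x_cont, x_cat, weights, means, precisions, cat_probs):
    """log p(x | H) for one data point of a single variable group.

    x_cont: values of the continuous variables, shape (D_cont,)
    x_cat:  category indices of the nominal variables, length D_cat
    weights: mixture weights pi_c, shape (C,)
    means, precisions: Gaussian parameters, shape (C, D_cont)
    cat_probs: list of D_cat arrays of shape (C, S_j) with multinomial probabilities
    """
    C = len(weights)
    log_terms = np.log(weights)
    for i in range(C):
        # Continuous variables: independent Gaussians with component-specific precisions.
        log_terms[i] += norm.logpdf(x_cont, loc=means[i],
                                    scale=1.0 / np.sqrt(precisions[i])).sum()
        # Nominal variables: independent multinomials.
        log_terms[i] += sum(np.log(cat_probs[j][i, x_cat[j]])
                            for j in range(len(x_cat)))
    return logsumexp(log_terms)   # log sum_i pi_i * prod_j p(x_j | theta_ij)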
Combining the distributions of Eqs. (9)–(15) yields the joint probability of all parameters (here c = [c(1), \ldots, c(T)]^T):

p(D, c, \pi_c, \pi, \mu, \rho) = \Bigl[ \prod_t p(c(t) \mid \pi_c) \Bigr] p(\pi_c \mid u_c)
\prod_i \Bigl[ \prod_{j:\, x_j\ \mathrm{categorical}} p(\pi_{i,j} \mid u_j) \prod_{j:\, x_j\ \mathrm{continuous}} p(\mu_{i,j} \mid \mu_{\mu_j}, \rho_{\mu_j}) \, p(\rho_{i,j} \mid \alpha_{\rho_j}, \beta_{\rho_j}) \Bigr]
\prod_t \Bigl[ \prod_{j:\, x_j\ \mathrm{categorical}} p(x_j(t) \mid c(t), \pi_{\cdot,j}) \prod_{j:\, x_j\ \mathrm{continuous}} p(x_j(t) \mid c(t), \mu_{\cdot,j}, \rho_{\cdot,j}) \Bigr]    (16)

All the component distributions of this expression have been introduced above in Eqs. (11)–(15).

The corresponding variational approximation is

q(c, \pi_c, \pi, \mu, \rho) = q(c)\, q(\pi_c)\, q(\pi)\, q(\mu)\, q(\rho) = \Bigl[ \prod_t q(c(t) \mid w(t)) \Bigr] q(\pi_c \mid \hat{u}_c)
\prod_i \Bigl[ \prod_{j:\, x_j\ \mathrm{categorical}} q(\pi_{i,j} \mid \hat{u}_{i,j}) \prod_{j:\, x_j\ \mathrm{continuous}} q(\mu_{i,j} \mid \hat{\mu}_{\mu_{i,j}}, \hat{\rho}_{\mu_{i,j}}) \, q(\rho_{i,j} \mid \hat{\alpha}_{\rho_{i,j}}, \hat{\beta}_{\rho_{i,j}}) \Bigr]    (17)

with the factors

q(c(t)) = \mathrm{Multinom}(c(t); w(t))    (18)
q(\pi_c) = \mathrm{Dirichlet}(\pi_c; \hat{u}_c)    (19)
q(\pi_{i,j}) = \mathrm{Dirichlet}(\pi_{i,j}; \hat{u}_{i,j})    (20)
q(\mu_{i,j}) = \mathcal{N}(\mu_{i,j}; \hat{\mu}_{\mu_{i,j}}, \hat{\rho}_{\mu_{i,j}})    (21)
q(\rho_{i,j}) = \mathrm{Gamma}(\rho_{i,j}; \hat{\alpha}_{\rho_{i,j}}, \hat{\beta}_{\rho_{i,j}}).    (22)

Because of the conjugacy of the model, these are optimal forms for the components of the approximation, given the factorization. Specification of the approximation allows the evaluation of the cost of Eq. (6) and the derivation of update rules for the parameters as shown below in Appendix II. The hyperparameters µ_{µ_j}, ρ_{µ_j}, α_{ρ_j}, β_{ρ_j} are updated using maximum likelihood estimation. The parameters of the fixed Dirichlet priors are set to values corresponding to the Jeffreys prior.

APPENDIX II
DERIVATION OF THE COST FUNCTION AND UPDATE RULES

The cost function of Eq. (6) can be expressed, using ⟨·⟩ to denote expectation over q, as

\Bigl\langle \log \frac{q(\theta)}{p(D, \theta \mid H)} \Bigr\rangle = \langle \log q(\theta) \rangle - \langle \log p(\theta) \rangle - \langle \log p(D \mid \theta) \rangle.    (23)

Now, being expected logarithms of products of probability distributions over the factorial posterior approximation q, the terms easily split further. The terms of the cost function are presented as the costs of the different parameters and the likelihood term. Some of the notation used in the formulae is introduced in Table III.

TABLE III
Notation

Symbol     Explanation
C          Number of mixture components
T          Number of data points
D_cont     Number of continuous dimensions
S_j        The number of categories in nominal dimension j
u_0        The sum over the parameters of a Dirichlet distribution
I_k(x)     An indicator for x being of category k
Γ          The gamma function (not the distribution pdf)
Ψ          The digamma function, that is, Ψ(x) = d/dx ln Γ(x)
w_i(t)     The multinomial probability/weight of the ith mixture component in the w(t) of data point t

A. Terms of the Cost Function

\langle \log q(c \mid w) - \log p(c \mid \pi_c) \rangle = \sum_{t=1}^{T} \sum_{i=1}^{C} w_i(t) \Bigl( \log w_i(t) - [\Psi(\hat{u}_{c_i}) - \Psi(\hat{u}_{c_0})] \Bigr)    (24)

\langle \log q(\pi_c \mid \hat{u}_c) - \log p(\pi_c \mid u_c) \rangle = \sum_{i=1}^{C} \Bigl[ (\hat{u}_{c_i} - u_c)[\Psi(\hat{u}_{c_i}) - \Psi(\hat{u}_{c_0})] - \log\Gamma(\hat{u}_{c_i}) \Bigr] + \log\Gamma(\hat{u}_{c_0}) - \log\Gamma(u_{c_0}) + C \log\Gamma(u_c)    (25)

\langle \log q(\pi \mid \hat{u}) - \log p(\pi \mid u) \rangle = \sum_{j:\, x_j\ \mathrm{categorical}} \Biggl[ \sum_{i=1}^{C} \sum_{k=1}^{S_j} (\hat{u}_{i,j,k} - u_{j,k})[\Psi(\hat{u}_{i,j,k}) - \Psi(\hat{u}_{0_{i,j}})] + \sum_{i=1}^{C} \Bigl( \log\Gamma(\hat{u}_{0_{i,j}}) - \sum_{k=1}^{S_j} \log\Gamma(\hat{u}_{i,j,k}) \Bigr) + C \Bigl( -\log\Gamma(u_{0_j}) + \sum_{k=1}^{S_j} \log\Gamma(u_{j,k}) \Bigr) \Biggr]    (26)

\langle \log q(\mu \mid \hat{\mu}_\mu, \hat{\rho}_\mu) - \log p(\mu \mid \mu_\mu, \rho_\mu) \rangle = -\frac{C D_{\mathrm{cont}}}{2} + \sum_{j:\, x_j\ \mathrm{continuous}} \sum_{i=1}^{C} \Bigl[ \frac{1}{2} \log \frac{\hat{\rho}_{\mu_{i,j}}}{\rho_{\mu_j}} + \frac{\rho_{\mu_j}}{2} \bigl( \hat{\rho}_{\mu_{i,j}}^{-1} + (\hat{\mu}_{\mu_{i,j}} - \mu_{\mu_j})^2 \bigr) \Bigr]    (27)

\langle \log q(\rho \mid \hat{\alpha}_\rho, \hat{\beta}_\rho) - \log p(\rho \mid \alpha_\rho, \beta_\rho) \rangle = \sum_{j:\, x_j\ \mathrm{continuous}} \sum_{i=1}^{C} \Bigl[ \log\Gamma(\alpha_{\rho_j}) - \log\Gamma(\hat{\alpha}_{\rho_{i,j}}) + \hat{\alpha}_{\rho_{i,j}} \log \hat{\beta}_{\rho_{i,j}} - \alpha_{\rho_j} \log \beta_{\rho_j} + (\hat{\alpha}_{\rho_{i,j}} - \alpha_{\rho_j}) \bigl( \Psi(\hat{\alpha}_{\rho_{i,j}}) - \log \hat{\beta}_{\rho_{i,j}} \bigr) + \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \bigl( \beta_{\rho_j} - \hat{\beta}_{\rho_{i,j}} \bigr) \Bigr]    (28)

\langle -\log p(D \mid c, \pi_c, \pi, \mu, \rho) \rangle = \frac{T D_{\mathrm{cont}} \log(2\pi)}{2} + \sum_{t=1}^{T} \sum_{i=1}^{C} w_i(t) \Biggl\{ \sum_{j:\, x_j\ \mathrm{categorical}} - \bigl[ \Psi(\hat{u}_{i,j,x_j(t)}) - \Psi(\hat{u}_{0_{i,j}}) \bigr] + \frac{1}{2} \sum_{j:\, x_j\ \mathrm{continuous}} \Bigl[ \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \bigl( \hat{\rho}_{\mu_{i,j}}^{-1} + (x_j(t) - \hat{\mu}_{\mu_{i,j}})^2 \bigr) - \bigl( \Psi(\hat{\alpha}_{\rho_{i,j}}) - \log \hat{\beta}_{\rho_{i,j}} \bigr) \Bigr] \Biggr\}    (29)

B. On the Iteration Formulae and Initialization

The iteration formulae for one full iteration of mixture model adaptation consist of simple coordinate-wise re-estimations of the parameters. This is like expectation-maximization (EM) iteration. The update rules of the hyperparameters µ_{µ_j}, ρ_{µ_j}, α_{ρ_j} and β_{ρ_j} are based on maximum likelihood estimation.

Before the iteration, the mixture components are initialized using the data set and a pseudorandom seed number that is used to make the initialization stochastic but reproducible using the same random seed. The mixture components are initialized as equiprobable.

C. The Iteration Formulae

One full iteration cycle:

1) Update w:

w_i^*(t) \leftarrow \exp\Biggl( \Psi(\hat{u}_{c_i}) + \sum_{j:\, x_j\ \mathrm{categorical}} \bigl[ \Psi(\hat{u}_{i,j,x_j(t)}) - \Psi(\hat{u}_{0_{i,j}}) \bigr] - \frac{1}{2} \sum_{j:\, x_j\ \mathrm{continuous}} \Bigl[ \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \bigl( \hat{\rho}_{\mu_{i,j}}^{-1} + (x_j(t) - \hat{\mu}_{\mu_{i,j}})^2 \bigr) - \bigl( \Psi(\hat{\alpha}_{\rho_{i,j}}) - \log \hat{\beta}_{\rho_{i,j}} \bigr) \Bigr] \Biggr)

w_i(t) \leftarrow \frac{w_i^*(t)}{\sum_{i'=1}^{C} w_{i'}^*(t)}    (30)

2) Update û_c:

\hat{u}_{c_i} \leftarrow u_c + \sum_{t=1}^{T} w_i(t)    (31)

3) Update categorical dimensions of the mixture components:

\hat{u}_{i,j,k} \leftarrow u_{j,k} + \sum_{t=1}^{T} w_i(t) \, I_k(x_j(t))    (32)

4) Update continuous dimensions of the mixture components:

\hat{\mu}_{\mu_{i,j}} \leftarrow \frac{ \rho_{\mu_j} \mu_{\mu_j} + \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \sum_{t=1}^{T} w_i(t) x_j(t) }{ \rho_{\mu_j} + \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \sum_{t=1}^{T} w_i(t) }    (33)

\hat{\rho}_{\mu_{i,j}} \leftarrow \rho_{\mu_j} + \frac{\hat{\alpha}_{\rho_{i,j}}}{\hat{\beta}_{\rho_{i,j}}} \sum_{t=1}^{T} w_i(t)    (34)

\hat{\alpha}_{\rho_{i,j}} \leftarrow \alpha_{\rho_j} + \frac{1}{2} \sum_{t=1}^{T} w_i(t)    (35)

\hat{\beta}_{\rho_{i,j}} \leftarrow \beta_{\rho_j} + \frac{1}{2} \sum_{t=1}^{T} w_i(t) \bigl( \hat{\rho}_{\mu_{i,j}}^{-1} + (\hat{\mu}_{\mu_{i,j}} - x_j(t))^2 \bigr)    (36)

ACKNOWLEDGMENT

We would like to thank Zoubin Ghahramani for interesting discussions. We also wish to thank Valor Computerized Systems (Finland) Oy for providing us with the data used in the printed circuit board assembly experiment. This work was supported in part by the Finnish Centre of Excellence Programme (2000–2005) under the project New Information Processing Principles, and by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.
REFERENCES

[1] K. Lagus, E. Alhoniemi, and H. Valpola, “Independent variable group analysis,” in International Conference on Artificial Neural Networks - ICANN 2001, ser. LNCS, G. Dorffner, H. Bischof, and K. Hornik, Eds., vol. 2130. Vienna, Austria: Springer, August 2001, pp. 203–210.
[2] K. Lagus, E. Alhoniemi, J. Seppä, A. Honkela, and P. Wagner, “Independent variable group analysis in learning compact representations for data,” in Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), T. Honkela, V. Könönen, M. Pöllä, and O. Simula, Eds., Espoo, Finland, June 2005, pp. 49–56.
[3] J.-F. Cardoso, “Multidimensional independent component analysis,” in
Proceedings of ICASSP’98, Seattle, 1998.
[4] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description
length and Helmholtz free energy,” in Neural Information Processing
Systems 6, J. et al, Ed. San Mateo, CA: Morgan Kaufmann, 1994.
[5] R. S. Zemel, “A minimum description length framework for unsuper-
vised learning,” Ph.D. dissertation, University of Toronto, 1993.
[6] K. Viikki, E. Kentala, M. Juhola, I. Pyykkö, and P. Honkavaara,
“Generating decision trees from otoneurological data with a variable
grouping method,” Journal of Medical Systems, vol. 26, no. 5, pp. 415–
425, 2002.
[7] A. Tucker, S. Swift, and X. Liu, “Variable grouping in multivariate
time series via correlation,” IEEE Transactions on Systems, Man and
Cybernetics, Part B, vol. 31, no. 2, pp. 235–245, 2001.
[8] E. Segal, D. Pe’er, A. Regev, D. Koller, and N. Friedman, “Learning
module networks,” Journal of Machine Learning Research, vol. 6, pp.
557–588, April 2005.
[9] Y. Cheng and G. M. Church, “Biclustering of expression data,” in Pro-
ceedings of the Eighth International Conference on Intelligent Systems
for Molecular Biology (ISMB), 2000, pp. 93–103.
[10] S. C. Madeira and A. L. Oliveira, “Biclustering algorithms for biological
data analysis: A survey,” IEEE/ACM Transactions on Computational
Biology and Bioinformatics, vol. 1, no. 1, pp. 24–45, 2004.
[11] M. Studený and J. Vejnarová, “The multiinformation function as a tool
for measuring stochastic dependence,” in Learning in Graphical Models,
M. Jordan, Ed. Cambridge, MA, USA: The MIT Press, 1999, pp. 261–
297.
[12] M. Nilsson, H. Gustafsson, S. V. Andersen, and W. B. Kleijn, “Gaussian
mixture model based mutual information estimation between frequency
bands in speech,” in Proceedings of the IEEE International Conference
on Acoustics, Speech and Signal Processing 2002 (ICASSP ’02), vol. 1,
2002, pp. I–525–I–528.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. New
York: Wiley, 1991.
[14] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An
introduction to variational methods for graphical models,” in Learning
in Graphical Models, M. Jordan, Ed. Cambridge, MA, USA: The MIT
Press, 1999, pp. 105–161.
[15] B. J. Frey and G. E. Hinton, “Efficient stochastic source coding and
an application to a Bayesian network source model,” The Computer
Journal, vol. 40, no. 2/3, pp. 157–165, 1997.
[16] J. Rissanen, “Modeling by shortest data description,” Automatica,
vol. 14, no. 5, pp. 465–471, 1978.
[17] A. Honkela and H. Valpola, “Variational learning and bits-back coding:
an information-theoretic view to Bayesian learning,” IEEE Transactions
on Neural Networks, vol. 15, no. 4, pp. 800–810, 2004.
[18] G. McLachlan and D. Peel, Finite Mixture Models. New York: Wiley,
2000.
[19] D. J. C. MacKay, Information Theory, Inference, and Learning Al-
gorithms. Cambridge: Cambridge University Press, 2003.
[20] E. Alhoniemi, T. Knuutila, M. Johnsson, J. Röyhkiö, and O. S.
Nevalainen, “Data mining in maintenance of electronic component
libraries,” in Proceedings of the IEEE 4th International Conference on
Intelligent Systems Design and Applications, vol. 1, 2004, pp. 403–408.
[21] M. J. Zaki, “Scalable algorithms for association mining,” IEEE Transac-
tions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372–390,
2000.
[22] C. L. Blake and C. J. Merz, “UCI repository of machine learning
databases,” 1998, URL: http://www.ics.uci.edu/~mlearn/MLRepository.
html.
[23] D. Fradkin and D. Madigan, “Experiments with random projections for machine learning,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM Press, August 24-27 2003, pp. 517–522.
[24] T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. MIT Press, 1999.
[25] S. Theodoridis and K. Koutroumbas, Pattern Recognition, 2nd ed. Academic Press, 2003.
[26] J. Winn and C. M. Bishop, “Variational message passing,” Journal of Machine Learning Research, vol. 6, pp. 661–694, April 2005.
[27] M. Harva, T. Raiko, A. Honkela, H. Valpola, and J. Karhunen, “Bayes Blocks: An implementation of the variational Bayesian building blocks framework,” in Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 2005), Edinburgh, Scotland, 2005, pp. 259–266.