If you find the TeX codes below hard to read, the typeset version may be found in pp. 16–19 of this PDF file.
Christos Boutsidis Rensselaer Polytechnic Institute AN IMPROVED APPROXIMATION ALGORITHM FOR THE COLUMN SUBSET SELECTION PROBLEM Given an $m \times n$ matrix $A$ and an integer $k$, with $k \ll n$, the Column Subset Selection Problem (CSSP) is to select $k$ columns of $A$ to form an $m \times k$ matrix $C$ that minimizes the residual $\lVert A - CC^{+} A \rVert_{\xi}$, for $\xi=2$ or $F$. Here $C^{+}$ denotes the pseudoinverse of the matrix $C$. This combinatorial optimization problem has been exhaustively studied in the numerical linear algebra and the theoretical computer science communities. Current state-of-the-art approximation algorithms guarantee \begin{align*} \lVert A-CC^{+} A \rVert_2 &\le O(k^{1/2}(n-k)^{1/2}) \lVert A-A_k \rVert_2,\\ \lVert A-CC^{+} A \rVert_F &\le \sqrt{(k+1)!} \lVert A-A_k \rVert_F. \end{align*} Here, we present a new approximation algorithm which guarantees that with high probability \begin{align*} \lVert A-CC^{+} A \rVert_2 &\le O(k^{3/4}\log^{1/2}(k)(n-k)^{1/4}) \lVert A-A_k \rVert_2,\\ \lVert A-CC^{+} A \rVert_F &\le O(k \sqrt{\log k}) \lVert A-A_k \rVert_F, \end{align*} thus \emph{asymptotically} improves upon the previous results. $A_k$ is the best-rank-k approximation to $A$, computed with the SVD. Also note that if $A=U \Sigma V^T$ is the SVD of a matrix $A$, then $A^{+}=V \Sigma^{-1}U^T$. We also present an example on how this algorithm can be utilized for a low rank approximation of a data matrix. This is joint work with Michael Mahoney and Petros Drineas Pavel Dmitriev Yahoo! Inc. MACHINE LEARNING APPROACH TO GENERALIZED GRAPH PATTERN MINING There has been a lot of recent interest in mining patterns from graphs. Often, the exact structure of the patterns of interest is not known. This happens, for example, when molecular structures are mined to discover fragments useful as features in chemical compound classification task, or when web sites are mined to discover sets of web pages representing logical documents. Such patterns are often generated from a few small subgraphs (cores), according to certain generalization rules (GRs). We call such patterns ``generalized patterns''(GPs). While being structurally slightly different, GPs often perform the same function in the network. Previously proposed approaches to mining GPs either assumed that the cores and the GRs are given, or that all interesting GPs are frequent. These are strong assumptions, which often do not hold in practical applications. In this paper, we propose an approach to mining GPs that is free from the above assumptions. Given a small number of GPs selected by the user, our algorithm discovers all GPs similar to the user examples. First, a machine learning-style approach is used to find the cores. Second, generalizations of the cores in the graph are computed to identify GPs. Evaluation on synthetic and real-world datasets demonstrates the promise as well as highlights some problems with our approach. Charles Elkan University of California, San Diego AN EXPERIMENTAL COMPARISON OF RANDOMIZED METHODS FOR LOW-RANK MATRIX APPROXIMATION Many data mining and knowledge discovery applications that use matrices focus on low-rank approximations, including for example latent semantic indexing (LSI). The singular value decomposition (SVD) is one of the most popular low-rank matrix approximation methods. However, it is often considered intractable for applications involving massive data. Recent research has addressed this problem, with several randomized methods proposed to compute the SVD. We review these techniques and present the first empirical study of some of them. The projection-based approach of Sarlos appears best in terms of accuracy and runtime, although a sampling approach of Drineas, Kannan, and Mahoney also performs favorably. No method offers major savings when the input matrix is sparse. Since most massive matrices used in knowledge discovery, for example word-document matrices, are sparse, this finding reveals an important direction for future research. This is joint work with Aditya Krishna Menon, University of California, San Diego. Sreenivas Gollapudi Microsoft Research, Silicon Valley ESTIMATING PAGERANK ON GRAPH STREAMS This study focuses on computations on large graphs (e.g., the web-graph) where the edges of the graph are presented as a stream. The objective in the streaming model is to use small amount of memory (preferably sub-linear in the number of nodes $n$) and a few passes. In the streaming model, we show how to perform several graph computations including estimating the probability distribution after a random walk of length $l$, mixing time, and the conductance. We estimate the mixing time $M$ of a random walk in $\widetilde{O}(n\alpha+M\alpha\sqrt{n}+\sqrt{Mn/\alpha})$ space and $\widetilde{O}(\sqrt{M/\alpha})$ passes. Furthermore, the relation between mixing time and conductance gives us an estimate for the conductance of the graph. By applying our algorithm for computing probability distribution on the web-graph, we can estimate the {\em PageRank} $p$ of any node up to an additive error of $\sqrt{\epsilon p}$ in $\widetilde{O}(\sqrt{M/\alpha})$ passes and$ \widetilde{O}\Bigl( \min\Bigl\{n\alpha+\frac{1}{\epsilon}\sqrt{M/\alpha}+\frac{1}{\epsilon} M\alpha, \alpha n\sqrt{M\alpha} + \frac{1}{\epsilon}\sqrt{M/\alpha} \Bigr\}\Bigr)$ space, for any $\alpha \in (0,1]$. In particular, for $\epsilon = M/n$, by setting $\alpha = M^{-1/2}$, we can compute the approximate PageRank values in $\tilde{O}(nM^{-1/4})$ space and $\widetilde{O}(M^{3/4})$ passes. In comparison, a standard implementation of the PageRank algorithm will take $O(n)$ space and $O(M)$ passes. Mohammad Al Hasan Rensselaer Polytechnic Institute UNIFORM GENERATION AND PATTERN SUMMARIZATION Frequent Pattern Mining (FPM) is a mature topic in data mining with many efficient algorithms to mine patterns with variable complexities, such as set, sequence, tree, graph, and etc. But, the usage of FPM in real-life knowledge discovery systems is considerably low in comparison to their potentital. The prime reason is the lack of interpretability caused from the enormity of output-set size. A modest size graph database with thousands graphs can produce millions of frequent graph patterns with a reasonable support value. So, the recent research direction in FPM has shifted from obtaining the total set to generate a representative (summary) set. Most of the summmarization approaches that is proposed recently are post-processor; they adopt some form of clustering algorithm to generate a smaller representative pattern-set from the collection of total set of frequent patterns (FP). But, post-processing requires to obtain all the members of FP; which, unfortunately, is not practically feasible for many real-world data set. In this research, we consider an alternative viewpoint of representativeness. It is based on uniform and identical sampling, the most fundamental theory in data analysis and statistics. Our representative patterns are obtained with uniform probability from the pool of all frequent maximal patterns. Along this idea, we first discuss a hypothetic algorithm for uniform generation that uses a \#P-oracle. However, such an oracle is difficult to obtain as it is related to the counting of maximal frequent patterns, which was shown to be \#P-complete. Hence, we propose a random-walk based approach that traverse the FP partial order randomly with prescribed transition matrix. On convergence, the random walk visits each maximal pattern with a uniform probability. Since, the number of maximal patterns, generally, are very low compared to the search space, the idea of importance sampling is used to guide the random walk to visit the maximal patterns more frequently. Ling Huang Intel Research APPROXIMATE SUPPORT VECTOR MACHINES FOR DISTRIBUTED SYSTEM The computational and/or communication constraints associated with processing large-scale data sets using support vector machines (SVM) in contexts such as distributed networking systems are often prohibitively high, resulting in practitioners of SVM learning algorithms having to apply the algorithm on approximate versions of the kernel matrix induced by a certain degree of data reduction. In this research project, we study the tradeoffs between data reduction and the loss in an algorithm's classification performance. We introduce and analyze a consistent estimator of the SVM's achieved classification error, and then derive approximate upper bounds on the perturbation on our estimator. The bound is shown to be empirically tight in a wide range of domains, making it practical for the practitioner to determine the amount of data reduction given a permissible loss in the classification performance. Xuhui Huang Stanford University CONVERGENCE OF PROTEIN FOLDING FREE ENERGY LANDSCAPE VIA APPLICATION OF GENERALIZED ENSEMBLE SAMPLING METHODS ON A DISTRIBUTED COMPUTING ENVIRONMENT The Replica Exchange Method (REM) and Simulated Tempering (ST) enhanced sampling algorithms have become widely used in sampling conformation space of bimolecular systems. We have implemented a serial version of REM (SREM) and ST in the heterogeneous Folding@home distributed computing environment in order to calculate folding free energy landscapes. We have applied both algorithms to the 21 residue Fs peptide, and SREM to a 23 residue BBA5 protein. For each system, we find converged folding free energy landscapes for simulations started from near.native and fully unfolded states. We give a detailed comparison of SREM and ST and find that they give equivalent results in reasonable agreement with experimental data. Such accuracy is made possible by the massive parallelism provided by Folding@home, which allowed us to run approximately five thousand 100ns simulations for each system with each algorithm, and generates more than a million configurations. Our extensive sampling shows that the AMBER03 force field gives better agreement with experimental results than previous versions of the force field. Prateek Jain University of Texas, Austin METRIC/KERNEL LEARNING WITH APPLICATIONS TO FAST SIMILARITY SEARCH Distance and kernel learning are increasingly important for several machine learning applications. However, most existing distance metric algorithms are limited to learning metrics over low-dimensional data, while existing kernel learning algorithms are limited to the transductive setting and thus do not generalize to new data points. In this work, we focus on the Log Determinant loss due to its computational and modelling properties, and describe an algorithm for distance and kernel learning that is simple to implement, scales to very large data sets, and outperforms existing techniques. Unlike most previous methods, our techniques can generalize to new points, and unlike most previous metric learning methods, they can work with high-dimensional data. We demonstrate our learning approach applied to large-scale problems in computer vision, specifically image search. Pentti Kanerva Stanford University RANDOM INDEXING: SIMPLE PROJECTIONS FOR ACCUMULATING SPARSELY DISTRIBUTED FREQUENCIES We have applied a simple form of random projections, called Random Indexing, to English text, to encode the contexts of words into high-dimensional vectors. The context vectors for words can be used in natural-language research and text search, for example. Random Indexing is particularly suited for massive data that keeps on accumulating---it is incremental. It compresses a large sparse $M \times N$ matrix, with $M$ and $N$ growing into the billions, into a much smaller matrix of fixed size with little information loss. Possible applications include network connectivity analysis: who is talking to whom? Simon Lacoste-Julien University of California, Berkeley DISCLDA: DISCRIMINATIVE LEARNING FOR DIMENSIONALITY REDUCTION AND CLASSIFICATION Probabilistic topic models (and their extensions) have become popular as models of latent structures in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood estimation, an approach which may be suboptimal in the context of an overall classification problem. In this work, we describe DiscLDA, a discriminative learning framework for such models as Latent Dirichlet Allocation (LDA) in the setting of dimensionality reduction with supervised side information. In DiscLDA, a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood using Monte Carlo EM. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroup document classification task. This is joint work with Fei Sha and Michael Jordan. Taesup Moon Stanford University DISCRETE DENOISING WITH SHIFTS We introduce S-DUDE, a new algorithm for denoising DMC-corruputed data. The algorithm extends the recently introduced DUDE (Discrete Universal DEnoiser) of Weissman et. al., and aims to compete with a genie that has access to both noisy and clean data, and can choose to switch, up to $m$ times, between sliding window denoisers in a way that minimizes the overall loss. The key issue is to learn the best segmentation of data and the associated denoisers of the genie, only based on the noisy data. Unlike the DUDE, we treat cases of one- and two-dimensional data separately due to the fact that the segmentation of two-dimensional data is significantly more challenging than that of one-dimensional data. We use dynamic programming and quadtree decomposition in devising our scheme for one- and two-dimensional data, respectively, and show that, regardless of the underlying clean data, our scheme achieves the performance of the genie provided that $m$ is sub-linear in the sequence length $n$ for one-dimensional data and $m\ln m$ is sub-linear in $n$ for two-dimensional data. We also prove the universal optimality of our scheme for the case where the underlying data are (spatially) stationary. The resulting complexity of our scheme is linear in both $n$ and $m$ for one-dimensional data, and linear in $n$ and exponential in $m$ for two-dimensional data. We also provide a sub-optimal scheme for two-dimensional data that has linear complexity in both $n$ and $m$, and present some promising experimental results involving this scheme. Joint work with Professor Tsachy Weissman (Stanford). Ramesh Nallapati Stanford University JOINT LATENT TOPIC MODELS FOR TEXT AND CITATIONS In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Link-LDA model combines the ideas of LDA and Mixed Membership Block Stochastic Models and allows modeling arbitrary link structure. However, the model is computationally expensive, since it involves modeling the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model. Our experiments on a subset of Citeseer data show that both these models are able to predict unseen data better than the base-line model of Erosheva and Lafferty, by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets on the link prediction task show that the Link-PLSA-LDA model performs the best on the citation prediction task, while also remaining highly scalable. In addition, we also present some interesting visualizations generated by each of the models. Hariharan Narayanan University of Chicago ANNEALED ENTROPY, HEAVY TAILED DATA AND MANIFOLD LEARNING In many applications, the data that requires classification is very high dimensional. In the absence of additional structure, this is a hard problem. We show that the task of classification can tractable in two special cases of interest, namely \textbf{1.}~when data lies on a low dimensional submanifold on which it has a bounded density, and \textbf{2.}~when the data has a certain heavy tailed character. More precisely, we consider the following questions: \textbf{1.}~What is the sample complexity of classifying data on a manifold when the class boundary is smooth? \textbf{2.}~What is the sample complexity of classifying data with Heavy-Tailed characteristics in a Hilbert Space $H$ using a linear separator? In both cases, the VC dimension of the target function class is infinite, and so distribution free learning is impossible. In the first case, we provide sample complexity bounds that depend on the maximum density, curvature parameters and intrinsic dimension but are independent of the ambient dimension. In the second case, we obtain dimension independent bounds if there exists a basis ${e_1, e_2, \ldots}$ of $H$ such that the probability that a random data point does not lie in the span of ${e_1, \dots, e_k}$ is bounded above by $C k^{-a}$ for some positive constants $C$ and $a$. Cases \textbf{1} and \textbf{2} are based on joint work with Partha Niyogi and Michael Mahoney, respectively. Stefan Pickl Naval Postgraduate School, Monterey and Bundeswehr University, Munich ALGORITHMIC DATA ANALYSIS WITH POLYTOPES: A COMBINATORIAL APPROACH A research area of central importance in computational biology, biotechnology --- and life sciences at all --- is devoted to modeling, prediction and dynamics of complex expression patterns. However, as clearly understood in these days, this enterprise cannot be investigated in a satisfying way without taking into account the role of the environment in its widest sense: Complex Data Analysis, Statistical Behaviour and Algorithmic Approaches are in the Main Center of Interest: To a representation of past, present and most likely future states, this contribution also acknowledge the presence of measurement errors and further uncertainties. This poster contribution surveys and improves recent advances in understanding the mathematical foundations and interdisciplinary implications of the newly introduced gene-environment networks. Moreover, it integrates the important theme carbon dioxide emission reduction into the context of our networks and their dynamics. Given the data from DNA microarray experiments and environmental records, we extract nonlinear ordinary differential equations which contain parameters that have to be determined. This is done by some modern kinds of (so-called generalized Chebychev) approximation and (so-called generalized semi-infinite) optimization. After this is provided, time-discretized dynamical systems are studied. Here, a combinatorial algorithm constructing and following polyhedra sequences, allows to detect the region of parametric (in)stability. Finally, we analyze the topological landscape of gene-environment networks in its structural stability. With the example of $CO_2$-emission control and some further statistical perspectives we conclude. Jon Shlens Miller Institute, University of California, Berkeley and Salk Institute for Biological Studies EXPLORING LARGE NETWORKS OF REAL NEURONS USING MARKOV RANDOM FIELDS AND GENERALIZED LINEAR MODELS All visual signals from the eye to the brain originate in the electrical activity of a population of cells in the retina, termed retinal ganglion cells (RGCs). Standard models implicitly assume that RGCs signal visual information independently of one another. However, several studies have demonstrated that significant correlated activity amongst RGCs may fundamentally alter visual signals. We recorded the electrical activity of several hundred RGCs in retina to explore how network interactions effect the encoding of visual information. To test whether concerted firing can be explained by pairwise interactions, we used a maximum entropy approach borrowed from statistical mechanics to predict concerted activity. The nearest neighbor Ising model accurately reproduced the data, revealing that higher-order correlations have minimal impact on network activity within a neural population. The nearest-neighbor structure was explored further by employing a generalized linear model to additionally capture the dynamics and stimulus dependencies of the population of neurons. We employed MCMC techniques to construct a Bayesian estimate of the visual stimulus. We found that network activity contributed a valuable visual signal: ignoring network activity sacrifices roughly 20\% of the visual signal. Future investigations will examine how these results scale to a complete recording of several hundred neurons in the presence to stochastic stimuli matched to the statistics of the natural environment. Joint work with Jonathan Pillow, Liam Paninski, Greg Field, Jeff Gauthier, Martin Greschner, Alexander Sher, Alan Litke, Eero Simoncelli and E.J.~Chichilnisky. Jian Sun Stanford University GLOBAL AND LOCAL INTRINSIC SYMMETRY DETECTION Within the general framework of analyzing the properties of shapes which are independent of the shape~Rs pose, or embedding, we have developed a novel method for efficiently computing global symmetries of a shape which are invariant up to isometry preserving transformations. Our approach is based on the observation that the intrinsic symmetries of a shape are transformed into the Euclidean symmetries in the signature space defined by the eigenfunctions of the Laplace-Beltrami operator. We devise an algorithm which detects and computes the isometric mappings from the shape onto itself. The aforementioned method works very well when the whole object is intrinsically symmetric. However, in practice objects are often partially symmetric. To tackle this problem, we have developed an algorithm that detects partial intrinsic self-similarities of shapes. Our method models the heat flow on the object and uses the heat distribution to characterize points and regions on the shape. Using this method we are able to mathematically quantify the similarity between points at certain scale, and determine, in particular, at which scale the point or region becomes unique. This method can be used in data analysis and processing, visualization and compression. John Wu Lawrence Berkeley National Lab FASTBIT: AN EFFICIENT INDEXING TOOL FOR MASSIVE DATA FastBit is a software tool for searching large read-only datasets. It organizes user data in a column-oriented structure which is efficient for on-line analytical processing (OLAP), and utilizes compressed bitmap indexes to further speed up query processing. We have proven the compressed bitmap index used in FastBit to be theoretically optimal for one-dimensional queries. Compared with other optimal indexing methods, bitmap indexes are superior because they can be efficiently combined to answer multi-dimensional queries whereas other optimal methods can not. In this poster, we show that FastBit answers queries 10 times faster than a commercial database system, and highlight an application where FastBit sped up the molecular docking software hundreds of times. Zuobing Xu University of California, Santa Cruz DIRICHLET COMPOUND MULTINOMIAL MODELS FOR RETRIEVAL AND ACTIVE RELEVANCE FEEDBACK We describe new models, including those based on the Dirichlet Compound Multinomial (DCM) distribution and for information retrieval, relevance feedback, and active learning. We first indicate how desirable properties such as search engine ranking being concave in word repetition can be obtained through the use of the DCM model. We also describe several effective estimation methods. Reduction of the state space is achieved through the use of relevance and non-relevance sets. Effective capture of negative feedback and modeling of overlap terms between negative and positive feedback (relevant) documents is thus enabled. We conclude with a discussion of the computation efficiency and demonstration of the very good performance of these new methods on TREC data sets. This work also enables the efficient use of the original probabilistic ranking approach, while incorporating estimation methods developed in the language model framework. This is joint work with Ramakrishna Akella, University of California, Santa Cruz.