Data Management & Analysis
Visual Data Mining and Classification Software for Mass Spectrum Data
PIs: Klaus Mueller, SBU, Alla Zelenyuk, PNL
Visualization is a powerful paradigm to present, explore, and analyze a huge amount of data in a comprehensible way on a single display. We have so far developed a set of visual tools and techniques tailored to the exploration of aerosol data, and have coupled these with statistical data analysis routines to group, organize, and mine the data to discover trends, unique events, associations, and patterns. However, the tools and techniques developed are general enough to be applicable for a wide range of data mining applications, far beyond that of mass spectral data. In the past years of this effort we have developed:
1) an interactive focus+context viewer for the hierarchy of classified mass spectrum data,
2) a variety of tools and additional interactive windows that show the composition of the hierarchy nodes, and
3) a viewer for multi-attribute hierarchical time-series.
In the upcoming year we will develop:
1) a visualization tool that groups the class hierarchy nodes with respect to their distance in higher-dimensional space and also shows their individual statistics in terms of the participating chemicals,
2) the ability to incorporate more sophisticated clustering and classification methods based on machine-learning, and
3) an interface and various means to allow scientists to inject their domain knowledge, expertise, and intuition, to control and steer the clustering process.
Together, these will significantly enhance and complement our set of visualization tools and paradigms. They will also enable our domain scientists to point out misclassified spectra, which in turn helps train the system to recognize these patterns. (Battelle Pacific Northwest National Laboratory)
Dynamic Approach for Face Recognition using Digital Image Skin Correlation
PIs: Klaus Mueller, Miriam Rafailovich
With the recent emphasis on homeland security, there is an increased interest in accurate and non-invasive techniques for face recognition. Most current techniques perform a structural analysis of facial features from still images. Video-based techniques have also been developed recently, but they suffer from low image quality. We propose a new method for face recognition, called Digital Image Skin Correlation (DISC), which is based on dynamic instead of static facial features. DISC tracks the fine-scale motion of the face during a facial expression and obtains a vector field that characterizes the deformation of the face. Since it is almost impossible to imitate another person’s facial expressions, these deformation fields are expected to be unique to an individual. To test the performance of our method in face recognition scenarios, we conducted experiments in which we presented individuals wearing heavy make-up as a disguise to our DISC matching framework. The results show superior face recognition performance when compared to the popular PCA+LDA method, which is based on still images. (New York State Strategic Center for Port and Maritime Security, SUNY Maritime)
A Deductive Engine for the Semantic Web
PIs: M. Kifer, Annie Liu, C.R. Ramakrishnan and I.V. Ramakrishnan
This project is developing a deductive engine that will serve as the logical foundation for the emerging infrastructure of the Semantic Web. It is based on our XSB Tabled Logic Programming System. The following research problems are being addressed in order to develop the deductive engine: 1. Design of algorithms for solving a long-standing problem of inconsistency between tabling and update operations. 2. Development of algorithms that enable engine-level support for reasoning that involves equality maintenance. 3. Design of an architecture for supporting explanation-based reasoning at the engine level. 4. Development of answer tabling strategies where tables can be partially stored in secondary memory. (NSF)
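The idea behind tabling can be pictured with a small sketch: answers to a recursive query are recorded in a table, so each answer is derived once and evaluation terminates even on cyclic data. The Python below is only an illustration of that idea, not XSB code; the `edges` relation and the function name `reachable` are hypothetical.

```python
from collections import defaultdict

def reachable(edges, start):
    """Answers to the query path(start, Y) over a possibly cyclic edge relation."""
    succ = defaultdict(set)
    for a, b in edges:
        succ[a].add(b)
    table = set()            # answer table for the subgoal path(start, Y)
    worklist = [start]
    while worklist:
        node = worklist.pop()
        for nxt in succ[node]:
            if nxt not in table:      # new answer: record it and propagate
                table.add(nxt)
                worklist.append(nxt)
    return table

# Toy relation with a cycle a -> b -> c -> a; naive recursion would loop forever.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
answers = reachable(edges, "a")
```

A table computed this way would go stale if `edges` were changed afterwards, which is one simple way to picture the kind of tabling/update inconsistency mentioned in item 1.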
Content-Driven Techniques for Non-Visual Web Access
PI: I.V. Ramakrishnan
The World Wide Web has evolved into an indispensable medium for dissemination of information, entertainment, commerce and education. However, the graphical nature of most browsing software as well as the diversity and complexity of web content has limited access to this technology for an entire community of persons with visual disabilities. Existing audio browsers that are based on text-to-speech conversion (e.g. screen readers) are not capable of describing the conceptual organization of a document's content or of letting a user select parts of a document to listen to. As a result, persons with visual disabilities can find it difficult to understand the organization of documents (such as being able to distinguish topics, correlate similar items, etc.), and waste considerable time and attention listening to irrelevant information. This project is developing HearSay, a system that will bring the browsing experience of persons with visual disabilities closer to that of sighted people. HearSay will be based on automated techniques for structuring the content of web documents into labeled partitions consisting of logically related items. By enabling interactive speech-driven guided exploration, in which the system presents the document's labeled content, and the user selects which parts of the content to listen to and when to navigate to a new page, HearSay will make non-visual browsing far less cumbersome. Furthermore, for repetitive browsing tasks, HearSay will let users create and retrieve personalized content in different ways, ranging from content-based voicemarking of selected partitions in a page to powerful personal information assistants that gather and present user-defined information at the user's command. The ability to browse the Web using alternative modalities, as will be facilitated by HearSay, will offer significant benefits not only to users with disabilities, but also to mobile users of hand-held devices. (NSF)
A System for Discovering Bioengineered Threats by Knowledge Base Driven Mining of Toxin Data
PIs: I.V. Ramakrishnan and M. Kifer, Subramanyam Swaminathan, BNL
The overall goal of this work is to establish a Toxin Knowledge Base, a bioinformatics resource primarily focused on molecular information about toxins and other virulence factors that are the natural products of biological and potential biological warfare agents. The resource will be mined to assimilate, synthesize, analyze and disseminate genomic and structural information on genes of these agents and their products. Advances in recombinant DNA technology have opened up possibilities for production of bioengineered pathogens or their products on scales that could make them formidable weapons of bioterrorism. Chimeric molecules form another kind of threat, wherein the virulent domain of a toxin is hidden in what is otherwise a non-pathogenic protein. In this project we propose to collect all relevant information pertaining to toxins at the molecular level and expand the existing Toxin Knowledge Base to identify potential virulence factors. Using advanced machine learning and data mining, we will mine the database to look for motifs, to design new experiments, and to predict the structure and function of molecules (including putative chimeras) for which these data are not available. Knowledge learned from this and similar analyses will be encoded as rules in an expert system. Both the database and its front-end expert system will be used for analyzing genomic data to identify specific regions that encode factors that contribute to virulence. (US Army Medical Research and Materiel Command)
Sequence Assembly for High-Throughput Technologies
PI: Steven Skiena
Significantly cheaper de novo sequencing technologies must be developed to fully understand the diversity of life. Although the technologies currently being developed focus on the easier problem of resequencing to study human variation, we are convinced that coupling them with significant advances in computational sequence assembly has the potential to dramatically reduce the cost of de novo sequencing as well. We are building assembly programs for sequencing technologies slated to become commercially available over the next one to three years. Certain technologies achieve their cost efficiency by producing massive numbers of very short reads cheaply, while others provide extremely long reads (albeit with high error rates). Our research applies mathematical analysis, algorithm design, and software implementation to the assembly problem for new sequencing technologies:
1) Technological Analysis -- We are studying the tradeoffs between read length, sequencing error rate, and coverage for both of these models to determine the technological parameters where de novo sequencing becomes achievable. Such analysis is essential for aiming technology development in the most productive directions and for designing experimental protocols.
2) Algorithm Design -- Today's sequence assembly algorithms are based on certain assumptions of read length and base-error rate which will not hold for these new technologies. The paradigms of sequence assembly will shift dramatically in the face of the extremely high coverage of the new technologies, becoming more statistical in flavor. We are developing the new classes of algorithms needed for tomorrow's assemblers.
3) Assembler Development -- Finally, we are building sequence assemblers suitable for both classes of technologies and capable of scaling up to mammalian-sized genomes. Open source assemblers will track sequencing technologies as they develop, to ensure that de novo sequencing experiments will be performed as soon as the technology permits.
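The read-length and coverage tradeoff named in item 1 can be explored with the classical Lander-Waterman coverage model. The sketch below is an assumption chosen for illustration, not the analysis used in this project: it treats read start positions as uniformly random and ignores sequencing errors entirely. All function names are hypothetical.

```python
import math

def coverage(num_reads, read_len, genome_len):
    """Expected coverage c = N * L / G."""
    return num_reads * read_len / genome_len

def fraction_covered(c):
    """Expected fraction of bases covered by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-c)

def expected_contigs(num_reads, c, min_overlap_frac=0.0):
    """Expected number of contigs, N * e^(-c * (1 - theta)), where theta is
    the fraction of a read that must overlap for two reads to be merged."""
    return num_reads * math.exp(-c * (1.0 - min_overlap_frac))

def reads_needed(read_len, genome_len, target_coverage):
    """Number of reads required to reach a target expected coverage."""
    return math.ceil(target_coverage * genome_len / read_len)

# Illustrative comparison on a 1 Mb genome at 30x coverage:
# very short reads versus very long reads (both read lengths are made up).
short_reads = reads_needed(25, 1_000_000, 30)
long_reads = reads_needed(10_000, 1_000_000, 30)
```

Under this model the number of reads needed scales inversely with read length at fixed coverage, while the required overlap fraction controls how quickly contigs merge as coverage grows.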
Broader impacts of the proposed research include the significant medical, economic, and scientific advances which result from reducing the costs of de novo genome sequencing. (NSF)
Lydia: Text Analysis for News and Blogs
PI: Steven Skiena
The Lydia project seeks to build a relational model of people, places, and things through natural language processing of news sources and the statistical analysis of entity frequencies and co-locations. Lydia is already producing interesting analysis of significant volumes of online text sources. Indeed, we encourage the reader to visit our websites for our latest analysis of roughly 500 U.S. daily newspapers (www.textmap.org) and hundreds of thousands of blog postings (www.textblg.org). Our news analysis is quite different from that of aggregators such as Google News. We track the temporal and spatial distribution of the entities in the news: who is being talked about, by whom, when, and where? This enables us to identify important and interesting trends through statistical techniques. Similar techniques can be applied to collections of scientific abstracts such as Pubmed/Medline. Blogs represent an interesting new frontier for text analytics. More than just text, they provide significant structural information about the author, such as precise timestamps, geographical location, age, gender, and explicit friendship links. They also provide a forum for a much larger and potentially representative group of correspondents than conventional media. Our analysis helps to shed light on questions of whether the conventional news media leads or lags popular opinion as expressed in blogs. How often do bloggers report a story before newspapers? And conversely, how often do bloggers react to news that has already been reported? (NSF)
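The statistical core described above, entity frequencies and co-locations, can be sketched in a few lines. This is a minimal illustration that assumes entity extraction has already been done; it is not Lydia's actual pipeline, and the function name and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs):
    """Count, per document, which entities appear and which pairs appear together.

    docs is a list of documents, each given as a list of entity names.
    """
    freq = Counter()
    pairs = Counter()
    for entities in docs:
        unique = sorted(set(entities))          # count each entity once per document
        freq.update(unique)
        pairs.update(combinations(unique, 2))   # unordered pairs, in sorted order
    return freq, pairs

# Toy input only; these are not real news data.
docs = [
    ["Alice", "Paris", "Bob"],
    ["Alice", "Bob"],
    ["Alice", "Paris", "Paris"],
]
freq, pairs = cooccurrence(docs)
```

Raw pair counts like these would normally be compared against what the individual frequencies predict by chance before a pairing is called significant.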
Computational Analysis of Genomic Sequence Tags
PI: Steven Skiena
There is currently no effective technology to assay the relative abundance of complex microbial communities. Probe-based methods such as microarrays can only hope to detect species which have already been at least partially sequenced; but these represent a small fraction of the millions of microbial species. The genomic sequence tag (GST) approach, pioneered by our collaborators at Brookhaven, promises to make such analysis possible for the first time. The success of the GST method largely depends upon the degree to which computational analysis can identify microbial species from very limited experimental data. GST technology has important applications in many areas of the life sciences, but particularly in ecological and medical research. Homeland security applications revolve around detection of novel pathogens, particularly previously unsequenced or genetically altered bacterial strains. (BNL)
IBM Supercomputer p650 for Modeling of Proteins, Network and Algorithm Design
PI: Yuefan Deng
Research staff and graduate students at Stony Brook University and Brookhaven National Laboratory have received from IBM, through a Shared University Research (SUR) grant, a 32-processor computer with approximately 300 Gflops of floating-point capability. Since its installation around September 2005, the SUR computer has supported several projects:
1) Applying a molecular dynamics package that we are developing for biomolecular structures. This project requires simulating a biotoxin protein with 100,000 atoms in solvent for several microseconds.
2) Analyzing the p650 architecture and comparing it with novel architectures such as the BlueGene/L and QCDOC supercomputers. The main goal of the project is to propose new architectures that are more suitable for floating-point-dominant scientific and engineering applications.
3) Developing parallelization algorithms for the p650 and gaining insights for designing more efficient algorithms, with less programmer effort, for more complex architectures such as the BlueGene/L and QCDOC supercomputers. (IBM)
Information Technology Projects
PI: Robert C. Rizzo
Dr. Rizzo's research team seeks to understand the basis for molecular recognition at the atomic level for specific biological systems involved in human disease, such as influenza, SARS, and HIV/AIDS, with the ultimate goal of developing new and improved drugs. Atomic level structural and energetic information available from computer simulations is critically important for understanding how molecules interact with a given disease target. His group is developing improved procedures for ranking the affinity of compounds, complexed with a biological target, in which the structures have been obtained using high-throughput computational screening (docking) calculations. Improved computational methods have great potential to save billions of dollars in drug development costs and reduce the time associated with bringing clinically useful medicines to market.
1) Structure-based drug design targeting HIV: Discover specific small molecules that inhibit HIVgp41 mediated cell membrane fusion. Computational high throughput screening (docking) will be used to rank hundreds of thousands of commercially available compounds (ligands) for binding affinity to a conserved hydrophobic pocket on HIVgp41 (receptor). The approach employs successive refinement of results from docking, clustering, rescoring with MM-GBSA to include desolvation, property-based filtering, diversity analysis, and visual inspection to select the most promising candidate ligands for testing by our experimental collaborators.
2) Development of Improved Scoring Functions: Develop improved receptor-ligand scoring functions for docking. To improve the accuracy of results from docking calculations to HIVgp41 and other biological targets, new energy scoring functions will be developed and tested. New functions will include terms that estimate important desolvation and entropic effects associated with receptor-ligand complexation and will be optimized and tested using energetic and structural criteria.
3) Database Development: Develop databases for testing and validation of computational methods for drug design. Validating computational methods and optimizing procedures requires comparison with known experimental information. We will develop, maintain, and make publicly available three molecular databases to help facilitate structure-based drug design. Database-01 will consist of commercially available ligands containing accurate ab initio derived partial atomic charges. Database-02 will consist of parent-receptor (experimental) with associated analog-receptor (computational) complexes. Database-03 will consist of small organic molecules and ions with associated free energies of hydration.
(NYSTAR James D. Watson Award, Applied Math & Statistics / SBU)
Computational Biology of G-protein Signal Transduction: From Biophysics to Systems Biology
PI: David Green
It is becoming increasingly clear that biology sits at the edge of a major paradigm shift as the power of state-of-the-art computational approaches is brought to bear on the problems posed by modern biology. We are actively working in this area, using multidisciplinary computational methods to develop a deeper understanding of how the physical interactions between proteins in a cell affect the cell's behavior. Many biologically and medically important phenomena are ultimately a result of the interactions between networks of proteins within and between the cells of an organism. One large family of these networks is the G-protein signaling pathway, a network that functions to propagate signals from outside a cell to the cellular machinery that governs a cell's behavior. Many variants of this network exist in different cells and organisms, with roles in sensory input, development and the immune response, among others. When these pathways malfunction, the results can be devastating: many cancers involve mutations that alter the behavior of G-protein signaling pathways; cholera is a result of a bacterial toxin that interferes with a natural G-protein network in the cells lining the intestine. We are studying these networks of interactions using a diverse set of computational tools.
Structures are known for several components of the network, including the so-called G-proteins themselves. Using these structures as a starting point for detailed calculations of the chemical and physical interactions present provides a deeper understanding of what makes the proteins interact, information that may be useful in the development of pharmaceuticals targeting these interactions. Using algorithms adapted from the field of machine learning and artificial intelligence, we are working to extend these studies to related systems where the structures may not be known. This approach allows the power of computational methods to move beyond analysis of existing experimental data.
Even the simplest single-celled organism is much more than the sum of individual interactions, and it is important to consider larger biological systems to understand these emergent properties. We are developing mathematical models to describe the full behavior of the G-protein signaling networks. Combining models based on fundamental biochemical rules and the observations of experimental biology with advanced computational techniques for analyzing networks provides a framework from which to extract the key elements that lead to particular network behaviors. Perhaps most importantly, the group is working to develop models that bridge these two approaches of study, learning how the details of individual interactions affect the network as a whole. (AMS)
Information Technology Projects
PI: John Reinitz
The Reinitz lab is dedicated to the solution of fundamental problems in molecular genetics and large scale optimization. We work on two problems: developmental pattern formation and transcription. With respect to pattern formation, we are trying to understand how animals form the blueprint for their body pattern, using the blastoderm of the fruit fly Drosophila melanogaster as a model system. We do this by formulating a model of gene regulation in the embryo, expressed as a system of nonlinear ordinary or partial differential equations. These equations are fit to gene expression data by large scale optimization methods based on simulated annealing or optimal control theory. We are also trying to understand the molecular mechanisms of metazoan transcription, with special emphasis on understanding how collections of binding sites give rise to modular enhancers. This problem is addressed by feed forward models fit to experimental data from P-transformed Drosophila embryos. Obtaining the data and the development of optimization methods are themselves important research problems for the lab.
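Fitting a differential-equation model to data by simulated annealing can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the lab's gene regulation model: it uses a single hypothetical equation dx/dt = r - d*x, synthetic data generated from known parameters, forward-Euler integration, and an arbitrary geometric cooling schedule.

```python
import math
import random

def simulate(r, d, x0=0.0, dt=0.1, steps=50):
    """Forward-Euler integration of the toy model dx/dt = r - d*x."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * (r - d * xs[-1]))
    return xs

def cost(params, data):
    """Sum of squared differences between model output and data."""
    model = simulate(*params)
    return sum((m - y) ** 2 for m, y in zip(model, data))

def anneal(data, start, t0=1.0, cooling=0.995, iters=2000, seed=0):
    """Simulated annealing over (r, d); returns the best parameters seen and their cost."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start, data)
    best, best_cost = current, current_cost
    temp = t0
    for _ in range(iters):
        # Propose a small random perturbation, keeping parameters non-negative.
        cand = tuple(max(0.0, p + rng.gauss(0.0, 0.1)) for p in current)
        cand_cost = cost(cand, data)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cand_cost < current_cost or rng.random() < math.exp(
            (current_cost - cand_cost) / temp
        ):
            current, current_cost = cand, cand_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling  # geometric cooling schedule
    return best, best_cost

data = simulate(2.0, 0.5)   # synthetic "observations" from known parameters
start = (0.5, 1.5)          # deliberately poor initial guess
best, best_cost = anneal(data, start)
```

Accepting some worse moves at high temperature is what lets the search escape local minima, which matters far more for large nonlinear systems than for this two-parameter toy.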
Our data is available on the web in the FlyEx database, located in Russia and here at Stony Brook. The group has space both in the Math Tower and in the Center for Developmental Genetics at the Centers for Molecular Medicine (http://www.sunysb.edu/ovprpub/cmm/devgenetics.html), where Dr. Carlos Alonso of AMS is Director of the CMM Confocal Imaging Facility. We collaborate closely with David H. Sharp's group in the Theoretical Division of Los Alamos National Laboratory and Maria Samsonova's group at the State Polytechnical University in St. Petersburg, Russia. (NIH)
Efficient Modeling and Analysis of Excitable Cell Networks using Hybrid Automata
PIs: Emilia Entcheva, Radu Grosu and Scott Smolka
Systems biology is an emerging multidisciplinary field whose goal is to provide a systems-level understanding of biological systems by uncovering their structure, dynamics and control methods. While many exciting and profound advances have been made in investigating robustness, network structures and dynamics, and application to drug discovery, the field is still in its infancy. An important open problem in systems biology is finding appropriate computational models that scale well for both the simulation and formal analysis of biological processes. Currently, the majority of these models are given in terms of large and complex sets of nonlinear differential equations, describing in painful detail the underlying biological phenomena. Although an invaluable asset for integrating genomics and proteomics data to reveal local interactions, such models are often not amenable to formal analysis and render simulation at the organ or even the cell level impractical. This project seeks to develop a hybrid-automata (HA) approach to modeling and analyzing complex biological systems. In particular, excitable cell networks (e.g. heart cells, muscular cells and nervous cells) will be used as an archetype of a complex biological system. (NSF)
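To make the hybrid-automata idea concrete, here is a minimal sketch in which a cell is modeled as a few discrete modes, each with a simple continuous flow, and guards that switch modes when the voltage crosses a threshold. The modes, rate constants, and thresholds are all hypothetical values picked for illustration; they are not taken from this project or from any measured cell model.

```python
def step(mode, v, stim, dt=0.1):
    """Advance one time step: apply the current mode's flow, then check its guard."""
    if mode == "rest":
        v += dt * (-0.1 * v + stim)   # leak toward 0 plus external stimulus
        if v >= 1.0:                  # guard: excitation threshold crossed
            mode = "upstroke"
    elif mode == "upstroke":
        v += dt * 50.0                # fast, constant-rate rise
        if v >= 100.0:                # guard: peak reached
            mode = "repolarize"
    else:                             # mode == "repolarize"
        v += dt * (-0.5 * v)          # exponential decay back toward rest
        if v <= 0.5:                  # guard: back near the resting level
            mode = "rest"
    return mode, v

def run(steps=200, stim_steps=10, stim=5.0):
    """Simulate one cell; the stimulus is applied only during the first stim_steps steps."""
    mode, v = "rest", 0.0
    trace = [(mode, v)]
    for k in range(steps):
        mode, v = step(mode, v, stim if k < stim_steps else 0.0)
        trace.append((mode, v))
    return trace

trace = run()

# Collapse the trace into the sequence of distinct modes visited.
modes = [trace[0][0]]
for mode, _ in trace[1:]:
    if mode != modes[-1]:
        modes.append(mode)
```

Each mode uses linear or constant dynamics, so the whole model stays far simpler than a large nonlinear system while still reproducing a stimulus-triggered pulse followed by a return to rest.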

