DISC Projects

The following are projects within the Data Intensive Scientific Computing REU:

Computer Science

(VIS) Data Visualization

Professor Chaoli Wang

If "a picture is worth a thousand words", why not make data analytics visual for users so that they can easily explore data and assimilate information? In this NSF DISC REU program, students will gain valuable experience and practical skills in creating the visual display of various data sets to enable analytical reasoning. Project topics include the following: (1) Visualization of Student Learning Data: This project aims to analyze and visualize the learning data gathered from Notre Dame students who took the Moreau First Year Experience course and design effective visual means to identify trends and detect anomalies. (2) Visualization of Microglia Atlas: This project aims to produce the microglia atlas based on the microglia cells segmented and tracked at the whole-animal level and present a user-friendly atlas that enables analytical understanding of glial diversity and cell stability. (3) Pedagogical Tool for Data Visualization: This project aims to develop an interactive web-based educational tool that helps novices learn tree and graph visualizations, which are the most widely used visual forms for representing and exploring various data relationships. These projects require programming skills in C/C++ or JavaScript, and preferably experience with graphics technologies such as OpenGL, WebGL, GLSL, or D3.js.

(CYBER) Distributed Data and Cloud Based Cyberinfrastructure for Science

Professor Paul Brenner

Professor Paul Brenner’s research addresses the growing economic, social, and environmental costs to provision, power, and operate cyberinfrastructure in the United States; the utility costs alone are forecast to exceed 7 billion dollars. The Environmentally Opportunistic Computing (EOC) approach integrates computing hardware with existing facilities and societies to create heat where it is needed, exploit available free cooling, utilize power wherever it is least expensive, and provide computational capability for science at the optimal $/FLOP or $/TB ratio. These distributed infrastructures are inherently sensitive to data locality and network capability. REU participants will have the opportunity to develop and advance data- and network-aware CI tool sets leveraging EOC to provide sustainable CI for computational science. The project requires familiarity with the Linux operating system and a willingness to tinker with virtual machines and system software to develop working systems.

(DISTSYS) Distributed Systems for Scientific Computing

Professor Douglas Thain

The Cooperative Computing Lab designs open source software that enables high-productivity computing on thousands of machines harnessed from clusters, clouds, and grids. Examples include the Makeflow workflow system, the Work Queue execution system, and the Parrot virtual filesystem. Learn how to use these tools to construct programs that run on hundreds to thousands of machines, then contribute to the design and development of this software, which is used around the world for science and engineering. Some experience programming in C or Python is necessary. (We will also offer training on these tools for use by students working on other REU projects.)

(NETSCI) Network Science for Computational Biology

Professor Tijana Milenkovic

Networks are everywhere! There exist social networks such as Facebook that link together Facebook users, technological networks such as the Internet that link together computers world-wide, and molecular networks that model interactions between genes and proteins in the cell. And networks are fun! For example, both graphite and diamond are composed of carbon atoms, but what gives them different properties (graphite being soft and dark, diamond being hard and clear) is the links between the atoms, i.e., the network. So, what do we do with networks? Prof. Milenkovic and her group (http://nd.edu/~cone/) mine real-world networks, and molecular networks in particular. Potential summer projects include: 1) developing novel computational strategies (or algorithms) for network comparison and alignment; 2) developing novel algorithms for dynamic network analysis; and 3) addressing applied questions related to studying aging and disease. Mathematical and programming skills are required; prior biological experience is not needed.
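To give a flavor of what comparing two networks can involve, here is a minimal sketch. The measures below (degree sequences and shared edges) are generic textbook illustrations, not the group's actual comparison or alignment algorithms, and the tiny example networks are invented:

```python
# Compare two small undirected networks by two simple summaries:
# their sorted degree sequences and the overlap of their edge sets.

from collections import Counter

def degree_sequence(edges):
    """Sorted list of node degrees in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.values(), reverse=True)

def edge_jaccard(edges_a, edges_b):
    """Jaccard similarity of two edge sets (ignores edge direction)."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return len(a & b) / len(a | b)

# Two tiny toy "interaction" networks that share most of their edges.
net1 = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
net2 = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]

print(degree_sequence(net1))     # [3, 2, 2, 1]
print(edge_jaccard(net1, net2))  # 0.6
```

Real network alignment must also decide which node in one network corresponds to which node in the other, which is what makes the problem computationally hard and interesting.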

(FLoRIN) Fast Learning-free Reconstruction of Neural Circuits

Professor Walter Scheirer

Large-scale study of microscopic images for connectomics (the subfield of neuroscience concerned with reconstructing neural circuits in the brain) using machine learning is hindered by insufficient ground truth annotations and massive training times. The CVRL lab at Notre Dame has introduced the Fast Learning-free Reconstruction of Neural Circuits (FLoRIN) pipeline, a learning-free pipeline for sparse segmentation and reconstruction of neural volumes from multiple imaging modalities, to address this. Based on comparisons to existing annotated X-Ray and electron microscopy volumes, we have shown that FLoRIN reconstructions are of a higher quality than those created by state-of-the-art machine learning-based systems and can be created in a fraction of the time. For instance, FLoRIN segmented an entire rodent brain imaged at 4μm in under 20 hours. In this REU project, we envision work to integrate FLoRIN output with machine learning-based approaches to augment their performance, as well as an exploration of different animal model systems (e.g., spiders, butterflies, octopuses) using the software.
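To convey the learning-free idea in miniature: threshold an image, then group foreground pixels into connected components. The published FLoRIN pipeline operates on large volumes with a more sophisticated neighborhood thresholding step; this toy uses a single global threshold on a hand-made 2D grid:

```python
# Learning-free segmentation sketch: global threshold + 4-connected
# component labeling via breadth-first search.

from collections import deque

def segment(image, threshold):
    """Return a list of connected foreground components (4-connectivity)."""
    h, w = len(image), len(image[0])
    seen = set()
    components = []
    for si in range(h):
        for sj in range(w):
            if image[si][sj] >= threshold and (si, sj) not in seen:
                comp, queue = [], deque([(si, sj)])
                seen.add((si, sj))
                while queue:
                    i, j = queue.popleft()
                    comp.append((i, j))
                    for ni, nj in ((i-1, j), (i+1, j), (i, j-1), (i, j+1)):
                        if (0 <= ni < h and 0 <= nj < w and
                                image[ni][nj] >= threshold
                                and (ni, nj) not in seen):
                            seen.add((ni, nj))
                            queue.append((ni, nj))
                components.append(comp)
    return components

img = [
    [0, 9, 9, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 8],
]
print(len(segment(img, 5)))  # 2 components
```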

(GRAPH) Graph Computing

Professor Peter Kogge

A “graph” is a set of “vertices” representing unique entities and a set of “edges” that connect them. Today, graphs are at the core of what we do on a daily basis. An example is a social network where the vertices are users of the system and the edges are relationships such as “likes”. This REU project is associated with an NSF-funded project to develop new graph computing benchmarks with real-world relevance. The particular tasks for this summer include developing sample graphs with multiple types of vertices that can be used to test the new benchmarks. Today the standard graph generators create only graphs with a single type of vertex; this project would modify such generators to create graphs with two or more vertex types that can be used with both current and new graph routines. As time permits, additional activities may involve performing scaling experiments using these generated graphs and/or porting the graph benchmarks to newer graph programming frameworks such as Facebook’s Graph API.
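A toy version of the generator modification described above, emitting vertices of more than one type, might look like the following. The "user"/"post" schema and the edge probability are invented for illustration; real benchmark generators are far more elaborate:

```python
# Random generator for a graph with two vertex types, where edges
# ("likes") only connect user vertices to post vertices.

import random

def typed_graph(n_users, n_posts, p_like, seed=0):
    """Return (vertices, edges) with each vertex tagged by its type."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    users = [("user", i) for i in range(n_users)]
    posts = [("post", j) for j in range(n_posts)]
    edges = [(u, p) for u in users for p in posts if rng.random() < p_like]
    return users + posts, edges

vertices, edges = typed_graph(n_users=3, n_posts=4, p_like=0.5)
print(len(vertices))  # 7 vertices of two types
```

Scaling experiments would then sweep the vertex counts and mixing probabilities upward while timing the benchmark kernels on the generated graphs.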

(LOBSTER) Crunching LHC Data on 25K Cores with Lobster 

Professor Kevin Lannon

The need for computing resources at the Large Hadron Collider (LHC) is rapidly outpacing the available funding. The Large-scale Opportunistic Batch Submission Toolkit for Exploiting Resources (Lobster) software has enabled physicists analyzing data from the Compact Muon Solenoid (CMS) experiment at the LHC to leverage the roughly 25,000 CPU cores available as opportunistic resources through the Notre Dame Center for Research Computing (CRC). Lobster grew out of an REU project, and there are still many exciting directions in which it could be further developed, from making Lobster more intelligent so that it can automatically adjust to changing conditions, to taking Lobster beyond ND's campus to run on computers around the world.

(MDAPP) Modern Data Analytics Approaches in Particle Physics 

HEP Faculty

Finding evidence for new particles in the data collected at the Large Hadron Collider (LHC) has been likened to finding a needle in a needle-stack. The particle physics community has developed computational tools to tackle this enormous challenge, and while the tools do an outstanding job of sifting through gigantic piles of data, they start to bog down during the final stages of analysis where quick turn-around time becomes critical. However, companies like Google have developed solutions that can search through data with lightning speed. In partnership with researchers at Hewlett-Packard, the REU student will explore the potential of the HP Vertica Analytics Platform to accelerate analysis of data from the Compact Muon Solenoid (CMS) experiment at the LHC.

(NMLAP) Novel Machine Learning Approaches in Physics

Over the past two decades, machine learning has become an established tool in physics. However, recent advances in other fields, such as image recognition and natural language processing, have yet to be fully exploited within physics. For example, deep learning techniques first explored for image recognition have generated significant improvements over more conventional approaches within experimental particle physics. Another distinguishing factor for deep learning is the need for GPGPU programming techniques to accelerate the training process. REU students in both astrophysics and experimental particle physics will explore the potential of deep learning techniques when applied to research problems in those domains.

(ECOEVO) Rapid Evolution in Response to Rising Sea Level

Professor Jason McLachlan

Rapid environmental change is putting novel strains on organisms in the wild. Evolutionary change could help them accommodate new stresses, but rapid evolution is usually hard to document in natural systems. We revived seeds of a coastal sedge buried in sediments spanning the last century and found that plants have changed genetically over this time period. Genotypes from the past respond differently to changes in sea-level, salinity, and atmospheric CO2, three environmental factors that continue to change in their coastal habitat. The REU student will help us understand these joint evolutionary/ecosystem dynamics by fitting genotype, phenotype, and environmental changes to a dynamic model our lab is developing. The data come from field experiments that have taken place at the Smithsonian Environmental Research Center (SERC) over the past 30 years, and current greenhouse and field experiments from our lab. We fit our model using Bayesian hierarchical approaches, which allow us to estimate the current and future state of the eco-evolutionary system as it responds to ongoing environmental change. There are opportunities for REU students interested in eco-evolutionary dynamics, global environmental change, fitting data with statistical models, and model-data fusion.
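To illustrate the flavor of Bayesian model fitting in miniature: the sketch below uses a grid approximation to get a posterior over a single slope parameter, say a genotype's growth response to salinity. The data values, the flat prior, and the single-parameter model are all invented for illustration; the lab's actual models are hierarchical and far richer:

```python
# Grid-approximation Bayesian fit of a slope b in y = b*x + noise,
# assuming Gaussian noise and a flat prior over the grid.

import math

def grid_posterior(xs, ys, b_grid, sigma=1.0):
    """Posterior probabilities over candidate slopes."""
    logliks = []
    for b in b_grid:
        ll = sum(-0.5 * ((y - b * x) / sigma) ** 2 for x, y in zip(xs, ys))
        logliks.append(ll)
    m = max(logliks)                      # subtract max for stability
    w = [math.exp(l - m) for l in logliks]
    z = sum(w)
    return [wi / z for wi in w]

xs = [0.0, 1.0, 2.0, 3.0]              # e.g., salinity levels (made up)
ys = [0.1, 2.1, 3.9, 6.2]              # e.g., observed responses (made up)
b_grid = [i * 0.1 for i in range(41)]  # candidate slopes 0.0 .. 4.0
post = grid_posterior(xs, ys, b_grid)
best = b_grid[post.index(max(post))]
print(round(best, 1))  # posterior mode near the least-squares slope (~2.0)
```

Hierarchical approaches extend this idea by letting parameters such as the slope vary by genotype while sharing statistical strength across genotypes.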

(TRAITS) Identifying Bacterial Ecological Syndromes and Tradeoffs through Comparative Genomics

Professor Stuart Jones

Bacteria can be found everywhere, from hydrothermal vents at the bottom of the ocean to your belly button. These microorganisms drive global biogeochemical cycles and may even drive your food cravings. Recent advances in DNA sequencing technology have provided access to the genome sequence of more than 40,000 bacterial species. In this project, students will work to identify genome-derived clues for where these bacteria live and how they make a living.

(EVGR) Epigenetic Variation and Gene Regulation

Despite the significant amount of information on genetic variation, clear causal links between genetic variation and the onset of complex phenotypes, including human disease, are often elusive. In large part, this difficulty in linking genetic factors to phenotypes is due to complex genotype x environment interactions. Increasing evidence suggests that epigenetic factors often play a crucial role in mediating the genotype x environment interactions that ultimately trigger the onset of diseases such as cancer and arthritis, as well as contributing to cellular senescence. We are using next-generation sequencing in a model system, the waterflea Daphnia, to elucidate the role of epigenetic variation in modulating patterns of gene regulation in response to environmental stress. The REU student will map epigenetic patterns in natural populations to reference genomes and RNA-seq data to test hypotheses about the relationship between methylation and gene regulation.

Identifying bacterial ecological syndromes and tradeoffs through comparative genomics (see the TRAITS project above): The availability of bacterial whole genome sequences has grown exponentially over the last decade; today, more than 40,000 genome sequences are available in public databases. The field of microbial ecology is no longer data limited, and the challenge that faces the field now is how to most efficiently translate this data into ecological knowledge. The strategy my group has adopted is to combine comparative genomic approaches from microbiology and computer science with the trait-based ecological framework most often employed in plant ecology. In this project, we use machine-learning approaches to identify genomic markers of known ecophysiological traits of bacteria. Following the generation of these genomic traits, we can evaluate whether correlations exist between suites of genomic traits, which would indicate ecological tradeoffs or syndromes. Additionally, associations between genomic traits and the distribution of bacteria through space and time can be evaluated. This general area of microbial trait-based ecology is rich with opportunities for summer students with interest in computer science or computational biology.
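As a toy version of the marker-identification idea: score each gene family by how strongly its presence or absence separates genomes with a trait from genomes without it. The genomes, gene names, and trait labels below are invented, and the group's actual machine-learning pipeline is more sophisticated than this single-gene frequency contrast:

```python
# Score candidate genomic markers of a binary trait from a
# genome-by-gene presence/absence table.

def marker_scores(genomes, trait):
    """genomes: {name: set of gene families}; trait: {name: 0 or 1}.
    Returns {gene: |freq in trait class 1 - freq in trait class 0|}."""
    all_genes = set().union(*genomes.values())
    by_class = {0: [], 1: []}
    for name, genes in genomes.items():
        by_class[trait[name]].append(genes)
    scores = {}
    for g in all_genes:
        f0 = sum(g in gs for gs in by_class[0]) / len(by_class[0])
        f1 = sum(g in gs for gs in by_class[1]) / len(by_class[1])
        scores[g] = abs(f1 - f0)
    return scores

genomes = {
    "sp1": {"nifH", "recA"}, "sp2": {"nifH", "recA", "amoA"},
    "sp3": {"recA"}, "sp4": {"recA", "amoA"},
}
trait = {"sp1": 1, "sp2": 1, "sp3": 0, "sp4": 0}  # hypothetical trait labels
scores = marker_scores(genomes, trait)
print(max(scores, key=scores.get))  # 'nifH' perfectly separates the classes
```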

(MALARIA) Modeling Malaria Drugs using Genome-wide Transcriptional Response Profiles

Malaria is a global health concern, with over 200 million infections and 600,000 deaths a year. Through the use of drug treatment and vector control, malaria mortality has decreased 45% over the past decade. High-throughput drug screens have identified thousands of compounds with anti-malarial activity; however, more research is needed to prioritize compounds for advancement into pre-clinical and clinical trials [37]. A large-scale methodology is needed to identify the mechanism of action (MOA) by which each of these drugs functions to kill malaria parasites. We are developing methods to use transcriptional cellular responses of the deadly malaria parasite, Plasmodium falciparum, to identify a set of genes and a pattern of expression that typifies a response to perturbation by a particular drug (a signature of MOA). We hypothesize that the transcriptional response profile of P. falciparum to drug perturbation is highly indicative of the intracellular targets of those drugs within the parasite, and that this information can be used to predict drug MOA when the response profile for an unknown drug is compared to the response profiles for drugs with known MOA. A MOA signature for a given drug consists of the RMA-normalized expression value for each gene, normalized by the mean expression value across all perturbations for that gene. This normalization removes the non-specific stress responses of the parasites and is a critical step for the successful application of our method. In this way we can build a database of MOA signatures for drugs with known MOA to identify target relationships for drugs with no known target. Specifically, we need to demonstrate that drug perturbations in P. falciparum have a discernible MOA signature, build a reference database of MOA signatures, and validate the MOA signatures using genetic perturbations [39, 40]. Finally, we must show that the signature of an individual drug perturbation is consistent across experiments. We anticipate building a web tool to house the MOA signature database and allow queries for candidate drugs.
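The per-gene normalization described above can be sketched in a few lines. Whether the cross-perturbation mean is divided out or subtracted is a detail the description leaves open; this sketch subtracts it, and the expression values are invented:

```python
# Build per-drug MOA signatures by centering each gene's expression
# by that gene's mean across all drug perturbations, which removes
# shared, non-specific stress responses.

def moa_signatures(expression):
    """expression: {drug: {gene: RMA-normalized value}}.
    Returns {drug: {gene: centered value}}."""
    drugs = list(expression)
    genes = expression[drugs[0]].keys()
    mean = {g: sum(expression[d][g] for d in drugs) / len(drugs)
            for g in genes}
    return {d: {g: expression[d][g] - mean[g] for g in genes}
            for d in drugs}

expr = {
    "drugA": {"gene1": 8.0, "gene2": 5.0},
    "drugB": {"gene1": 6.0, "gene2": 5.0},
}
sigs = moa_signatures(expr)
print(sigs["drugA"]["gene1"])  # 1.0 (above the cross-perturbation mean)
```

Note how gene2, which responds identically to both drugs, is zeroed out: that is exactly the shared stress response the normalization is meant to remove.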

(CHROMO) Chromosome Inversions and Adaptation

The REU student will be involved in performing computer simulations of how inversions (stretches of chromosomes with reversed gene order relative to each other) may affect the potential for ecological adaptation and speciation to occur when gene flow is occurring between populations. This is a highly debated issue in the field of speciation genomics; it is thought that the recombination-reducing effects of inversions may help different sets of adaptive genes diverge, facilitating speciation. However, many conflicting results call this theory into question. The student will work to ascertain the parameter space in which inversions could facilitate the speciation process and be involved in analyzing DNA sequence data currently being generated from Rhagoletis pomonella (in collaboration with colleagues from the Univ. of Illinois, Rice University, the Univ. of Colorado, and Kansas State University) and Drosophila pseudoobscura (in collaboration with a research team at Duke University) fruit flies to empirically test predictions of the inversion hypothesis.
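A deterministic single-locus caricature of the two-deme setting, far simpler than the multilocus inversion simulations the project uses and with made-up selection and migration parameters, illustrates how adaptive divergence can persist despite gene flow:

```python
# Two-deme migration-selection balance for one allele 'a': favored in
# deme 1 (fitness 1+s), disfavored in deme 2 (fitness 1-s), with
# symmetric migration at rate m each generation.

def next_freqs(p1, p2, s, m):
    """One generation: selection within each deme, then migration."""
    w1 = p1 * (1 + s) + (1 - p1)       # mean fitness in deme 1
    w2 = p2 * (1 - s) + (1 - p2)       # mean fitness in deme 2
    p1s = p1 * (1 + s) / w1            # post-selection frequencies
    p2s = p2 * (1 - s) / w2
    return ((1 - m) * p1s + m * p2s,   # migration mixes the demes
            (1 - m) * p2s + m * p1s)

p1 = p2 = 0.5
for _ in range(200):
    p1, p2 = next_freqs(p1, p2, s=0.1, m=0.01)
print(round(p1, 2), round(p2, 2))  # demes diverge despite ongoing gene flow
```

In the project's actual simulations the interesting quantity is how an inversion, by suppressing recombination among several such loci, changes where in (s, m) parameter space this kind of divergence can be maintained.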

(GENETIC) Developing High-density Genetic Maps in Non-model Systems

Understanding the genetic basis of phenotypic variation is central to evolutionary biology and human disease research. A powerful approach to making the connection between genome and phenotype relies on a statistical analysis of the relationship between variation in the genome and variation in the phenotype. These analyses are referred to as Quantitative Trait Loci (QTL) analyses and are fundamental to modern agriculture and human disease research. QTL analysis requires a well-resolved genetic map, which until recently was extremely difficult to construct in any non-model species. Recent advances in high-throughput sequencing now make it possible to establish high-density genetic maps in many species of ecological and evolutionary relevance. The REU student will work with sequence data to develop genetic maps and conduct QTL analysis in species ranging from freshwater invertebrates like the waterflea Daphnia, to important agricultural pests like Rhagoletis fruit flies, to long-lived hardwood trees in the Oak species complex.

Coupled Hydrologic and Biogeochemical Modeling of Lake Regions

Lakes represent key biogeochemical hotspots globally. Across the Holocene, lakes have stored twice as much carbon in their sediments as is currently sequestered in terrestrial biomass. In addition, lakes and reservoirs process about one-third of the nitrogen and over half of the carbon exported from the terrestrial landscape annually. Unfortunately, the geographic, climatic, and hydrologic context of lakes is completely ignored when generating these estimates; rather, global rates of carbon burial or release to the atmosphere are derived from average observed rates multiplied by global lake area. We are developing computationally scalable, coupled catchment-lake models to capture important spatial and temporal heterogeneity in lake biogeochemistry and greenhouse gas cycling. With recent NSF support we are augmenting these models to capture long-term, lagged responses to catchment land use. REU students with computational interests would be ideal for developing and testing catchment modules for our existing regional model and testing these models against existing data on lake hydrology and biogeochemistry using model-data fusion approaches.
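The kind of coupled catchment-lake carbon accounting described here can be caricatured with a single mass balance. All parameter values below are made up for illustration; the real models resolve spatial heterogeneity that this toy deliberately ignores:

```python
# Euler integration of a lake carbon stock C with constant catchment
# load and first-order losses: dC/dt = load - (k_burial + k_emission)*C.

def simulate_lake(years, load, k_burial, k_emission, c0=0.0, dt=0.1):
    """Return (final lake C stock, total buried, total emitted)."""
    c, buried, emitted = c0, 0.0, 0.0
    for _ in range(int(years / dt)):
        burial = k_burial * c * dt      # carbon sequestered in sediments
        emission = k_emission * c * dt  # carbon released to the atmosphere
        c += load * dt - burial - emission
        buried += burial
        emitted += emitted == emitted and emission  # accumulate emission
    return c, buried, emitted

c, buried, emitted = simulate_lake(years=50, load=10.0,
                                   k_burial=0.2, k_emission=0.3)
print(round(c, 1))  # approaches steady state load/(k_burial+k_emission) = 20.0
```

Splitting the loss term into burial and emission is what lets such a model apportion the catchment's carbon export between sediments and the atmosphere, which is the quantity the global estimates get wrong when they ignore lake context.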