The Coordinating and Bioinformatics unit is responsible for the creation of the
software and informatics infrastructure for the consortium as well as facilitating
the efforts of the mouse engineering centers. This page provides information about
the infrastructure created for the consortium as well as any software created for
the scientific community.
Please click the links below for more information.
Infrastructure Information
AMDCC IT Infrastructure
Our programming paradigm is to develop software systems based on an n-tier architecture,
in which the presentation layer, business logic, and data layer are implemented as
separate software systems. These systems have been developed to minimize maintenance
while providing a robust, scalable model for future growth and for interactions at the
national level with other organism databases. They were designed using the Unified
Modeling Language (UML), and the designs are available to the general public.
The two UML modeling tools we use are Rational Rose and PowerDesigner.
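The separation of tiers described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not AMDCC code: the Experiment record, the class names, and the in-memory data are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    """Plain record passed between tiers (hypothetical fields)."""
    experiment_id: int
    title: str


class ExperimentDataLayer:
    """Data tier: the only code that knows how records are stored."""

    def __init__(self):
        # An in-memory dict stands in for the relational database.
        self._rows = {1: "Glucose tolerance test", 2: "Kidney histology"}

    def fetch_all(self):
        return [Experiment(i, t) for i, t in sorted(self._rows.items())]


class ExperimentService:
    """Business tier: applies rules, never touches storage or display."""

    def __init__(self, data_layer):
        self._data = data_layer

    def find_by_keyword(self, keyword):
        key = keyword.lower()
        return [e for e in self._data.fetch_all() if key in e.title.lower()]


def render_experiments(experiments):
    """Presentation tier: formats results and holds no business rules."""
    return "\n".join(f"{e.experiment_id}: {e.title}" for e in experiments)


service = ExperimentService(ExperimentDataLayer())
report = render_experiments(service.find_by_keyword("histology"))
```

Because each tier depends only on the one beneath it, the storage or the display could in principle be replaced without changing the other two.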
AMDCC Data Model
The core relational data model for the AMDCC was created using SQL Server 2000 and
was based on a number of existing schemas covering our key subject areas: animal
models, genotypes (including array experiment data), histopathology, and phenotype
assays. The Mouse Models of Human Cancer Consortium (MMHCC) and the Jackson Laboratory
were particularly helpful and shared several successful models. The AMDCC Data Model
has since been migrated to SQL Server 2005 and modified to include the data schema
of the MMPC (National Mouse Metabolic Phenotyping Centers). The current version
of the database addresses several domains, including AMDCC-MMPC administration,
models, strains, publications, external database references, experiments, phenotype
assays, microarray data, histology, images, and dataset persistence. The current data
model has 250 tables, 55 functions, 994 stored procedures, 141 data views, and a
total of 9344 lines of code.
AMDCC Administration Data Model
AMDCC Science Data Model
* Note: The links above require Internet Explorer version 5.0 or later to view the data
model with zoom capability. Please accept the ActiveX warning to start the viewer.
The viewer lists the different data schemas in its navigation drop-down box; click Go
next to the links to load a different schema.
AMDCC Object Model
The AMDCC Object Model (AMDCC-OM) created for the consortium fully describes the
activities of the AMDCC and provides an object-oriented API to access the data generated
by the consortium. The AMDCC-OM was designed using PowerDesigner and UML, written in
C#, and compiled as a .NET DLL. The object model contains both administrative and
domain-specific classes. However, only the data-centric classes are available to
the public. The domain classes include both object-specific classes (e.g. Model,
Strain, Experiment, Protocol) and the DataManager and SearchCriteria classes
used to retrieve data from the system. The DataManager classes are specific to
each of the data types maintained by AMDCC. For example, the StrainMgr class provides
methods to retrieve strain-specific data. The SearchCriteria classes are also specific
to a data type and are used by the DataManager classes to query the database with
type-specific parameters. For example, the StrainSearchCriteria class provides queryable
properties specific to the Strain data in the system.
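The pairing of a DataManager class with a SearchCriteria class can be sketched as follows. This is an illustration in Python, not the AMDCC-OM itself (which is a C# library): only the class names StrainMgr and StrainSearchCriteria come from the description above, while the search method, the fields, and the sample records are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Strain:
    """Object-specific class (fields are hypothetical)."""
    strain_id: int
    name: str
    background: str


@dataclass
class StrainSearchCriteria:
    """Queryable properties for strains; None means 'do not filter'."""
    name_contains: Optional[str] = None
    background: Optional[str] = None


class StrainMgr:
    """Data manager for one data type; hides how records are retrieved."""

    def __init__(self, records: List[Strain]):
        self._records = records  # stand-in for a database connection

    def search(self, criteria: StrainSearchCriteria) -> List[Strain]:
        results = []
        for strain in self._records:
            if criteria.name_contains and criteria.name_contains not in strain.name:
                continue
            if criteria.background and criteria.background != strain.background:
                continue
            results.append(strain)
        return results


mgr = StrainMgr([
    Strain(1, "example-strain-A", "C57BL/6J"),
    Strain(2, "example-strain-B", "DBA/2J"),
    Strain(3, "example-strain-C", "C57BL/6J"),
])
hits = mgr.search(StrainSearchCriteria(background="C57BL/6J"))
```

Keeping the query parameters in their own class means a new searchable property can be added without changing the signature of the manager's search method.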
The AMDCC Object Model base was later modified to add the MMPC (National Mouse Metabolic
Phenotyping Centers) schema. The common object model now contains classes that serve
both the AMDCC and MMPC consortium web portals.
In order to provide the broadest access to the data, we are also creating a WebService
that exposes specific portions of the AMDCC-OM to the public. Specifically, the
WebService will provide access to all the object-specific classes as well as the
DataManager and SearchCriteria classes. This provides a mechanism for programmers
to create local AMDCC-OM objects in other languages. The current version of the
AMDCC-OM has 185 object classes.
AMDCC-Web Services
The AMDCC Web Services layer exposes classes and methods of the AMDCC object model
so that users can interact with it from custom-built web applications or from programs
with no user interface at all. Details about the interfaces
are provided to users through an XML document called a Web Services Description
Language (WSDL) document. Several tools are available to read a WSDL file
and generate the code required to communicate with an XML web service, including
the very capable "Add Web Reference" tool in Microsoft Visual Studio. The AMDCC web
services layer makes available public data search and retrieval methods for animals,
strains, experiments, histology images, investigators, phenotype assays, and publications.
The exposed web service methods can be consumed by custom ASP.NET client
applications using SOAP calls, or through traditional HTTP GET/POST methods without
the use of an API. The framework has been designed to be independent of any particular
programming model and of other implementation-specific semantics. Complete documentation
is available for each of the web service methods, describing the return data type,
input parameters, and exceptions thrown. In addition, users may choose
to download a zipped Visual Studio 2008 solution file containing a sample ASP.NET
client application and C# class library project.
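The two ways of calling a web service method mentioned above (a SOAP call and a plain HTTP GET) can be sketched as follows. The sketch only builds the requests and does not contact any server. The endpoint URL, the XML namespace, the method name SearchStrains, and the parameter strainName are hypothetical placeholders, and the GET form assumes the ASP.NET convention of appending the method name to the service path; the real values should be taken from the service's WSDL document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_URL = "https://example.org/AmdccService.asmx"   # hypothetical endpoint
SERVICE_NS = "http://example.org/amdcc"                  # hypothetical namespace


def build_soap_request(method, params):
    """Return a SOAP 1.1 envelope (as a string) calling `method` with `params`."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def build_get_url(method, params):
    """Return the URL for the same call made as a plain HTTP GET."""
    return f"{SERVICE_URL}/{method}?{urlencode(params)}"


# Hypothetical method and parameter names, for illustration only.
soap_xml = build_soap_request("SearchStrains", {"strainName": "example"})
get_url = build_get_url("SearchStrains", {"strainName": "example"})
```

In practice a client generated from the WSDL file would build the SOAP envelope automatically; the manual version here only shows what travels over the wire.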
Software Applications
ParaKMeans
ParaKMeans is a high-performance parallel implementation of the K-means
clustering algorithm. We designed the software so that it can be deployed on most Windows
operating systems. The applications are written for the .NET Framework v1.1 using
the C# programming language. The parallel nature of the application comes from the
use of a web service to perform the distance calculations and cluster assignments.
Because we use a web service, at least one computer must have Internet
Information Services (IIS v5 or later) installed and running. The parallel K-means
algorithm used in this application is based on the work of Bin Zhang, Meichun Hsu,
and George Forman.
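The division of labor described above can be sketched in a few lines. This is a simplified illustration in Python, not the ParaKMeans code: worker threads stand in for the web service nodes, each worker performs the distance calculations and cluster assignments for its share of the data, and a coordinator merges the per-cluster sums and counts to update the centroids. The function names and the test data are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(points, centroids):
    """Worker step: assign each point to its nearest centroid and return
    per-cluster coordinate sums and counts for this data partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        k = dists.index(min(dists))
        counts[k] += 1
        for j, value in enumerate(p):
            sums[k][j] += value
    return sums, counts


def parallel_kmeans(points, centroids, n_workers=2, n_iter=10):
    """Coordinator: split the data, gather partial statistics, update centroids."""
    chunks = [points[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_iter):
            results = list(pool.map(lambda c: partial_stats(c, centroids), chunks))
            new_centroids = []
            for k, old in enumerate(centroids):
                total = sum(r[1][k] for r in results)
                if total == 0:
                    new_centroids.append(old)  # keep an empty cluster in place
                    continue
                new_centroids.append([
                    sum(r[0][k][j] for r in results) / total
                    for j in range(len(old))
                ])
            if new_centroids == centroids:
                break  # centroids no longer move
            centroids = new_centroids
    return centroids


data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [9.0, 9.0], [9.2, 8.8], [8.8, 9.2]]
centers = parallel_kmeans(data, [[0.0, 0.0], [10.0, 10.0]])
```

Only the small per-cluster sums and counts travel back to the coordinator, so the amount of data exchanged per iteration does not grow with the size of each partition.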
If you make use of the program presented here, please cite the following article:
Kraj P, Sharma A, Garge N, Podolsky R, McIndoe RA: ParaKMeans: Implementation of
a parallelized K-means algorithm suitable for general laboratory use. BMC Bioinformatics
2008;9:200.
HPCluster
Clustering is an unsupervised exploratory technique applied to microarray data to
find similar data structures or expression patterns. Because of the high I/O costs
involved and the large distance matrices calculated, most clustering algorithms
fail on large datasets (30,000+ genes/200+ arrays). We propose a new two-stage algorithm
that partitions the high-dimensional space associated with microarray data using
hyperplanes. The first stage is based on the BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies) algorithm, and the second stage is a conventional
k-means clustering technique. Because the first stage traverses the data in a single
scan, performance and speed increase substantially. The data reduction accomplished
in the first stage reduces the memory requirements, allowing us
to cluster 44,460 genes without failure, and significantly decreases the time to
complete when compared to popular k-means programs. The algorithm has been implemented
in a software tool (HPCluster) designed to cluster gene expression data. The software
was written in C# (.NET 1.1).
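The two-stage idea (reduce the data in a single scan, then run k-means on the reduced set) can be sketched as follows. This Python example is a deliberately simplified stand-in and not the HPCluster algorithm: stage 1 uses a plain distance threshold instead of a BIRCH tree or hyperplane partitioning, and the function names, the radius parameter, and the test data are assumptions.

```python
import math


def reduce_single_scan(points, radius):
    """Stage 1: one pass over the data. Each point is absorbed by the nearest
    micro-cluster within `radius`, or starts a new one. Only a count and a
    coordinate sum are kept per micro-cluster, not the points themselves."""
    micro = []  # each entry: [count, coordinate_sums]
    for p in points:
        best, best_dist = None, radius
        for m in micro:
            centroid = [s / m[0] for s in m[1]]
            d = math.dist(p, centroid)
            if d <= best_dist:
                best, best_dist = m, d
        if best is None:
            micro.append([1, list(p)])
        else:
            best[0] += 1
            best[1] = [s + v for s, v in zip(best[1], p)]
    return micro


def weighted_kmeans(micro, centroids, n_iter=20):
    """Stage 2: conventional k-means on the micro-cluster centroids, each
    weighted by the number of original points it summarizes."""
    for _ in range(n_iter):
        sums = [[0.0] * len(c) for c in centroids]
        weights = [0] * len(centroids)
        for count, coord_sums in micro:
            center = [s / count for s in coord_sums]
            k = min(range(len(centroids)),
                    key=lambda i: math.dist(center, centroids[i]))
            weights[k] += count
            sums[k] = [a + b for a, b in zip(sums[k], coord_sums)]
        centroids = [
            [s / w for s in total] if w else old
            for total, w, old in zip(sums, weights, centroids)
        ]
    return centroids


data = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
micro = reduce_single_scan(data, radius=0.5)
centers = weighted_kmeans(micro, [[0.0, 0.0], [6.0, 6.0]])
```

Stage 2 never sees the original points, only the summaries, which is where the memory saving comes from in this kind of design.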
If you make use of the program presented here, please cite the following article:
Sharma A, Podolsky R, Zhao J, McIndoe RA: A modified hyperplane clustering algorithm
allows for efficient and accurate clustering of extremely large datasets. Bioinformatics
2009;25:1152-1157.
ParaSAM: An application for significance analysis of microarrays using a parallelized algorithm
Significance Analysis of Microarrays (SAM) is a permutation-based method that relies
on estimating the false discovery rate (FDR) to determine significance. SAM is freely
available as an Excel plug-in and as an R package. However, for large datasets the memory
requirements are high and the algorithm fails. To overcome the memory limitations,
we have developed a parallelized version of the SAM algorithm called ParaSAM. This
high-performance multithreaded application does not require programming experience
to run and is designed to provide the general scientific community with an easy-to-use
and manageable client-server Windows application. The parallel nature of the application
comes from the use of web services to perform the permutations. The software is
written in C# (.NET 1.1) and is designed in a modular fashion to provide flexibility
in both deployment and the user interface. Our results indicate that ParaSAM
is not only faster than the serial versions but can also analyze extremely large datasets
that cannot be analyzed on a single PC.
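The role of the permutations in estimating the FDR can be illustrated with a short sketch. This Python example is a simplification and not the ParaSAM implementation: it uses a fixed cutoff on the SAM-style relative difference rather than the full SAM procedure, worker threads stand in for the web services, and the function names, the constant s0, the cutoff, and the expression values are assumptions made for the example.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def d_statistic(values, labels, s0=0.1):
    """SAM-style relative difference for one gene: difference in group means
    divided by (pooled standard error + s0)."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 0]
    n1, n2 = len(g1), len(g2)
    m1, m2 = statistics.fmean(g1), statistics.fmean(g2)
    ss = sum((v - m1) ** 2 for v in g1) + sum((v - m2) ** 2 for v in g2)
    s = ((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / (s + s0)


def count_called(data, labels, threshold):
    """Number of genes whose |d| meets the threshold under the given labels."""
    return sum(abs(d_statistic(row, labels)) >= threshold for row in data)


def permuted_count(args):
    """One permutation: shuffle the labels with its own seeded generator."""
    data, labels, threshold, seed = args
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    return count_called(data, shuffled, threshold)


def estimate_fdr(data, labels, threshold, n_perm=100, n_workers=4):
    """Estimated FDR = median false calls across permutations / observed calls."""
    called = count_called(data, labels, threshold)
    jobs = [(data, labels, threshold, seed) for seed in range(n_perm)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        false_calls = list(pool.map(permuted_count, jobs))
    if called == 0:
        return 0.0
    return min(1.0, statistics.median(false_calls) / called)


# Made-up expression values: rows are genes, columns are six arrays.
expression = [
    [5.1, 4.9, 5.0, 1.0, 1.2, 0.9],  # differs clearly between the two groups
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no group difference
    [3.0, 2.8, 3.1, 3.0, 2.9, 3.2],  # no group difference
]
groups = [1, 1, 1, 0, 0, 0]
fdr = estimate_fdr(expression, groups, threshold=5.0, n_perm=50)
```

Each permutation is independent of the others, which is why this step distributes naturally across several machines or threads.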
If you make use of the program presented here, please cite the following article:
Sharma A, Zhao J, Podolsky R, McIndoe RA: ParaSAM: A parallelized version of the
significance analysis of microarrays algorithm. Bioinformatics 2010.