Case studies

Predicting Parasite Presence

Researchers were interested in creating a diagnostic test to identify the presence or absence of a parasite in patients. The data available consisted of measurements on 450,000 peptides from infected and healthy patients. Before analysis, the data were normalized and quality-controlled. To develop a classifier, we first looked at the peptides individually to identify which were differentially expressed between healthy and infected patients.

The most differentially expressed peptides were identified using the Bioconductor ‘limma’ package. We then searched for a signature, a small but informative subset of these peptides, using machine learning methods such as linear discriminant analysis, random forest, support vector machines (SVM) and logistic regression, each fitted with an increasing number of features. When tested via tenfold cross-validation, this methodology yielded a perfect classification rate across all provided data for the optimal combination of classification method and number of features.
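The evaluation idea can be illustrated with a minimal Python sketch. It is not the project's code: the ranking score is a simple stand-in for limma, a nearest-centroid rule stands in for the classifiers named above, and all function names are hypothetical. The point it shows is that feature selection is repeated inside every fold, so the held-out samples never influence which features are chosen.

```python
import random
from statistics import mean, pstdev


def rank_features(X, y):
    """Rank feature indices by a simple two-group separation score.

    Assumption: this score replaces the limma statistics for illustration only.
    """
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-9  # guard against division by zero
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    return [j for _, j in sorted(scores, reverse=True)]


def nearest_centroid_predict(X_train, y_train, X_test, features):
    """Assign each test row to the class whose centroid is closest."""
    centroids = {}
    for label in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    predictions = []
    for row in X_test:
        dist = {
            label: sum((row[j] - c) ** 2 for j, c in zip(features, centre))
            for label, centre in centroids.items()
        }
        predictions.append(min(dist, key=dist.get))
    return predictions


def cv_accuracy(X, y, n_features, k=10, seed=0):
    """k-fold cross-validated accuracy with feature selection inside each fold."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        top = rank_features(X_train, y_train)[:n_features]
        preds = nearest_centroid_predict(X_train, y_train, [X[i] for i in fold], top)
        correct += sum(p == y[i] for p, i in zip(preds, fold))
    return correct / len(X)
```

Calling cv_accuracy for a range of n_features values and for each candidate classifier gives the kind of comparison described above, from which the best combination can be picked.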

Transcriptomics Pipeline

A client was generating large volumes of microarray data, but was bottlenecked by a manual analysis process that required several days of labor per dataset. We developed a pipeline that uses modular template documents to streamline the analysis and reporting process. Our customized framework allows the user to specify parameters for model specification, visualizations and analysis methodologies.

As a foundation, the analysis builds on standard methodologies and object classes from the Bioconductor platform. The pipeline implements unsupervised analysis, extraction of features of interest from user-customized models, and comparisons across contrasts and models. The entire pipeline is called via a Shiny interface and is hosted in the Open Analytics datacenter. This automation, especially in the reporting steps, significantly increases the speed of the analysis while still allowing the client to hand-tailor their approach for each dataset.

Catching Medical Insurance Fraud

An insurance company wanted to be able to algorithmically detect instances of fraud in their medical insurance records. This was a proper “big data” setting, with hundreds of millions of records from pharmacists, individual subscribers and medical practitioners. On the computing side, we developed and implemented ‘ROpenCL’, a package that optimizes multicore GPU computations. ROpenCL is built with R, Python and PyOpenCL, and enables efficient parallelization of analysis for very large datasets.

On the statistical side, we trained a nonparametric classification model on a labelled test dataset containing known fraudulent and legitimate records. This was used to generate metrics and decision trees for flagging potentially fraudulent records. These were applied to the primary dataset via ROpenCL, and a shortlisted subset of records was returned to the client for final review.
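To illustrate how a tree-based rule can flag records, here is a minimal Python sketch of a decision stump, which is a decision tree with a single split. It is a simplified stand-in and not the model used in the project; the function names and the single numeric feature are hypothetical. The stump picks the threshold that minimizes the weighted Gini impurity of known fraudulent (1) and legitimate (0) records, then flags the side with the higher fraud rate.

```python
from statistics import mean


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def fit_stump(values, labels):
    """Find the single threshold that best separates the two classes.

    Returns None when all values are identical and no split is possible.
    """
    best = None
    cuts = sorted(set(values))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        below = [l for v, l in zip(values, labels) if v <= threshold]
        above = [l for v, l in zip(values, labels) if v > threshold]
        impurity = (len(below) * gini(below) + len(above) * gini(above)) / len(labels)
        if best is None or impurity < best["impurity"]:
            best = {
                "threshold": threshold,
                "impurity": impurity,
                # Flag whichever side has the higher share of known fraud.
                "flag_above": mean(above) > mean(below),
            }
    return best


def flag_records(values, stump):
    """Apply a fitted stump to new records; True means 'send for review'."""
    if stump["flag_above"]:
        return [v > stump["threshold"] for v in values]
    return [v <= stump["threshold"] for v in values]
```

A full decision tree applies the same search recursively to each side of the split and across many features. The flagged records are only a shortlist, which matches the workflow above where the final review stays with the client.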

Spatial Epidemiology Applications

For an international governmental organization we built a set of interactive Shiny applications, including:

  • sample size calculations and random selection of a sample of the estimated size
  • benchmark dose modelling with automatic report generation
  • spatio-temporal analysis of infectious disease data
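As an illustration of the first item in the list, the following Python sketch computes a sample size and then draws a random sample of that size. The applications' actual formulas are not described here, so this assumes a common textbook choice: Cochran's formula for estimating a proportion, with a finite population correction. The function names and default values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, margin=0.05, confidence=0.95, proportion=0.5):
    """Sample size for estimating a proportion (Cochran's formula, assumed here)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    n = n0 / (1 + (n0 - 1) / population)  # finite population correction
    return min(population, math.ceil(n))


def draw_sample(units, n, seed=None):
    """Select n units at random without replacement."""
    return random.Random(seed).sample(units, n)
```

For example, sample_size(2000) gives the number of units to inspect out of 2,000, and draw_sample(list_of_unit_ids, n) returns the units to visit. Passing a seed makes the selection reproducible.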

One of the applications aims to track an infectious disease and prevent future outbreaks by analyzing and visualizing spatio-temporal data that has been collected over the previous years. In the exploratory analysis, local statistics (Moran’s I and Geary’s C) help to recognize potential clusters and hotspots. Simple smoothed predictions over space and time can be calculated and visualized using ordinary kriging. Logistic regression models allow the inclusion of spatial and temporal effects, spatio-temporal interactions and potential covariates. These models can be either Bayesian hierarchical models or generalized additive models, and several model structures can be compared using model diagnostics. Summary measures for the estimated response values are visualized over space and time.
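The spatial autocorrelation statistics mentioned above can be made concrete with a short Python sketch of Moran's I in its global form. It is an illustration only and not the code of the R package; the function names and the simple chain-shaped neighbourhood are assumptions. Values near +1 indicate that neighbouring locations have similar values (clustering), and values near -1 indicate that neighbours tend to differ.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights[i][j] is the spatial weight between locations i and j
    (here 1 for neighbours and 0 otherwise, with a zero diagonal).
    """
    n = len(values)
    centre = sum(values) / n
    dev = [v - centre for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in dev)


def chain_weights(n):
    """Binary weights for n locations on a line (hypothetical layout)."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1
    return w
```

With four locations on a line, steadily increasing counts such as [1, 2, 3, 4] give a positive value, while alternating counts such as [1, 4, 1, 4] give a negative one. Local versions of the statistic split this sum into one contribution per location, which is what makes them useful for spotting individual hotspots.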

An interactive Shiny application guides the user in uploading data, performing an exploratory analysis and fitting spatio-temporal models. All methods have been implemented in a stand-alone R package, which allows advanced users to adapt and run specific analyses outside the interactive environment.

Applications are hosted using ShinyProxy, which offers a secure platform for restricting access to each application to its specific target group.

Differentiating Crystal Structures Via Electron Microscopy

In this project, we used image processing techniques to analyze and optimize crystallization procedures in the field of pharmaceutical manufacturing. For manufacturability, the right conditions are sought to form robust crystal structures, facilitating subsequent steps in the manufacturing process.

Crystal structures can be assessed by scanning electron microscopy (SEM). Traditionally, visual inspection of the SEM images gives only a subjective impression of the crystallization result. With image processing techniques, however, we can objectively measure crystal properties like shape, size and texture, and correlate them with manufacturing design variables.

Additionally, we automate image segmentation by using edge detection and contour growing algorithms to delineate individual crystals in the image. Feature extraction techniques are then applied to the segmented images, providing us with quantitative measures for steering the crystallization process.
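The segmentation and feature extraction steps can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses an intensity threshold followed by region growing (flood fill) instead of the edge detection and contour growing used in the project, and the function names and the list-of-lists image format are hypothetical. It shows how delineated regions turn into quantitative size and shape measures.

```python
from collections import deque


def label_regions(image, threshold):
    """Group bright, 4-connected pixels into regions by region growing.

    image is a list of rows of grey values; returns one pixel list per region.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or labels[r][c]:
                continue  # background pixel, or already assigned to a region
            label = len(regions) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not labels[ny][nx] and image[ny][nx] >= threshold:
                        labels[ny][nx] = label
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def describe(pixels):
    """Simple size and shape features for one segmented region."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "area": len(pixels),
        "height": height,
        "width": width,
        "aspect_ratio": width / height,
    }
```

Applying describe to every region returned by label_regions yields a table with one row per crystal, which can then be related to the manufacturing design variables.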

Interactive Wildlife Reports

For a governmental institute, we reviewed their yearly report on geographical and biological characteristics of certain animal species. In close collaboration, we implemented an interactive Shiny application for data visualization and report generation. Users can easily filter and subset their data by year, animal type and geographic region to create a diverse suite of time series and spatial visualizations.

The dynamic application interface allows users to adapt the contents of the report to their specific interests. Users can also download their graphs and the underlying data for use outside the application. The client hosts this application on its own server using ShinyProxy.