Novel Tools for Missing Protein Identification and Quantification

The high-dimensional, complex, and indirect nature of proteomic mass spectrometry data poses significant difficulties for data science, including unacceptably high rates of missing values and pervasive batch effects. Weijia’s research addresses these challenges in two key areas:

1) Exploring the interactions between missing-value imputation, data normalization, and batch effect correction. This work includes detailed studies of how batch effects can influence missing-value imputation, and provides guidance on when to perform batch effect correction and missing-value imputation.

2) Developing and refining the PROTREC-PROSE-PROJECT family of missing protein inference and quantification methods. These methods use protein complexes and other biological networks as prior knowledge to infer missing proteins in a straightforward manner. PROTREC infers the probability of a protein being present from the probabilities of its parent complexes using Bayesian reasoning, and does not require information from other samples. PROSE prioritizes proteins based on gene co-expression network matrices. PROJECT takes advantage of surrounding information and imputes missing values using a combined similarity regression model. It remains stable and robust regardless of how the missing values are allocated, and is not restricted by sample size or the distribution of sample values.
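
To make the idea of using protein complexes as prior knowledge more concrete, here is a minimal Python sketch. It is not the published PROTREC algorithm: the Bayesian step is replaced by a much simpler proxy, in which each complex is scored by the fraction of its members observed in a single sample, and each unobserved protein takes the highest score among its parent complexes. The function names and the toy complexes are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple


def complex_scores(observed: Set[str],
                   complexes: Dict[str, Set[str]]) -> Dict[str, float]:
    """Score each complex by the fraction of its members that were observed.

    This simple fraction is an assumed stand-in for a proper
    Bayesian complex-level probability.
    """
    return {
        name: len(members & observed) / len(members)
        for name, members in complexes.items()
        if members  # skip empty complexes to avoid division by zero
    }


def rank_missing(observed: Set[str],
                 complexes: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank unobserved proteins by the best score among their parent complexes."""
    scores = complex_scores(observed, complexes)
    best: Dict[str, float] = {}
    for name, members in complexes.items():
        for protein in members - observed:
            best[protein] = max(best.get(protein, 0.0), scores[name])
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Toy example: D belongs to a mostly observed complex, E does not.
toy_complexes = {"C1": {"A", "B", "C", "D"}, "C2": {"D", "E"}}
toy_observed = {"A", "B", "C"}
print(rank_missing(toy_observed, toy_complexes))
```

In this toy case, D is ranked above E because three of the four members of C1 were observed, while no member of C2 was. As in the description above, the sketch uses only information from a single sample.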

These methods have shown promising results and hold potential for advancing the development of robust proteomics-based diagnostic and prognostic models and identifying disease-causing protein dysfunctions.