UEF eREPOSITORY

Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study

Files
Article (1.334Mb)
Self archived version
published version
Date
2019
Author(s)
Kokla, M
Virtanen, J
Kolehmainen, M
Paananen, J
Hanhineva, K
Unique identifier
10.1186/s12859-019-3110-0

Self-archived article

Citation
Kokla, M., Virtanen, J., Kolehmainen, M., Paananen, J., & Hanhineva, K. (2019). Random forest-based imputation outperforms other methods for imputing LC-MS metabolomics data: a comparative study. BMC Bioinformatics, 20(1), 492. doi:10.1186/s12859-019-3110-0
Rights
© Authors
Licensed under
CC BY http://creativecommons.org/licenses/by/4.0/
Abstract

Background
LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. There is therefore a need for proper imputation strategies that account for the missingness and reduce bias in the statistical analysis.

Results
Here we present the results of evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by Normalized Root Mean Squared Error (NRMSE). Random forest (RF) had the lowest NRMSE when estimating values that were Missing at Random (MAR) and Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the imputation methods on datasets containing missing data of mixed origin, and RF was the most accurate method in all of these cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets into which missing values were introduced to represent absent data of different origins.
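
The evaluation loop described above (mask known values, impute them, score with NRMSE, repeat) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the data are synthetic, all function names are hypothetical, and the NRMSE definition (root of mean squared error divided by the variance of the true values) is an assumption, since the abstract does not state which normalization was used. Only two simple fills (observed minimum and observed mean) are compared; random forest imputation itself is not implemented here.

```python
import math
import random
import statistics


def nrmse(truth, imputed):
    """Assumed definition: sqrt(mean squared error / variance of the true values)."""
    mse = statistics.fmean([(t - i) ** 2 for t, i in zip(truth, imputed)])
    return math.sqrt(mse / statistics.pvariance(truth))


def mask_mcar(values, rate, rng):
    """MCAR: pick the indices to remove completely at random."""
    n_missing = round(rate * len(values))
    return sorted(rng.sample(range(len(values)), n_missing))


def mask_mnar_left(values, rate):
    """MNAR by left truncation: remove the lowest values, as below a detection limit."""
    n_missing = round(rate * len(values))
    order = sorted(range(len(values)), key=values.__getitem__)
    return sorted(order[:n_missing])


def impute_constant(values, missing_idx, how):
    """Fill every masked entry with the minimum or the mean of the observed entries."""
    missing = set(missing_idx)
    observed = [v for k, v in enumerate(values) if k not in missing]
    fill = min(observed) if how == "min" else statistics.fmean(observed)
    return [fill for _ in missing_idx]


def evaluate(values, missing_idx, how):
    """Score one imputation against the values that were masked out."""
    truth = [values[k] for k in missing_idx]
    return nrmse(truth, impute_constant(values, missing_idx, how))


rng = random.Random(0)
# One synthetic, right-skewed "feature" standing in for metabolite intensities.
feature = [rng.lognormvariate(10, 1) for _ in range(200)]

results = {}
mnar_idx = mask_mnar_left(feature, 0.2)
for how in ("min", "mean"):
    results[("MNAR", how)] = evaluate(feature, mnar_idx, how)
    # MCAR masks are random, so average the score over 100 repeats.
    runs = [evaluate(feature, mask_mcar(feature, 0.2, rng), how) for _ in range(100)]
    results[("MCAR", how)] = statistics.fmean(runs)

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

Under left truncation, the observed minimum lies closer to every removed value than the observed mean does, so the minimum fill always scores the lower NRMSE in the MNAR case of this sketch. That matches the direction of the MNAR finding reported above, but it says nothing about how the nine methods in the study compare.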

Conclusion
The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performed best in most of the tested scenarios, including combinations of different types and rates of missingness. We therefore recommend random forest-based imputation for missing metabolomics data, especially when the types of missingness are not known in advance.

URI
https://erepo.uef.fi/handle/123456789/7981
Link to the original item
http://dx.doi.org/10.1186/s12859-019-3110-0
Collections
  • Terveystieteiden tiedekunta [1324]
University of Eastern Finland
OpenAccess
eRepo
erepo@uef.fi
OpenUEF
Service provided by
the University of Eastern Finland Library
Library web pages
Twitter
Facebook
Youtube
Library blog
 sitemap