
SOFTWARE

Open Access

Seagull: lasso, group lasso and sparse-group lasso regularization for linear regression models via proximal gradient descent

Jan Klosa1, Noah Simon2, Pål Olof Westermark1, Volkmar Liebscher3 and Dörte Wittenburg1*

* Correspondence: [email protected]
1 Institute of Genetics and Biometry, Leibniz Institute for Farm Animal Biology, 18196 Dummerstorf, Germany
Full list of author information is available at the end of the article

Abstract

Background: Statistical analyses of biological problems in life sciences often lead to high-dimensional linear models. To solve the corresponding system of equations, penalization approaches are often the methods of choice. They are especially useful in case of multicollinearity, which appears if the number of explanatory variables exceeds the number of observations or for some biological reason. Then, the model goodness of fit is penalized by some suitable function of interest. Prominent examples are the lasso, group lasso and sparse-group lasso. Here, we offer a fast and numerically cheap implementation of these operators via proximal gradient descent. The grid search for the penalty parameter is realized by warm starts. The step size between consecutive iterations is determined with backtracking line search. Finally, seagull, the R package presented here, produces complete regularization paths.

Results: Publicly available high-dimensional methylation data are used to compare seagull to the established R package SGL. The results of both packages enabled a precise prediction of biological age from DNA methylation status. But even though the results of seagull and SGL were very similar (R² > 0.99), seagull computed the solution in a fraction of the time needed by SGL. Additionally, seagull enables the incorporation of weights for each penalized feature.

Conclusions: The following operators for linear regression models are available in seagull: lasso, group lasso, sparse-group lasso and Integrative LASSO with Penalty Factors (IPF-lasso). Thus, seagull is a convenient envelope of lasso variants.

Keywords: Optimization, Machine learning, High-dimensional data, R package
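The three ingredients named in the abstract (proximal gradient descent, backtracking line search and warm starts along a grid of penalty parameters) can be illustrated for the plain lasso. The following is a minimal sketch in Python, not seagull's code: seagull is an R package, and all function names here (`soft_threshold`, `lasso_prox_grad`, `lasso_path`) are hypothetical. It assumes the objective (1/2n)·||y − Xb||² + λ·||b||₁; the scaling of the loss term is an assumption, not taken from the article.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def predict(X, b):
    """Matrix-vector product X b for a list-of-rows matrix."""
    return [sum(x * bj for x, bj in zip(row, b)) for row in X]

def smooth_loss(X, y, b):
    """Differentiable part of the objective: (1 / 2n) * ||y - X b||^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, predict(X, b))) / (2.0 * len(y))

def gradient(X, y, b):
    """Gradient of smooth_loss with respect to b: X^T (X b - y) / n."""
    resid = [fi - yi for fi, yi in zip(predict(X, b), y)]
    return [sum(row[j] * r for row, r in zip(X, resid)) / len(y)
            for j in range(len(b))]

def lasso_prox_grad(X, y, lam, b0, step=1.0, shrink=0.5,
                    max_iter=1000, tol=1e-10):
    """Proximal gradient descent for the lasso, started at b0."""
    b = list(b0)
    for _ in range(max_iter):
        g = gradient(X, y, b)
        f_old = smooth_loss(X, y, b)
        t = step
        while True:
            # Gradient step on the smooth loss, then the lasso proximal step.
            new = [soft_threshold(bj - t * gj, t * lam) for bj, gj in zip(b, g)]
            diff = [nj - bj for nj, bj in zip(new, b)]
            # Backtracking: accept t once the quadratic upper bound holds.
            bound = (f_old + sum(gj * dj for gj, dj in zip(g, diff))
                     + sum(dj * dj for dj in diff) / (2.0 * t))
            if smooth_loss(X, y, new) <= bound + 1e-12:
                break
            t *= shrink
        b = new
        if max(abs(dj) for dj in diff) < tol:
            break
    return b

def lasso_path(X, y, lambdas):
    """Regularization path: each solution warm-starts the next, smaller lambda."""
    b = [0.0] * len(X[0])
    path = []
    for lam in sorted(lambdas, reverse=True):
        b = lasso_prox_grad(X, y, lam, b)  # warm start from the previous fit
        path.append((lam, list(b)))
    return path
```

Calling `lasso_path(X, y, [2.5, 1.5, 0.5])` returns one coefficient vector per penalty value, ordered from the largest penalty to the smallest. In this sketch, group lasso or sparse-group lasso would differ only in the proximal step that replaces `soft_threshold`; how seagull organizes this internally is not shown here.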

Background

Linear regression is a widely used tool to explore the dependence between a response variable and explanatory variables. For example, in genome-wide association studies, counts of genetic variants along the genome are related to records of a disease or performance trait. The high throughput of modern biotechnological procedures enables

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the materi