glut in ml

2024-11-05 13:53:32 +03:00
parent 88a0b910c3
commit 8c5ca993e9


@@ -149,6 +149,12 @@
The authors of article~\cite{kras} applied machine learning algorithms for two purposes. First, they used algorithms to extract genes highly related to therapy resistance. Each sample in their data contained the expression of 8687 genes, of which only a small portion was correlated with targeted therapy resistance. To extract the highly related genes, the authors tested seven algorithms: Least Absolute Shrinkage and Selection Operator (LASSO), Light Gradient Boosting Machine (LightGBM), Monte Carlo Feature Selection (MCFS), Minimum Redundancy Maximum Relevance (mRMR), Random Forest (RF)-based feature selection, Categorical Boosting (CatBoost), and eXtreme Gradient Boosting (XGBoost). Second, they selected four algorithms to perform binary classification (resistant vs.\ sensitive) of tumor cells based on the extracted features, namely random forest (RF), support vector machine (SVM), K-Nearest Neighbors (KNN), and decision tree (DT).
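The following is a minimal sketch of such a two-stage pipeline (feature selection followed by binary classification) using scikit-learn in Python. The LASSO step is approximated here with an L1-penalised logistic regression, and all data, parameter values, and thresholds are illustrative assumptions rather than the settings used in~\cite{kras}:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data: 200 samples x 8687 gene expression values,
# labels 0 = sensitive, 1 = resistant (synthetic, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8687))
y = rng.integers(0, 2, size=200)

pipeline = Pipeline([
    # Stage 1: keep at most 50 genes with the largest absolute weights
    # of an L1-penalised (LASSO-like) logistic regression
    ("select", SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
        threshold=-np.inf, max_features=50)),
    # Stage 2: classify resistant vs sensitive on the selected genes
    ("classify", RandomForestClassifier(n_estimators=500, random_state=0)),
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("mean cross-validated AUC:", scores.mean())
\end{verbatim}
The same skeleton accepts any of the other selectors or classifiers mentioned above (for example, a gradient boosting model as the selector or an SVM as the classifier) by swapping in the corresponding estimator.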
The authors of article~\cite{glut} took an alternative approach: instead of directly predicting chemotherapy resistance, they constructed a machine learning-derived immunosenescence-related score (MLIRS). Patients with high MLIRS had a worse prognosis, whereas the low-MLIRS group demonstrated greater sensitivity to both chemotherapy and immunotherapy. To obtain an optimal hazard scoring system, they trained a total of 101 combinations of machine learning algorithms (evaluated with 10-fold cross-validation) built from 10 base methods: survival support vector machine (survival-SVM), CoxBoost, random survival forest (RSF), Lasso, stepwise Cox, partial least squares regression for Cox (plsRcox), Ridge, supervised principal components (SuperPC), elastic net (Enet), and generalized boosted regression modeling (GBM). In this study, these algorithms were applied to a regression task, allowing the authors to compute the coefficients of the MLIRS formula:
$$
\text{MLIRS} = (\text{expr}_{\text{gene}_1} \times \text{coef}_{\text{gene}_1}) + (\text{expr}_{\text{gene}_2} \times \text{coef}_{\text{gene}_2}) + \ldots + (\text{expr}_{\text{gene}_n} \times \text{coef}_{\text{gene}_n})
$$
where \(\text{expr}_{\text{gene}}\) denotes the expression level of each gene and \(\text{coef}_{\text{gene}}\) is the corresponding coefficient determined by the model. The authors derived the C-index of each machine learning algorithm on each dataset and identified the algorithm with the largest mean C-index as the optimal hazard scoring algorithm.
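As an illustration of how such a score and its evaluation fit together, the sketch below computes a hazard score as a weighted sum of gene expression values and derives its C-index with the Python package lifelines; the expression matrix, coefficients, and survival data are synthetic placeholders, not values from~\cite{glut}:
\begin{verbatim}
import numpy as np
from lifelines.utils import concordance_index

# Synthetic placeholders: expression of n selected genes for m patients,
# model-derived coefficients, and follow-up information
rng = np.random.default_rng(0)
m, n = 150, 20
expr = rng.normal(size=(m, n))              # expression matrix (patients x genes)
coef = rng.normal(size=n)                   # coefficients from the fitted model
time = rng.exponential(scale=36.0, size=m)  # follow-up time, e.g. in months
event = rng.integers(0, 2, size=m)          # 1 = event observed, 0 = censored

# The score is the weighted sum of expression values, as in the formula above
score = expr @ coef

# C-index: concordance between the score and survival; the score is a hazard
# (higher = worse prognosis), so its sign is flipped for lifelines
cindex = concordance_index(time, -score, event)
print("C-index:", cindex)
\end{verbatim}
In the same spirit, each of the 101 algorithm combinations would be fitted, its mean C-index across datasets computed, and the combination with the largest mean C-index retained.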
\section{Datasets}
Data plays a crucial role in machine learning, serving as the foundation for model training and evaluation. The quality and quantity of data directly influence the performance and generalizability of machine learning algorithms. In the fields of biology and medicine, data collection is often costly and time-consuming. Additionally, the complexity and variability inherent in biological systems further complicate data acquisition and interpretation. In cancer research, these challenges are even more pronounced due to the heterogeneity of tumors and the intricate nature of cancer biology. However, there are valuable resources available, such as the Gene Expression Omnibus (GEO) database~\cite{geo} and The Cancer Genome Atlas (TCGA) database~\cite{tcga}, which provide researchers with access to extensive datasets. Moreover, nonprofit organizations like the American Type Culture Collection (ATCC)~\cite{atcc} enable researchers to obtain biological materials, including cancer cells.
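As an illustration of how such public data can be accessed programmatically, the snippet below retrieves a GEO series with the Python package GEOparse (one possible tool among several); the accession is an arbitrary public example, not one of the datasets used in the studies discussed above:
\begin{verbatim}
import GEOparse

# Download and parse a GEO series by its accession number
# (GSE1563 is an arbitrary public example series)
gse = GEOparse.get_GEO(geo="GSE1563", destdir="./geo_cache")

# Sample-level metadata (clinical/phenotype annotations)
metadata = gse.phenotype_data

# Expression matrix with probes as rows and samples as columns
expression = gse.pivot_samples("VALUE")
print(expression.shape)
\end{verbatim}
TCGA data can be obtained in a comparable way through the NCI Genomic Data Commons (GDC) portal and its API.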