cervical in feature analysis

This commit is contained in:
2024-11-30 22:04:28 +03:00
parent 95049d4a2f
commit f35f9e01b1


In \cite{kras}, feature importance analysis was employed to identify genes associated with resistance to KRAS G12C inhibitor treatment in cancer cells. The authors used seven feature ranking algorithms: LASSO, LightGBM, MCFS, mRMR, RF-based, CatBoost, and XGBoost. Because these algorithms rank features according to different principles, they enabled a comprehensive evaluation of gene significance. To refine the selection, the authors applied Incremental Feature Selection (IFS), testing the performance of classifiers such as Decision Tree (DT), k-Nearest Neighbors (KNN), Random Forest (RF), and Support Vector Machine (SVM) on growing prefixes of each ranked feature list. This feature analysis highlighted several key genes, such as H2AFZ, CKS1B, and TUBA1B, which were consistently ranked highly across multiple algorithms and are linked to tumor progression and drug resistance.
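The IFS procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic data, the toy correlation-based ranking, and all function names are assumptions standing in for the gene-expression matrix and the seven ranking algorithms of the paper.

```python
# Sketch of Incremental Feature Selection (IFS): given a pre-ranked
# feature list, evaluate a classifier on the top-k features for
# k = 1..n and keep the k with the best cross-validated score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def incremental_feature_selection(X, y, ranked_idx, clf):
    """Return (best_k, scores) over growing prefixes of ranked_idx."""
    scores = []
    for k in range(1, len(ranked_idx) + 1):
        subset = X[:, ranked_idx[:k]]          # top-k ranked features
        scores.append(cross_val_score(clf, subset, y, cv=5).mean())
    best_k = int(np.argmax(scores)) + 1        # k is 1-based
    return best_k, scores

# Synthetic stand-in for the gene-expression matrix (illustrative only)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
# Toy ranking: features ordered by absolute correlation with the label
ranking = np.argsort(-np.abs(np.corrcoef(X.T, y)[:-1, -1]))
best_k, scores = incremental_feature_selection(
    X, y, ranking, DecisionTreeClassifier(random_state=0))
```

In the study, the same loop would be repeated for each of the seven rankings and each of the four classifiers, and the best-scoring feature subset retained.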
The authors of \cite{cervical} used feature importance analysis based on the Random Forest (RF) model to identify key SNPs related to NACT sensitivity in LACC patients. The importance of each feature was calculated by assessing its impact on impurity reduction at each node in the RF model, with a larger decrease in impurity indicating greater feature importance. The mean decrease in impurity (MDI) was calculated using the total decrease in impurity averaged over all decision trees. The impurity \(g\) of a split was computed as:
\[
g = 2 \cdot p_A \cdot p_B
\]
where \(p_A\) and \(p_B\) represent the probabilities of class A and class B, respectively. The overall impurity \(G\) for a split was calculated as the weighted average of the impurities of its two sub-splits:
\[
G = P_1 \cdot g_{1} + P_2 \cdot g_{2}
\]
where \(P_1\) and \(P_2\) are the proportions of data in the sub-splits and \(g_1\), \(g_2\) are the impurities of each sub-split. Feature importance for each SNP was calculated by summing the importance of all features generated by each SNP (after one-hot encoding):
\[
L_i = \sum_j f_{i,j}
\]
where \(L_i\) is the importance of SNP \(i\), and \(f_{i,j}\) represents the importance of feature \(j\) generated by SNP \(i\).
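The three formulas above translate directly into code. The following is a minimal sketch; the function names and the mapping from one-hot features back to SNPs are illustrative assumptions, not taken from the paper.

```python
def gini_binary(p_a):
    """Binary Gini impurity g = 2 * p_A * p_B, with p_B = 1 - p_A."""
    return 2.0 * p_a * (1.0 - p_a)

def split_impurity(n1, g1, n2, g2):
    """Weighted impurity G = P_1*g_1 + P_2*g_2 of a two-way split,
    where P_1, P_2 are the data proportions in the two sub-splits."""
    total = n1 + n2
    return (n1 / total) * g1 + (n2 / total) * g2

def snp_importance(feature_importances, snp_of_feature):
    """Aggregate per-feature importances f_{i,j} into per-SNP totals
    L_i = sum_j f_{i,j}; snp_of_feature[j] maps one-hot-encoded
    feature j back to its source SNP i (a hypothetical mapping)."""
    L = {}
    for f, snp in zip(feature_importances, snp_of_feature):
        L[snp] = L.get(snp, 0.0) + f
    return L

# A perfectly mixed node (p_A = p_B = 0.5) has maximal impurity 0.5;
# a split isolating one pure child lowers the weighted impurity.
g_parent = gini_binary(0.5)                 # 0.5
G = split_impurity(60, 0.5, 40, 0.0)        # 0.3
L = snp_importance([0.2, 0.1, 0.3], ["rs1", "rs1", "rs2"])
```

In practice, RF libraries report the per-feature MDI values directly (e.g. scikit-learn's `feature_importances_`), so only the final per-SNP summation step is specific to this study's one-hot encoding.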
\section{Results}
In all of these works, the construction of machine learning models is essentially a secondary result. First and foremost, the studies demonstrate the applicability of these methods to problems of cancer cell resistance to chemotherapy. In addition, the authors use machine learning to test their hypotheses, confirming or discovering links between various characteristics of cancer cells, patients' clinical data, and drug resistance.