Integrating Built-in Feature Importance and Mutual Information for Efficient Android Malware Classification Model
DOI:
https://doi.org/10.25139/inform.v11i1.11220
Keywords:
Malware Classification Model, Machine Learning, Feature Selection, Feature Importance, Data Security, Android
Abstract
The rapid growth of the Android operating system in the global market demands efficient and accurate malware detection. This study proposes a machine-learning approach to Android malware classification that optimizes feature selection to balance performance against computational cost. Using the CICMalDroid2020 dataset, which consists of 11,598 samples and 470 dynamic features, the study evaluates four machine learning algorithms (LightGBM, XGBoost, Random Forest, and K-Nearest Neighbours) combined with two feature selection methods: Mutual Information (MI) and Embedded Feature Importance (FI). Experiments were conducted with automatic feature selection over 50-475 features to identify the optimal configuration for each model. The results show that LightGBM with FI performs best, reaching an accuracy of 96.49% and an F1-score of 95.74% with only 270 features (a 42.6% reduction) and the fastest test time of 0.036 seconds. XGBoost with FI achieves 96.31% accuracy with 225 features (a 52.1% reduction), Random Forest with MI achieves 95.62% with 240 features, and KNN with MI achieves 91.37% with 135 features. Feature overlap analysis reveals that the 135 core features selected by KNN with MI are a subset of the features selected for the other models, with dominant categories including system calls (40%), Android API (25%), network operations (15%), file system patterns (10%), and behavioural patterns (10%). The results indicate that the Feature Importance method of tree-based algorithms outperforms Mutual Information by 5-6% in capturing non-linear dependencies and complex interactions in malware behaviour. For example, tree-based methods more readily detect contextual patterns such as the combination of getDeviceId and NETWORK_ACCESS, which is dangerous only when both occur together.
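The point about contextual patterns can be illustrated with a toy calculation. The sketch below is not the study's pipeline: it assumes two hypothetical binary features (named after getDeviceId and NETWORK_ACCESS) and a label that is malicious only when both are present, then compares the mutual information of each feature scored alone with that of the pair scored jointly. A univariate MI ranking sees only the first two numbers.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: 100 apps, every combination of the two flags equally
# likely; an app is labelled malicious only when both flags are set.
samples = [(device_id, network) for device_id in (0, 1) for network in (0, 1)] * 25
labels = [device_id & network for device_id, network in samples]

mi_device = mutual_information([s[0] for s in samples], labels)
mi_network = mutual_information([s[1] for s in samples], labels)
mi_joint = mutual_information(samples, labels)

print(f"getDeviceId alone:    {mi_device:.3f} bits")
print(f"NETWORK_ACCESS alone: {mi_network:.3f} bits")
print(f"both, scored jointly: {mi_joint:.3f} bits")
```

On this toy data each feature alone carries about 0.311 bits about the label, while the pair carries about 0.811 bits, the full entropy of the label. A model that can split on one feature and then the other, as a decision tree does, has access to that joint information.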
The optimal range of 225-270 features provides a sweet spot between comprehensiveness and efficiency; XGBoost with 225 features is only 0.18 percentage points below LightGBM in accuracy but 16.7% more computationally efficient, making it well suited to real-time scanning. The main contribution of this research is a lightweight yet reliable model built without destructive sampling techniques, providing a practical solution for real-time malware detection on resource-constrained Android devices. The approach reduces dimensionality by up to 52% while maintaining, or even improving, performance, contributing to the development of efficient, accurate, and applicable Android malware detection techniques for real-time security systems. For further development, exploring LightGBM-XGBoost ensembles could increase accuracy beyond 97%, along with advanced feature engineering and periodic evaluation against the latest malware variants.
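The percentages quoted in the abstract can be re-derived from the reported feature counts and accuracies. The short check below uses only numbers stated above; the helper name `reduction_pct` is illustrative. Note that 16.7% coincides with the relative difference in feature count between 270 and 225; whether the authors derived their efficiency figure this way is an assumption.

```python
TOTAL_FEATURES = 470  # dynamic features in CICMalDroid2020, as stated in the abstract

# Features kept by each reported configuration.
selected = {
    "LightGBM FI": 270,
    "XGBoost FI": 225,
    "Random Forest MI": 240,
    "KNN MI": 135,
}


def reduction_pct(kept, total=TOTAL_FEATURES):
    """Percentage of the original features that were discarded."""
    return 100 * (total - kept) / total


for name, kept in selected.items():
    print(f"{name}: {kept} features kept, {reduction_pct(kept):.1f}% reduction")

# Accuracy gap between the two best models, in percentage points.
accuracy_gap = 96.49 - 96.31

# Relative difference in feature count between LightGBM FI and XGBoost FI.
fewer_features_pct = 100 * (270 - 225) / 270

print(f"accuracy gap: {accuracy_gap:.2f} percentage points")
print(f"XGBoost uses {fewer_features_pct:.1f}% fewer features than LightGBM")
```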
License
Authors who publish with Inform: Jurnal Ilmiah Bidang Teknologi Informasi dan Komunikasi agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY-SA 4.0) that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.








