Machine learning as a support for data exfiltration attacks: an integrated architecture for reducing detection risks
Keywords:
Data exfiltration; Machine learning; Active cyber defense.
Abstract
This article proposes an integrated architecture for optimizing data exfiltration actions using open data sources and machine learning techniques. The application flow detects the artifacts of greatest interest to the executing agent through target selection, which reduces the volume of data involved in the exfiltration action and thereby speeds up the action and lowers the risk of detection. The proposed architecture has three components: a crawler that uses well-known web search engines to find and collect files in PDF format; a topic modeling component that classifies the collected files; and a machine learning component that uses the classified documents to train an algorithm to identify similar documents. Through a proof-of-concept implementation, this article demonstrates that the intended objectives are achievable: the proposed architecture reduced the volume of data involved in a data exfiltration action by 90%, which shortens execution time and lowers the risk of the action being detected.
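The idea of selecting documents by similarity and measuring the resulting drop in data volume can be illustrated with a small sketch. This is not the authors' implementation: it swaps the topic modeling and trained classifier described above for a plain bag-of-words cosine similarity, and every name in it (`select_similar`, `volume_reduction`, the file names, the texts, the sizes, and the 0.3 threshold) is a hypothetical chosen for illustration. The numbers are toy data and do not reproduce the article's results.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase a text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar(reference_docs, candidates, threshold=0.3):
    """Return names of candidates whose text resembles the reference set.

    The threshold is an assumed value, not one taken from the article.
    """
    centroid = Counter()
    for text in reference_docs:
        centroid.update(term_counts(text))
    return [
        name
        for name, (text, _size) in candidates.items()
        if cosine(centroid, term_counts(text)) >= threshold
    ]


def volume_reduction(candidates, selected):
    """Fraction of total bytes excluded by keeping only the selected files."""
    total = sum(size for _text, size in candidates.values())
    kept = sum(candidates[name][1] for name in selected)
    return 1 - kept / total


# Toy data: documents already labeled as being of interest.
reference_docs = [
    "radar maintenance schedule for the air base",
    "air base radar calibration report",
]

# Toy data: candidate files as name -> (extracted text, size in bytes).
candidates = {
    "a.pdf": ("radar calibration report for the northern air base", 2000),
    "b.pdf": ("cafeteria lunch menu and holiday schedule", 3000),
    "c.pdf": ("parking permit renewal form", 5000),
}

selected = select_similar(reference_docs, candidates)
reduction = volume_reduction(candidates, selected)
print(selected, f"{reduction:.0%} less data")
```

In this toy run only the file that shares vocabulary with the reference set passes the threshold, so most of the byte volume is excluded. A real pipeline would need text extraction from PDF files and a properly validated model, neither of which is shown here.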
Figure 7 - 53 GB of data collected without the aid of an optimized exploit.
Source: Prepared by the authors (2021).
*Article based on the final project for the Specialization Course in Cyber Warfare at the Centro de Instrução de Guerra Eletrônica (CIGE), by Lieutenants Isabelle Cecilia de Andrade and Guilherme Resende Deus.
Postal address: DF-001, 5, Lago Norte. Brasília, Distrito Federal – DF, CEP: 71559-902.
email: isabelleica@fab.mil.br, guilhermegrd@fab.mil.br.
License
Copyright (c) 2026 Data & Hertz

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
- Attribution (BY): Appropriate credit must be given to the author.
- NonCommercial (NC): The work may not be used for commercial purposes.
- NoDerivatives (ND): The work may not be distributed in modified form.