Machine learning as a support for data exfiltration attacks: an integrated architecture for reducing detection risks

Authors

  • Isabelle Cecilia de Andrade
  • Guilherme Resende Deus

Keywords:

Data exfiltration; Machine learning; Active cyber defense.

Abstract

This article proposes an integrated architecture for optimizing data exfiltration actions using open data sources and machine learning techniques. The application flow detects the artifacts of greatest interest to the executing agent through target selection, which reduces the volume of data involved in the exfiltration action and thereby speeds it up and lowers the risk of detection. The proposed architecture has three components: a crawler that uses well-known web search engines to find and collect files in PDF format; a topic modeling component that classifies the collected files; and a machine learning component that uses the classified documents to train an algorithm to identify similar documents. Through a proof-of-concept implementation, this article demonstrates that the intended objectives are achievable: the proposed architecture reduced the volume of data involved in a data exfiltration action by 90%, shortening execution time and lowering the risk of the action being detected.
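The idea behind the third component, training on already-classified documents and then recognizing similar ones, can be illustrated with a minimal sketch. The abstract does not name the algorithm the authors used, so the nearest-centroid classifier below (bag-of-words counts compared by cosine similarity) is an assumption chosen for brevity, and the labels, sample texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(text):
    """Bag-of-words representation: token -> number of occurrences."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts):
    """Sum the count vectors of one class.

    Cosine similarity ignores vector length, so the sum behaves
    like the mean for comparison purposes.
    """
    total = Counter()
    for text in texts:
        total.update(term_counts(text))
    return total


def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = term_counts(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical labeled documents, standing in for the output of the
# topic modeling component (not data from the article).
LABELED = {
    "network": [
        "router firewall packet routing table",
        "switch vlan packet firewall rules",
    ],
    "finance": [
        "budget invoice quarterly revenue",
        "invoice payment budget forecast",
    ],
}
CENTROIDS = {label: centroid(texts) for label, texts in LABELED.items()}

print(classify("firewall rules for the new router", CENTROIDS))
```

A real pipeline would more likely extract text from the PDF files first and use a library classifier with TF-IDF weighting and cross-validation; this sketch only shows the train-then-match flow on toy data.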


Figure 7 - 53 GB of data collected without the aid of an optimized exploit.

Source: Prepared by the authors (2021).

*Article based on the final project for the Specialization Course in Cyber Warfare at the Centro de Instrução de Guerra Eletrônica (CIGE), by Lieutenants Isabelle Cecilia de Andrade and Guilherme Resende Deus.

Postal address: DF-001, 5, Lago Norte. Brasília, Distrito Federal (DF), CEP: 71559-902.

Email: isabelleica@fab.mil.br, guilhermegrd@fab.mil.br.

Published

2026-04-07

How to Cite

Isabelle Cecilia de Andrade, & Guilherme Resende Deus. (2026). Machine learning as a support for data exfiltration attacks: an integrated architecture for reducing detection risks. Data & Hertz, 2(2). Retrieved from https://ebrevistas.eb.mil.br/index.php/datahertz/article/view/14143