The study titled "Deep Learning Based Document Classification and Sensitive Personal Information Detection", developed specifically for the Mobildev R&D Center using the GPU resources of the high performance computing center at İTÜ UHEM, a member of the PRACE (Partnership for Advanced Computing in Europe) programme supported under EU Horizon 2020, was completed under the supervision of Asst. Prof. Dr. Ayşe Tosun. The product that resulted from the project has been put into use by Mobildev.
The technical report summarizing the details of the system was published on the website of the PRACE SHAPE (SME HPC Adoption Programme in Europe) programme.
Title: Deep Learning Based Topic Classification for Sensitivity Assignment to Personal Data
Authors:
- Apdullah Yayık, Mobildev Research and Development, Istanbul
- Hasan Apik, Mobildev Research and Development, Istanbul
- Ayse Tosun, Faculty of Computer and Informatics Engineering, Istanbul Technical University, Istanbul
- Enver Ozdemir, National Center for High Performance Computing, Istanbul Technical University, Istanbul
Abstract:
Knowing the topic of a text before performing a natural language processing task enables the design of topic-specific pipelines. Because the topic is expressed by all the sentences and words of a document, it can serve as a reference point that describes the document on its own. In this partnership, Mobildev built and deployed a topic classification model that assigns a sensitivity level to extracted personal data within the context of the document's topic. Since data on justice, health, and religion is considered highly sensitive under the Personal Data Protection Rule, it is essential for the model to identify documents on these topics. Therefore, a publicly available dataset was chosen and extended with new document instances from these three categories. Two state-of-the-art machine learning models for natural language processing were assessed on the extended dataset: fastText and Bidirectional Encoder Representations from Transformers (BERT). The performance of the models and the computational costs of training on the server side are reported. After testing on the client side, the most suitable model for lightweight client operation was selected and deployed into Mobildev's existing platform. Model training and assessment were completed in collaboration with the National Center for High Performance Computing at Istanbul Technical University.
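The idea in the abstract, that a document's predicted topic determines the sensitivity level of the personal data found in it, can be illustrated with a minimal Python sketch. This is not the Mobildev implementation: the two-level scheme ("high" / "normal"), the `PersonalDataItem` type, and the function names are all assumptions made for illustration, and the topic is taken as given rather than predicted by fastText or BERT.

```python
from dataclasses import dataclass

# Topics the abstract names as highly sensitive.
HIGHLY_SENSITIVE_TOPICS = {"justice", "health", "religion"}


@dataclass
class PersonalDataItem:
    """Hypothetical container for one piece of extracted personal data."""
    kind: str   # illustrative label, e.g. "name" or "phone"
    value: str


def sensitivity_for_topic(topic: str) -> str:
    """Map a predicted topic to a sensitivity level (assumed two-level scheme)."""
    return "high" if topic.lower() in HIGHLY_SENSITIVE_TOPICS else "normal"


def assign_sensitivity(topic: str, items: list[PersonalDataItem]):
    """Attach the topic-derived sensitivity level to every extracted item."""
    level = sensitivity_for_topic(topic)
    return [(item, level) for item in items]
```

In a real pipeline the `topic` argument would come from the trained classifier, and the sensitivity scheme would follow whatever levels the applicable data protection rules define.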