The project titled "Deep Learning Based Topic Classification for Sensitivity Assignment to Personal Data", carried out for the Mobildev R&D Center, has been completed under the supervision of Asst. Prof. Dr. Ayşe Tosun. The project was developed at ITU UHEM (National Center for High Performance Computing), a member of the PRACE (Partnership for Advanced Computing in Europe) program supported by EU Horizon 2020.

The technical report including the details of the project has been published on the website of the PRACE – SHAPE (SME HPC Adoption Programme in Europe).


Title: Deep Learning Based Topic Classification for Sensitivity Assignment to Personal Data

Authors:
  • Apdullah Yayık, Mobildev Research and Development, Istanbul
  • Hasan Apik, Mobildev Research and Development, Istanbul
  • Ayse Tosun, Faculty of Computer and Informatics Engineering, Istanbul Technical University, Istanbul
  • Enver Ozdemir, National Center for High Performance Computing, Istanbul Technical University, Istanbul
Abstract:
Knowing the topic of textual content before performing a natural language processing task enables the design of topic-specific pipelines. Since the topic is represented by all the sentences and words of the document, it can be accepted as a reference point that can describe the document on its own. In this partnership, Mobildev successfully completed the construction and deployment of a topic classification model in order to assign a sensitivity level to the extracted personal data within the topic context. Since the topics of justice, health, and religion are considered highly sensitive data by the Personal Data Protection Rule, it is essential for the model to identify documents on these topics. Therefore, a publicly available dataset was chosen, and new document instances from these three important categories were added. Two state-of-the-art machine learning models for natural language processing tasks were assessed on the extended dataset: fastText and Bidirectional Encoder Representations from Transformers (BERT). The performance of the models and the computational costs of training on the server side are reported. After testing on the client side, the model most suitable for lightweight client operation was determined and deployed into Mobildev's existing platform. The model training and assessment were successfully completed in collaboration with the National Center for High Performance Computing at Istanbul Technical University.
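
As an illustration of the idea described in the abstract, the following minimal Python sketch shows how a predicted topic label could be turned into a sensitivity level for the personal data extracted from a document. It is not taken from the report: the two-level scheme, the names `Sensitivity`, `assign_sensitivity`, and `label_personal_data`, and the plain-string topic labels are all assumptions made for this example, and the topic is assumed to come from an already trained classifier.

```python
from enum import Enum


class Sensitivity(Enum):
    # Hypothetical two-level scheme; the report may define levels differently.
    NORMAL = 1
    HIGH = 2


# Topics the abstract names as highly sensitive.
HIGHLY_SENSITIVE_TOPICS = {"justice", "health", "religion"}


def assign_sensitivity(topic: str) -> Sensitivity:
    """Map a predicted topic label to a sensitivity level."""
    if topic.strip().lower() in HIGHLY_SENSITIVE_TOPICS:
        return Sensitivity.HIGH
    return Sensitivity.NORMAL


def label_personal_data(items, topic):
    """Attach the document-level sensitivity to each extracted item."""
    level = assign_sensitivity(topic)
    return [(item, level.name) for item in items]


if __name__ == "__main__":
    # The topic would normally be the output of the topic classifier.
    extracted = ["Jane Doe", "jane@example.com"]
    for item, level in label_personal_data(extracted, "health"):
        print(item, level)
```

In this sketch the sensitivity is decided once per document and applied to every extracted item, which mirrors the abstract's point that the topic describes the document as a whole.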