S-HMC Maintenance

This work was the subject of Rafael Peixoto's thesis.

Context

Data production is growing exponentially, lending further weight to the Big Data phenomenon. The term Big Data refers to very large datasets characterized by several dimensions (Volume, Velocity, Variety). Such data or data streams fall outside the scope of traditional, standard processing approaches. This is why extracting knowledge, and the value of that extraction, are complex issues for both industry and consumers who seek to reduce information overload.

Although traditional data analysis methods from the data mining research field appear suited to processing massive data (cf. hierarchical multi-label classification), these methods fail to identify the semantics of the data, which yields an abstruse understanding and reduces the value obtained. Ontologies are a solution for systems operating close to the human level of conceptualization, and they are one of the most widely accepted ways to describe and capture semantics, both in the Semantic Web research community and among the industrial actors using these technologies.

Combining these technologies with massive data processing technologies is the key to the proposed approach for improving value extraction in Big Data, thereby narrowing the gap between users' perspectives and analysis methods (cf. classification models). The objectives of this work are twofold. First, it proposes a new classification method for the hierarchical multi-label classification of articles according to a rule-enriched ontology and the use of "Web Reasoning" methods. Second, it proposes an approach to maintain and evolve the classification system in the context of Big Data and non-stationary data streams.

The research scope

Big Data analysis can be regarded as an analysis technique for a special kind of data. The analysis of Big Data is the final and most important phase in the Big Data value chain, with the purpose of extracting value that provides suggestions or decisions [Chen et al., 2014]. Therefore, many traditional data analysis methods such as data mining algorithms (classification, clustering, regression, among others) may still be used for Big Data analysis [Chen et al., 2014]. The aim is to study the maintainability of a classification process that classifies data items according to an ontology-described classification model using a web reasoner in the context of Big Data. Three main research areas therefore define the scope of this research.

Big Data: The term Big Data is mainly used to describe enormous datasets characterized by an increasing number of V's (Volume, Velocity, Variety, Veracity, Value). Such amounts of data require new forms of processing to enable enhanced decision-making, insight discovery and process optimization. This thesis focuses on automatically classifying data items in the context of Big Data.
Classification: Classification is a machine-learning method used to predict different classes according to some constraints and to create a model to classify newly available data. In recent years, many approaches have been proposed to further improve classification performance by incorporating label correlations or exploiting the label hierarchy. Among all existing types of classification, this thesis focuses on a specific type that exploits connections in the label hierarchy, called Hierarchical Multi-Label Classification (HMC).
Ontologies: Ontologies allow the definition of terms and meanings used to represent areas of knowledge. Ontologies are a good solution for intelligent computer systems that operate close to the human concept level, bridging the gap between human requirements and computational requirements [Obrst, 2003]. From the ontology perspective, this thesis aims to use ontologies to describe the classification model, ontology evolution to evolve the classification model according to data streams, and web reasoning to classify the items.

Since scalability matters in a Big Data context, simple but highly scalable techniques are preferred over traditional classification approaches. The techniques used must allow the process to be parallelized and distributed across several loosely coupled machines. In many domains the number of labels keeps growing, so even simple approaches can easily become computationally infeasible, let alone more sophisticated and computationally demanding ones. When data items are analysed and described automatically in a Big Data context, the number of labels can be larger still. For studying the maintainability of the classification process, two main approaches to classification exist [Ditzler et al., 2015]:

The first consists in learning the classification model from a static set of items (batch learning). In this case, a static classification model is created and all new documents are classified according to it. Hence, classifications do not change over time: once an item is classified with a set of labels, those labels remain the same regardless of when the classification takes place.
The second consists in incrementally learning the classification model from a stream of items. In this case, the classification model evolves as new documents are used to train the classifier. Hence, the classification result for an item can change: an item classified at different moments can receive different labels.
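The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not part of S-HMC: the `TermCountClassifier` class, its `min_count` parameter, and the sample items are hypothetical. The batch model is frozen after the initial training set, while the incremental model keeps learning from the stream.

```python
from collections import Counter, defaultdict


class TermCountClassifier:
    """Toy multi-label classifier (hypothetical): a label is assigned when the
    item contains a term seen at least `min_count` times for that label."""

    def __init__(self, min_count=2):
        self.min_count = min_count
        self.term_counts = defaultdict(Counter)  # label -> term frequencies

    def learn(self, terms, labels):
        # Counts accumulate, so the same method serves batch and stream training.
        for label in labels:
            self.term_counts[label].update(terms)

    def classify(self, terms):
        return {
            label
            for label, counts in self.term_counts.items()
            if any(counts[term] >= self.min_count for term in terms)
        }


# Hypothetical labelled items: (terms, labels).
TRAINING_SET = [
    (["goal", "match"], {"sport"}),
    (["goal", "team"], {"sport"}),
    (["market", "stock"], {"finance"}),
    (["market", "bank"], {"finance"}),
]
STREAM = [
    (["transfer", "market", "team"], {"sport"}),
    (["transfer", "team"], {"sport"}),
]

batch_model = TermCountClassifier()
incremental_model = TermCountClassifier()
for terms, labels in TRAINING_SET:
    batch_model.learn(terms, labels)
    incremental_model.learn(terms, labels)

item = ["transfer", "market"]
labels_before = incremental_model.classify(item)

# Only the incremental model sees the stream.
for terms, labels in STREAM:
    incremental_model.learn(terms, labels)

batch_labels = batch_model.classify(item)              # unchanged over time
incremental_labels = incremental_model.classify(item)  # may differ from before
print(batch_labels, labels_before, incremental_labels)
```

In this sketch the same item keeps its original labels under the batch model, whereas the incremental model assigns it an additional label once the stream has shifted the term statistics.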

Learning classification

The Semantic HMC is composed of five individually scalable steps to reach the aims of Big Data analytics:

[Figure: SHMCAchi]

Indexation extracts terms from data items and creates an index of data items.
Vectorization calculates the term-frequency vectors of the indexed items.
Hierarchization creates the label taxonomy (i.e. subsumption hierarchy) using term-frequency vectors.
Resolution describes taxonomy concepts using relevant terms and creates the reasoning rules to classify data items with labels based on term-frequency vectors.
Realization populates the ontology with items and then determines, for each item, the most specific label and all its subsuming labels.
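The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the real process describes the model in an ontology and classifies with a web reasoner, whereas this sketch uses dictionaries and sets. The sample items, the `MIN_ITEMS` and `THRESHOLD` values, the co-occurrence-based subsumption test, and the "contains a relevant term" rule are all hypothetical choices.

```python
from collections import Counter

ITEMS = {  # hypothetical data items
    "d1": "sport football goal",
    "d2": "sport football team",
    "d3": "sport tennis racket",
    "d4": "sport tennis",
    "d5": "sport football goal",
    "d6": "market stock",
}
MIN_ITEMS = 2    # assumed: a term becomes a label if it occurs in >= 2 items
THRESHOLD = 0.8  # assumed co-occurrence threshold for hierarchy and rules

# 1. Indexation: extract terms and build an index of data items.
index = {item_id: text.lower().split() for item_id, text in ITEMS.items()}

# 2. Vectorization: one term-frequency vector per indexed item.
vectors = {item_id: Counter(terms) for item_id, terms in index.items()}

# Items in which each term occurs, derived from the vectors.
term_items = {}
for item_id, vector in vectors.items():
    for term in vector:
        term_items.setdefault(term, set()).add(item_id)


def cond_prob(x, y):
    """Estimate P(x | y): share of the items containing y that also contain x."""
    return len(term_items[x] & term_items[y]) / len(term_items[y])


# 3. Hierarchization: x subsumes y when x almost always co-occurs with y,
#    but not the other way round (an assumed subsumption test).
labels = {t for t, ids in term_items.items() if len(ids) >= MIN_ITEMS}
subsumers = {
    y: {
        x for x in labels
        if x != y
        and cond_prob(x, y) >= THRESHOLD
        and cond_prob(y, x) < cond_prob(x, y)
    }
    for y in labels
}

# 4. Resolution: describe each label by the terms that reliably indicate it.
#    Assumed rule: an item containing a relevant term receives the label.
rules = {
    label: {t for t in term_items if cond_prob(label, t) >= THRESHOLD}
    for label in labels
}


# 5. Realization: apply the rules, add all subsuming labels, and keep the
#    labels that subsume no other assigned label as the most specific ones.
def realize(vector):
    assigned = {
        label for label, terms in rules.items()
        if any(vector[t] > 0 for t in terms)
    }
    for label in list(assigned):
        assigned |= subsumers[label]
    most_specific = {
        l for l in assigned if not any(l in subsumers[m] for m in assigned)
    }
    return most_specific, assigned


classified = {item_id: realize(v) for item_id, v in vectors.items()}
new_item = realize(Counter("goal keeper".split()))
print(new_item)
```

Each step only reads the output of the previous one, which is what would make the steps individually scalable in a distributed setting; the sketch itself runs on a single machine.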

Adaptive process according to a stream of items

The adaptive process of the classification model is the set of activities needed to keep the classification model consistent with the data stream, independently of the particular modification and model change types. The process uses a single-model incremental approach [Hulten et al., 2001, Wang, 2006] in which the classification model is incrementally adapted to new data.
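A single-model incremental approach can be sketched as one model whose statistics are updated item by item, with the resulting hierarchy changes reported after each update. The sketch below is hypothetical: the `IncrementalHierarchy` class, the co-occurrence-based subsumption test, and the `threshold` value are assumptions for illustration and do not reproduce the ontology evolution used in the thesis.

```python
from collections import Counter
from itertools import combinations


class IncrementalHierarchy:
    """One model whose co-occurrence statistics are updated per item."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.term_count = Counter()  # term -> number of items containing it
        self.pair_count = Counter()  # frozenset({a, b}) -> co-occurrence count

    def add_item(self, terms):
        """Fold one item into the model; return (added, removed) relations."""
        before = self.relations()
        unique = set(terms)
        self.term_count.update(unique)
        self.pair_count.update(
            frozenset(pair) for pair in combinations(sorted(unique), 2)
        )
        after = self.relations()
        return after - before, before - after

    def relations(self):
        """Set of (broader, narrower) pairs implied by the current counts."""
        found = set()
        for pair, both in self.pair_count.items():
            a, b = tuple(pair)
            for x, y in ((a, b), (b, a)):
                p_x_given_y = both / self.term_count[y]
                p_y_given_x = both / self.term_count[x]
                if p_x_given_y >= self.threshold and p_y_given_x < p_x_given_y:
                    found.add((x, y))
        return found


# Hypothetical stream of items, each given as a list of terms.
STREAM = [
    ["sport", "football"],
    ["sport", "tennis"],
    ["football", "goal"],
]

model = IncrementalHierarchy()
changes = [model.add_item(terms) for terms in STREAM]
for added, removed in changes:
    print("added:", sorted(added), "removed:", sorted(removed))
```

For clarity the sketch recomputes every relation after each item; a scalable variant would re-evaluate only the term pairs touched by the new item. The third item shows the non-stationary case: a relation that held earlier is withdrawn while a new one appears.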

[Figure: SHMCmaintenance]

Publications

Adaptive learning process for the evolution of ontology-described classification model in big data context, R Peixoto, C Cruz, N Silva, SAI Computing Conference (SAI), 2016, 532-540
An unsupervised classification process for large datasets using web reasoning, R Peixoto, T Hassan, C Cruz, A Bertaux, N Silva, Proceedings of the International Workshop on Semantic Big Data, 9
Analyse Sémantique du Big Data par Classification Hiérarchique Multi-Label, H Thomas, R Peixoto, C Cruz, A Bertaux, N Silva, Extraction et Gestion des Connaissances
Extraction de la Valeur des données du Big Data par classification multi-label hiérarchique sémantique, T Hassan, R Peixoto, C Cruz, A Bertaux, N Silva, 12ème atelier sur la Fouille de Données Complexes (FDC)
Hierarchical Multi-Label Classification Using Web Reasoning for Large Datasets, R Peixoto, T Hassan, C Cruz, A Bertaux, N Silva, Open Journal of Semantic Web (OJSW) 3 (1), 1-15, 2016
Semantic HMC for big data analysis, T Hassan, R Peixoto, C Cruz, A Bertaux, N Silva, Big Data (Big Data), 2014 IEEE International Conference on, 26-28, 2014
Semantic HMC for business intelligence using cross-referencing, R Peixoto, H Thomas, C Cruz, A Bertaux, N Silva, 14th International Conference on Informatics in Economy, 2015
Semantic HMC: a predictive model using multi-label classification for big data, R Peixoto, T Hassan, C Cruz, A Bertaux, N Silva, Trustcom/BigDataSE/ISPA, 2015 IEEE 2, 173-179, 2015
Semantic HMC: Ontology-Described Hierarchy Maintenance in Big Data Context, R Peixoto, C Cruz, N Silva, OTM Confederated International Conferences" On the Move to Meaningful, 2015

Partners

Prof. Nuno Silva, ISEP - Instituto Superior de Engenharia do Porto, Portugal