Blog annotation: from corpus analysis to automatic tag suggestion

Ivan Garrido-Marquez, Jorge J. García Flores 1, François Lévy 1, Adeline Nazarenko 1
1 RCLN
LIPN - Laboratoire d'Informatique de Paris-Nord
Abstract: Blogs now reach a large audience and have risen from the underground to become part of mainstream media. Blogs contain information on diverse topics, personal opinions, and discussions between bloggers and readers. Tags and categories are structural elements of a blog post that increase the blog's visibility and make it easier to navigate and search the blog's history. We hypothesize that these annotations are made on subjective grounds rather than systematically. Although tools exist to help bloggers tag and categorize their posts, it is still unknown to what extent these tools take into account information contained in previous posts. This paper presents an 11-million-word corpus of French blog posts built to study these questions, and an experiment in tag and category prediction. Preliminary results show that around 27% of all tags can be predicted from lexical frequency analysis of blog posts. However, a first comparative experiment with an existing tag suggestion tool shows that a substantial proportion of the tags used to describe a blog do not appear in the blog post itself. This shows that tag suggestion tools should exploit the diachronic analysis of blogs.
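The abstract does not describe how the lexical frequency analysis is carried out, so the following is only a minimal sketch of what frequency-based tag suggestion could look like, not the authors' method. The function name `suggest_tags`, the tiny `STOPWORDS` list, and the restriction to single-word tags are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny French stop-word list (illustration only).
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "au", "en"}


def suggest_tags(post_text, known_tags, top_n=3):
    """Rank tags already used on the blog by their frequency in a new post.

    Assumption: tags are single words and are matched case-insensitively.
    Tags that never occur in the post text are not suggested.
    """
    tokens = [t for t in re.findall(r"\w+", post_text.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    scored = [(counts[tag.lower()], tag)
              for tag in known_tags if counts[tag.lower()] > 0]
    # Highest frequency first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tag for _, tag in scored[:top_n]]
```

Under these assumptions, a tag such as "voyage" that the blogger has used before but that does not occur in the new post can never be suggested. This is the limitation the abstract points to when it notes that many tags are absent from the post text.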
Document type:
Conference paper
17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016), Apr 2016, Konya, Turkey

https://hal-auf.archives-ouvertes.fr/hal-01358328
Contributor: François Lévy
Submitted on: Wednesday, 31 August 2016 - 14:54:41
Last modified on: Thursday, 11 January 2018 - 06:26:42

Identifiers

  • HAL Id : hal-01358328, version 1

Citation

Ivan Garrido-Marquez, Jorge J. García Flores, François Lévy, Adeline Nazarenko. Blog annotation: from corpus analysis to automatic tag suggestion. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016), Apr 2016, Konya, Turkey. 〈hal-01358328〉
