Conference papers

Blog annotation: from corpus analysis to automatic tag suggestion

Abstract : Blogs now reach a large audience and have risen from the underground to become part of mainstream media. They contain information on diverse topics, personal opinions, and discussions between bloggers and readers. Tags and categories are structural elements of a blog post that increase the blog's visibility and enhance navigation and searching within the blog history. We hypothesize that these annotations are made on subjective grounds rather than in a systematic way. Although tools exist to help bloggers tag and categorize their posts, we still do not know to what extent these tools take into account information contained in previous posts. This paper presents an 11-million-word corpus of blog posts in French dedicated to studying these questions, and an experiment in tag and category prediction. Preliminary results show that around 27% of the overall tags can be predicted from lexical frequency analysis of blog posts. However, a first comparison experiment with an existing tag suggestion tool shows that a significant proportion of the tags used for blog description are not present in the blog post. This shows that tag suggestion tools should exploit the diachronic analysis of blogs.
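To make the two measurements mentioned in the abstract more concrete, here is a minimal Python sketch of frequency-based tag suggestion and of checking how many of an author's tags literally occur in the post. It is not the authors' method: the function names, the tiny stopword list, the length filter, and the sample post are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Tiny illustrative French stopword list (an assumption, not from the paper).
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "à", "en", "pour", "avec", "dans"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def suggest_tags(post, k=3):
    """Return the k most frequent tokens that are not stopwords
    and are longer than two characters, as candidate tags."""
    counts = Counter(
        t for t in tokenize(post) if t not in STOPWORDS and len(t) > 2
    )
    return [word for word, _ in counts.most_common(k)]


def share_of_tags_in_text(post, tags):
    """Fraction of the given tags that occur as tokens in the post."""
    if not tags:
        return 0.0
    tokens = set(tokenize(post))
    return sum(1 for tag in tags if tag.lower() in tokens) / len(tags)


# Hypothetical sample post and author tags, invented for this sketch.
post = ("La recette de la tarte aux pommes : une tarte simple "
        "avec des pommes et du sucre. Tarte !")
print(suggest_tags(post, k=2))                           # ['tarte', 'pommes']
print(share_of_tags_in_text(post, ["tarte", "cuisine"]))  # 0.5
```

In this toy example the tag "cuisine" never appears in the post, so a purely lexical suggester cannot recover it. This is the kind of gap the abstract points to when it says that many tags are absent from the post text.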
Contributor : François Lévy
Submitted on : Wednesday, August 31, 2016 - 2:54:41 PM
Last modification on : Thursday, March 31, 2022 - 4:08:02 PM


  • HAL Id : hal-01358328, version 1


Ivan Garrido-Marquez, Jorge J. García Flores, François Lévy, Adeline Nazarenko. Blog annotation: from corpus analysis to automatic tag suggestion. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016), Pascale Fung; Tomas Mikolov; Simone Teufel; Piek Vossen, Apr 2016, Konya, Turkey. ⟨hal-01358328⟩


