Topic/S – A Topic and Trend Recognition Approach in News Media
Abstract: The research project Topic/S is geared towards topic and trend recognition by means of a continuous semantic analysis of news content from publishing houses. We aim to replace the user-initiated, keyword-based search with a system-generated list of topics and trends in order to reduce search efforts in the daily business of editors. The foundation for identifying a topic is semantic information extraction using NER, LOD resources, and clustering. We iteratively evaluate our approach with the help of a prototypical, widget-based user interface and have received initial positive feedback from associated partners.
Today, the investigation effort in online and print newsrooms is enormous due to the high number of incoming agency reports, information from social media, and other channels. For example, the text input from the news agencies of an associated partner, which provides our test data, averages 2,000 items per day. It therefore becomes more and more difficult to detect current topics and to track their development. Thus, in our research project Topic/S we strive for a semi-automatic topic and trend detection that current keyword-based search tools do not provide. To achieve this goal, we solved three tasks: 1) high-quality content analysis through the use of LOD resources, 2) the modeling of an appropriate media ontology, and 3) the clustering of documents to allow for topic and trend recognition.
In order to extract structured information from the received media objects, we evaluated tools for Named Entity Recognition (NER) and Named Entity Disambiguation (NED). A variety of NER and NED tools exist, but the recognition quality of professional solutions for the German language is not sufficient for our business use case. Thus, we decided to employ a combination of a statistical NER tool (Stanford NER) and a dictionary-based approach (LingPipe), which ensures the detection of important entities and keywords.
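The following minimal Python sketch illustrates how such a combination could look. Since both Stanford NER and LingPipe are Java libraries, NLTK's StanfordNERTagger wrapper stands in for the former and a plain gazetteer lookup for the latter; the model and jar paths as well as the gazetteer contents are assumptions, not the project's actual setup.

```python
# Minimal sketch of the combined NER approach: a statistical tagger plus a
# dictionary pass. Stanford NER and LingPipe are Java libraries, so NLTK's
# wrapper stands in for the former and a plain gazetteer lookup for the
# latter. Model/jar paths and gazetteer contents are assumptions.
from nltk.tag import StanfordNERTagger

# German CRF model shipped with Stanford NER (paths are placeholders).
tagger = StanfordNERTagger(
    'classifiers/german.conll.hgc_175m_600.crf.ser.gz',
    'stanford-ner.jar',
    encoding='utf-8')

# Hypothetical gazetteer, filled from the semantic repository described
# below; it covers locally known entities the statistical model misses.
GAZETTEER = {
    'Fink & Partner': 'ORGANIZATION',
    'Striezelmarkt': 'LOCATION',
}

def extract_entities(text):
    # 1) Statistical pass: token-level labels such as PERSON or LOCATION.
    entities = {token: tag for token, tag in tagger.tag(text.split())
                if tag != 'O'}
    # 2) Dictionary pass: exact surface matches ensure that important
    #    known entities are always detected.
    for surface, tag in GAZETTEER.items():
        if surface in text:
            entities[surface] = tag
    return entities
```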
To fill our dictionaries, we investigated the possibilities of the LOD cloud. Since we identified several issues with public SPARQL endpoints, e.g., availability, performance, or inference support, we set up our own semantic repository, which hosts, in particular, parts of the YAGO2, DBpedia, and GeoNames data sets. This approach allows us to correct errors during data migration as well as to integrate new entities, for instance persons who are only known locally and therefore often missing from the LOD cloud.
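As a sketch of how such a self-hosted repository could feed the dictionaries, the following snippet queries a local SPARQL endpoint for German person labels using SPARQLWrapper; the endpoint URL is a placeholder for the local store.

```python
# Minimal sketch: fill an entity dictionary from the self-hosted semantic
# repository that mirrors parts of YAGO2, DBpedia, and GeoNames. The
# endpoint URL is a placeholder for the local store.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://localhost:8890/sparql')  # hypothetical endpoint
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        ?person a <http://dbpedia.org/ontology/Person> ;
                rdfs:label ?label .
        FILTER (lang(?label) = "de")
    }
""")

# German person labels become dictionary entries for the NER stage.
person_dictionary = {
    binding['label']['value']
    for binding in sparql.query().convert()['results']['bindings']
}
```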
The facts generated in our workflow, e.g., topics, trends, and their interlinkage, are stored in a separate semantic repository based on Oracle 11g, which allows for continuous (graph-based) data mining. Currently, it comprises about 10.3 million triples for our test data set of about 200,000 articles. Storing this information required a dedicated ontological structure, but after evaluating existing media ontologies, such as the Ontology for Media Resources or SNaP, it became clear that these are not sufficient for our newsroom use case. Thus, we designed a new media ontology comprising concepts to describe the life cycle of media objects and news items. To provide interoperability, it integrates existing concepts from Schema.org and IPTC rNews. Available and established thesauri in the publishing group were integrated on the basis of SKOS.
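The following rdflib sketch shows what such triples could look like. The Topic/S ontology itself is not published in this abstract, so the resource URIs and the exact choice of Schema.org and SKOS properties here are purely illustrative.

```python
# Minimal sketch: describe a news item and its detected topic with triples
# that reuse Schema.org and SKOS, as the Topic/S media ontology does.
# All resource URIs are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

SCHEMA = Namespace('http://schema.org/')

g = Graph()
g.bind('schema', SCHEMA)
g.bind('skos', SKOS)

article = URIRef('http://example.org/articles/4711')       # placeholder URI
topic = URIRef('http://example.org/topics/energiewende')   # placeholder URI

g.add((article, RDF.type, SCHEMA.NewsArticle))   # concept reused from Schema.org
g.add((article, SCHEMA.headline, Literal('Beispielartikel', lang='de')))
g.add((article, SCHEMA.about, topic))            # link article to detected topic
g.add((topic, RDF.type, SKOS.Concept))           # thesauri integrated via SKOS
g.add((topic, SKOS.prefLabel, Literal('Energiewende', lang='de')))

print(g.serialize(format='turtle'))
```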
For the detection of topics, we developed a hierarchical clustering algorithm that assesses the thematic similarity of content and groups it accordingly. It employs weighted relations between news items based on both named entities and qualified keywords, which are also used to generate meaningful topic labels (see the sketch below). This approach differs from choosing a representative member for each news topic, as used, e.g., within Google News. Due to the ongoing analysis of incoming articles, a topic profile is created, which allows for performing a trend analysis on each topic. Thus, we can determine whether a topic is currently a major issue or not.

Within our talk we will give a detailed technical view of our approach and discuss design decisions. All in all, we demonstrate the beneficial combination of existing information extraction tools and LOD in the domain of news content and show how these results can be leveraged for semi-automatic topic and trend recognition.
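To make the clustering idea concrete, here is a minimal sketch: articles are compared via a weighted overlap of their entity and keyword sets and grouped by average-linkage hierarchical clustering, and the topic profile is reduced to a trivial volume check. The weights, the distance threshold, and the trend factor are illustrative assumptions, not the project's actual parameters.

```python
# Minimal sketch of the clustering idea, not the actual Topic/S algorithm:
# articles are compared via a weighted overlap (Jaccard) of their named
# entities and qualified keywords, then grouped by average-linkage
# hierarchical clustering. All weights and thresholds are assumptions.
from scipy.cluster.hierarchy import fcluster, linkage

articles = [
    {'entities': {'Merkel', 'Berlin'}, 'keywords': {'Energiewende'}},
    {'entities': {'Merkel'},           'keywords': {'Energiewende', 'Atomkraft'}},
    {'entities': {'Bayern München'},   'keywords': {'Bundesliga'}},
]

W_ENTITY, W_KEYWORD = 0.7, 0.3  # assumed weighting of the two feature types

def similarity(a, b):
    def jaccard(x, y):
        return len(x & y) / len(x | y) if x | y else 0.0
    return (W_ENTITY * jaccard(a['entities'], b['entities'])
            + W_KEYWORD * jaccard(a['keywords'], b['keywords']))

# Condensed pairwise distance matrix (1 - similarity) for scipy's linkage.
n = len(articles)
dist = [1.0 - similarity(articles[i], articles[j])
        for i in range(n) for j in range(i + 1, n)]

topics = fcluster(linkage(dist, method='average'), t=0.6, criterion='distance')
print(topics)  # e.g. [1 1 2]: the two Energiewende articles form one topic

# Topic profile: article counts per day; a topic is "trending" when today's
# volume clearly exceeds its running average (the factor is an assumption).
def is_trending(daily_counts, factor=2.0):
    history, today = daily_counts[:-1], daily_counts[-1]
    return bool(history) and today > factor * (sum(history) / len(history))

print(is_trending([3, 2, 4, 12]))  # True: today's 12 articles mark a spike
```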
Presenter CV: I finished studying Media Computer Science at TU Dresden in 2012. Afterwards, I started working on the Topic/S research project as a software developer at Fink & Partner Media Services GmbH.