CSC @ Google Summer of Code (GSoC)

Corporate Smart Content is once again actively involved in the DBpedia project in this year's Google Summer of Code. Our two contributions:
Wojciech Lukasiewicz - Combining DBpedia and Topic Modelling
Vincent Bohlen: A Hybrid Classifier/Rule-based Event Extractor for DBpedia

Wojciech Lukasiewicz - Combining DBpedia and Topic Modelling

DBpedia is a crowd-sourced, open-source community project that extracts content from Wikipedia and stores it in a large RDF graph. DBpedia Spotlight is a tool that returns the DBpedia resources mentioned in a given document.
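To make the annotation step concrete, here is a minimal sketch of how a document could be sent to DBpedia Spotlight and reduced to a list of resource URIs. It assumes the public Spotlight web service at the URL below and a JSON reply containing a `Resources` list whose items carry an `@URI` key. The helper names (`build_request`, `extract_uris`, `annotate`) are made up for illustration and are not part of either project.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia Spotlight endpoint for English text.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def build_request(text, confidence=0.5):
    """Build a GET request asking Spotlight for JSON annotations of `text`."""
    query = urllib.parse.urlencode({"text": text, "confidence": confidence})
    return urllib.request.Request(
        f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"}
    )


def extract_uris(response):
    """Return the DBpedia URIs from a decoded Spotlight JSON reply.

    Assumption: annotations are listed under "Resources", each with "@URI".
    A reply without annotations yields an empty list.
    """
    return [resource["@URI"] for resource in response.get("Resources", [])]


def annotate(text, confidence=0.5):
    """Call the service (needs network access) and return the mentioned URIs."""
    with urllib.request.urlopen(build_request(text, confidence)) as reply:
        return extract_uris(json.load(reply))
```

Calling `annotate("Berlin is the capital of Germany.")` would then return the URIs of the resources Spotlight recognises in that sentence, provided the service is reachable.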

Using DBpedia Spotlight to extract and disambiguate named entities from Wikipedia articles, and then applying a topic modelling algorithm (e.g. LDA) with the URIs of DBpedia resources as features, would yield a model that describes each document by the proportions of the topics it covers. Because the topics are themselves represented by DBpedia URIs, this approach could also produce a novel RDF hierarchy and ontology, offering insights for further analysis of the emerging subgraphs.
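The idea of running LDA over URIs instead of words can be sketched with a small collapsed Gibbs sampler. This is only an illustration under stated assumptions: the project may well use an existing LDA library instead, the function name `lda_gibbs` and its hyperparameter defaults are invented here, and the three example documents are hand-written stand-ins for real Spotlight output (with `dbr:` abbreviating `http://dbpedia.org/resource/`).

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each document is a list of URIs."""
    rng = random.Random(seed)
    vocab_size = len({uri for doc in docs for uri in doc})
    # Count tables: per-document topic counts, per-topic URI counts, topic sizes.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_uri = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every URI occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for uri, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_uri[z][uri] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, uri in enumerate(doc):
                # Take the current assignment out of the counts ...
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_uri[z][uri] -= 1
                topic_total[z] -= 1
                # ... and resample it from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_uri[k][uri] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_uri[z][uri] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document; each row sums to 1.
    theta = [
        [(count + alpha) / (len(doc) + alpha * num_topics) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_uri


# Hand-written example input, not real annotation results.
docs = [
    ["dbr:Berlin", "dbr:Germany", "dbr:Angela_Merkel", "dbr:Berlin"],
    ["dbr:Python_(programming_language)", "dbr:Machine_learning", "dbr:Algorithm"],
    ["dbr:Germany", "dbr:Machine_learning", "dbr:Berlin"],
]
theta, topic_uri = lda_gibbs(docs, num_topics=2)
```

Here `theta[d]` holds the topic proportions of document `d`, and `topic_uri[k]` counts which DBpedia URIs were assigned to topic `k`. Because each topic is a weighted set of URIs, it can be linked back into the RDF graph, which is what makes the subgraph analysis described above possible.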

The direct implication and first application scenario for this project would be to use the inference engine in DBpedia Spotlight as an additional step that predicts a document's topic coverage after the document has been annotated.

Vincent Bohlen: A Hybrid Classifier/Rule-based Event Extractor for DBpedia Proposal

The amount of information published on the internet is growing beyond what can be processed manually. Humans can no longer gather all the available information by hand and depend more and more on machines that collect relevant information automatically. This is why automatic information extraction, and automatic event extraction in particular, is important.

In this project I will implement a system for event extraction that uses both classification and rule-based event extraction. The underlying data for both approaches will be identical. I will gather Wikipedia articles and perform a variety of NLP tasks on the extracted texts. First, I will annotate the named entities in the text using named entity recognition performed by DBpedia Spotlight. Additionally, I will annotate the text with frame semantics using FrameNet frames. I will then use the collected information (frames, entities and entity types) with the two methods mentioned above to decide whether the collection is an event or not.
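The two decision methods can be illustrated with a small sketch that applies a hand-written rule and a trained classifier to the same candidate. Everything concrete here is an assumption made for illustration: the `Candidate` structure, the choice of event-evoking frames in `EVENT_FRAMES`, the use of a simple perceptron as the classifier, and the four toy training examples are not taken from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One collection of annotations extracted from a piece of text."""
    frames: set        # FrameNet frame names found in the text
    entity_types: set  # types of the entities annotated by DBpedia Spotlight


# Assumption: frames treated as event-evoking by the hand-written rule.
EVENT_FRAMES = {"Attack", "Arriving", "Being_born"}


def rule_is_event(candidate):
    """Rule: an event needs an event-evoking frame and at least one entity."""
    return bool(candidate.frames & EVENT_FRAMES) and bool(candidate.entity_types)


def features(candidate):
    """Turn frames and entity types into binary indicator features."""
    return {f"frame={f}" for f in candidate.frames} | {
        f"type={t}" for t in candidate.entity_types
    }


def score(candidate, weights, bias):
    return bias + sum(weights.get(f, 0.0) for f in features(candidate))


def train_perceptron(examples, epochs=10):
    """Learn feature weights from (candidate, is_event) pairs."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for candidate, label in examples:
            if (score(candidate, weights, bias) > 0) != label:
                # Move the weights towards the correct label on a mistake.
                step = 1.0 if label else -1.0
                bias += step
                for f in features(candidate):
                    weights[f] = weights.get(f, 0.0) + step
    return weights, bias


def classifier_is_event(candidate, weights, bias):
    return score(candidate, weights, bias) > 0


# Toy training data, invented for this sketch.
training = [
    (Candidate({"Attack"}, {"dbo:Person", "dbo:Place"}), True),
    (Candidate({"Statement"}, {"dbo:Person"}), False),
    (Candidate({"Arriving"}, {"dbo:Person", "dbo:Place"}), True),
    (Candidate({"Statement"}, {"dbo:Organisation"}), False),
]
weights, bias = train_perceptron(training)

candidate = training[0][0]
decisions = {
    "rule": rule_is_event(candidate),
    "classifier": classifier_is_event(candidate, weights, bias),
}
```

Both methods read the same `Candidate`, which mirrors the statement that the underlying data is identical for the two approaches. How the two verdicts are combined into a single hybrid decision is left open here, since the proposal does not specify it.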