Semantic Web Publications - Texts as Data Services (Severi)
Project Goals
The project develops automatic annotation technology and tools by which texts can be transformed into
Linked Data services. The methods are tested and evaluated in practice by developing application demonstrators on top of the data services in four case study areas and two separately funded ones:
- Legal texts in the context of the Semantic Finlex project
- Norms in use in the construction industry
- Business news about law and technology innovations
- Publishing biographical materials on the Semantic Web
- Semantic media tracking in news, funded separately by VTS foundation
- Improving findability in web marketing using Schema.org, funded separately by VTS foundation
Research Plan
The project runs from September 1, 2016 to May 31, 2018.
More detailed materials about the project and its results will be published on this home page later.
An abstract of the project, translated from Finnish:
The WWW is changing from a traditional platform for publishing documents (Web of Documents) into a platform for publishing data (Web of Data). The idea is to publish media content on the Web not only as human-readable text but also as machine-readable data, which enables application development and the creation of added value through new kinds of service concepts and business models. The technological challenge, however, is structuring textual materials into data, which requires a multidisciplinary combination of language technology and Semantic Web technology.
The Severi project creates an open technological foundation and a collaboration network for publishing text-based web content as semantic data services. The research is carried out through case studies of the company consortium participating in the project, with legal materials, construction industry norms, news, and e-books as application areas. The project results are published as web services and under an open license to maximize their use in Finland. The project also involves an extensive international collaboration network of top universities.
Consortium
The project consortium includes the following organizations:
- Aalto University, Department of Computer Science
- Edita Publishing Ltd
- CSC Ltd
- Heldig - Helsinki Centre for Digital Humanities
- Lingsoft Ltd
- Ministry of Justice
- Building Information Group Ltd
- Finnish Literature Society (SKS)
- Svenska litteratursällskapet i Finland (SLS)
- Tekniikan akateemiset TEK
- YLE Ltd
Thanks to Tekes for making the project financially possible.
The project Steering Group includes the following representatives:
Sari Korhonen (Edita),
Pirjo-Leena Forsström (CSC),
Tiina Lindh-Knuutila and Juhani Reiman (Lingsoft),
Aki Hietanen (Ministry of Justice),
Jouko Kanerva (Building Information Group),
Kirsi Keravuori (SKS),
Karola Söderman (SLS),
Pekka Pellinen (TEK),
Pia Virtanen (YLE), and
Eero Hyvönen (Aalto).
Aki Parviainen is the project representative at Tekes.
Contact Person
Prof. Eero Hyvönen, Director, Aalto University and University of Helsinki, Heldig
Publications
2023
Minna Tamper, Petri Leskinen, Eero Hyvönen, Risto Valjus and Kirsi Keravuori:
Analyzing Biography Collection Historiographically as Linked Data: Case National Biography of Finland. Semantic Web – Interoperability, Usability, Applicability, vol. 14, no. 2, pp. 385-419, IOS Press, 2023.
2021
Minna Tamper, Eero Hyvönen and Petri Leskinen:
Visualizing and Analyzing Networks of Named Entities in Biographical Dictionaries for Digital Humanities Research.
Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2019), Springer-Verlag, October, 2021. Forthcoming.
This paper shows how named entity extraction and network analysis can be used to examine biographies individually and in groups to aid historians in biographical and prosopographical research. For this purpose a reference network of 13 100 biographies in the collections of the Biographical Centre of the Finnish Literature Society was created, based on links between the biographies as well as automatically extracted named entities found in the texts. The data was published in a SPARQL endpoint as a Linked Data knowledge graph, on top of which network analytic tools were created and analyses were carried out, showing the usefulness of the approach in Digital Humanities. The reference graph has been utilized for network analysis to examine egocentric networks of individual persons as well as networks among groups of people in prosopography. The data and tools presented have been in use since autumn 2018 in the semantic portal BiographySampo, which has had tens of thousands of users.
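The notion of an egocentric network mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sample links, and the choice to treat references between biographies as undirected are all assumptions made for illustration.

```python
from collections import defaultdict

def build_graph(links):
    """Build an undirected adjacency map from (source, target) reference pairs."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph[target].add(source)
    return graph

def ego_network(graph, ego):
    """Return the edges among the ego and its direct neighbours."""
    nodes = {ego} | graph.get(ego, set())
    return {
        tuple(sorted((a, b)))
        for a in nodes
        for b in graph.get(a, set())
        if b in nodes
    }

# Hypothetical reference links between biographies (placeholder identifiers).
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
graph = build_graph(links)
edges = ego_network(graph, "A")
```

Here the egocentric network of "A" contains only the links among "A", "B", and "C"; the more distant nodes "D" and "E" are left out.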
2019
Arttu Oksanen, Jouni Tuominen, Eetu Mäkelä, Minna Tamper, Aki Hietanen and Eero Hyvönen:
Semantic Finlex: Transforming, Publishing, and Using Finnish Legislation and Case Law As Linked Open Data on the Web.
Knowledge of the Law in the Big Data Age (G. Peruginelli and S. Faro (eds.)), Frontiers in Artificial Intelligence and Applications, vol. 317, pp. 212-228, IOS Press, 2019. ISBN 978-1-61499-984-3 (print); ISBN 978-1-61499-985-0 (online).
Governments publish legislation and case law widely in print and on the Web. Such legal information is provided for human consumption, but the information is usually not available as data for algorithmic analysis and applications to use. However, this would be beneficial in many use cases, such as building more intelligent juridical online services and conducting research into legislation and legal practice. To address these needs, this chapter presents Semantic Finlex, a national in-use data resource and service for publishing Finnish legislation and related case law as Linked Open Data for legal applications to use. The system transforms and interlinks on a regular basis data from the legacy legal database Finlex of the Ministry of Justice into Linked Open Data, based on the European standards ECLI and ELI. The published data is hosted on the 7-star Linked Data Finland service and SPARQL endpoint with a variety of related services available that ease data re-use. Rich Internet Applications using SPARQL for data access are presented as application demonstrators of the data service. In addition, this chapter presents methods and tools under development to automatically annotate legal texts and to anonymize case law documents prior to their publication on the Web. Anonymization is necessary due to issues of data protection and privacy, and annotation is needed for semantic search and interlinking the documents. The automated approaches could significantly speed up the process and minimize costs of publishing legal documents as Linked Open Data.
Agata Dominowska, Elsi Hyttinen, Peter Ivanics, Mikko Koho, Ilona Pikkanen and Risto Turunen:
Hiding in Plain Sight: Poetry in Newspapers and How to Approach it. Human IT: Journal for Information Technology Studies as a Human Science, vol. 14, no. 2, pp. 145-171, University of Borås, July, 2019.
Petri Leskinen and Eero Hyvönen:
Extracting Genealogical Networks of Linked Data from Biographical Texts.
The Semantic Web: ESWC 2019 Satellite Events (Hitzler, P., Kirrane, S., Hartig, O., de Boer, V., Vidal, M.-E., Maleshkova, M., Schlobach, S., Hammar, K., Lasierra, N., Stadtmüller, S., Hose, K., Verborgh, R. (eds.)), pp. 121-125, Springer, June, 2019.
Eero Hyvönen, Petri Leskinen, Minna Tamper, Heikki Rantala, Esko Ikkala, Jouni Tuominen and Kirsi Keravuori:
BiographySampo - Publishing and Enriching Biographies on the Semantic Web for Digital Humanities Research.
The Semantic Web. ESWC 2019 (Pascal Hitzler, Miriam Fernández, Krzysztof Janowicz, Amrapali Zaveri, Alasdair J.G. Gray, Vanessa Lopez, Armin Haller and Karl Hammar (eds.)), pp. 574-589, Springer-Verlag, June, 2019.
Eero Hyvönen, Petri Leskinen, Minna Tamper, Heikki Rantala, Esko Ikkala, Jouni Tuominen and Kirsi Keravuori:
Demonstrating BiographySampo in Solving Digital Humanities Research Problems in Biography and Prosopography.
The Fourth Digital Humanities in the Nordic Countries 2019 (DHN2019), Book of Abstracts, University of Copenhagen, Copenhagen, Denmark, March, 2019.
2018
Petri Leskinen, Eero Hyvönen and Jouni Tuominen:
Analyzing and Visualizing Prosopographical Linked Data Based on Biographies.
Proceedings of the Second Conference on Biographical Data in a Digital World 2017 (BD2017), vol. 2119, pp. 39-44, CEUR Workshop Proceedings, Linz, Austria, 2018.
This paper shows how faceted search on biographical data can be utilized as a flexible basis for filtering target groups of people and, in particular, how generic data analysis and visualization tools can then be applied for solving prosopographical research questions based on the filtered data. This idea is demonstrated and evaluated in practice by presenting two application case studies: 1) linked data extracted from a printed registry of over 10 000 alumni (1867–1992) of the prominent Finnish high school Norssi, and 2) a knowledge graph extracted from 13 000 short biographies of significant Finnish people (from the 3rd century to the present) in the National Biography of Finland. In both cases, the data is enriched by linking its entities with several external datasets.
Jouni Tuominen, Eero Hyvönen and Petri Leskinen:
Bio CRM: A Data Model for Representing Biographical Data for Prosopographical Research.
Proceedings of the Second Conference on Biographical Data in a Digital World 2017 (BD2017), vol. 2119, pp. 59-66, CEUR Workshop Proceedings, Linz, Austria, 2018.
Biographies make a promising application case of Linked Data: they can be used, e.g., as a basis for Digital Humanities research in prosopography and as a key data and linking resource in semantic Cultural Heritage (CH) portals. In both use cases, a semantic data model for harmonizing and interlinking heterogeneous data from different sources is needed. This paper presents such a data model, Bio CRM, with the following key ideas: 1) The model is a domain-specific extension of CIDOC CRM, making it applicable not only to biographical data but to other CH data, too. 2) The model makes a distinction between enduring unary roles of actors, their enduring binary relationships, and perduring events, where the participants can take different roles modeled as a role concept hierarchy. 3) The model can be used as a basis for semantic data validation and enrichment by reasoning. 4) The enriched data conforming to Bio CRM is intended to be used by SPARQL queries in flexible ways, using a hierarchy of roles in which participants can be involved in events.
Minna Tamper, Petri Leskinen, Kasper Apajalahti and Eero Hyvönen:
Using Biographical Texts as Linked Data for Prosopographical Research and Applications.
Digital Heritage. Progress in Cultural Heritage: Documentation, Preservation, and Protection. 7th International Conference, EuroMed 2018, Nicosia, Cyprus (Marinos Ioannides, Eleanor Fink, Raffaella Brumana, Petros Patias, Anastasios Doulamis, João Martins and Manolis Wallace (eds.)), pp. 125-137, Springer-Verlag, November, 2018.
Eero Hyvönen, Petri Leskinen, Minna Tamper, Heikki Rantala, Esko Ikkala, Jouni Tuominen and Kirsi Keravuori:
Biografiasammon tekoäly yhdistää ja rikastaa suomalaiset elämäkerrat semanttisessa webissä. Aalto University, Semantic Computing Research Group (SeCo), November, 2018.
The BiographySampo (Biografiasampo) system launches a new era in publishing and using biography collections on the Web. The core material of the system is the National Biography of Finland and other short biographies edited by the Finnish Literature Society (SKS) and learned societies, in total 13 100 life stories written by 900 Finnish researchers. The innovation of BiographySampo is to use language technology, artificial intelligence, and Semantic Web technologies to create, from the biography texts and related information in various sources, a knowledge graph and a national data infrastructure consisting of millions of links between pieces of information. The knowledge graph has been published in a linked data service, on top of which an intelligent web service, biografiasampo.fi, consisting of seven application views, has been implemented. The service is open and free of charge to all, for use by citizens and researchers in the digital humanities.
Arttu Oksanen, Jouni Tuominen, Eetu Mäkelä, Minna Tamper, Aki Hietanen, and Eero Hyvönen:
Semantic Finlex: Finnish Legislation and Case Law as a Linked Open Data Service.
Proceedings of Law via the Internet 2018 (LVI 2018), Knowledge of the Law in the Big Data Age, abstracts, Florence, Italy, October, 2018.
Eero Hyvönen, Petri Leskinen, Minna Tamper, Jouni Tuominen and Kirsi Keravuori:
Semantic National Biography of Finland.
Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference (DHN 2018), pp. 372-385, CEUR Workshop Proceedings, Vol-2084, Helsinki, Finland, March, 2018.
This paper presents the vision of publishing and utilizing textual biographies as Linked (Open) Data on the Semantic Web. As a case study, we publish the life stories of the National Biography of Finland, created by the Finnish Literature Society, as semantic, i.e., machine “understandable” metadata in a SPARQL endpoint using the Linked Data Finland (LDF.fi) service. On top of the data service, various Digital Humanities applications are built. The applications include searching and studying individual personal histories as well as historical research of groups of persons using methods of prosopography. The biographical data is enriched by extracting events from unstructured and semi-structured texts, and by linking entities internally and to external data sources. A faceted semantic search engine is provided for filtering groups of people from the data for prosopographical research. An extension of the event-based CIDOC CRM ontology is used as the underlying data model, where lives are seen as chains of interlinked events populated from the data of the biographies and additional data sources, such as museum collections, library databases, and archives.
2017
Petri Leskinen, Jouni Tuominen, Erkki Heino and Eero Hyvönen:
An Ontology and Data Infrastructure for Publishing and Using Biographical Linked Data.
Proceedings of the Workshop on Humanities in the Semantic Web (WHiSe II), pp. 15-26, CEUR Workshop Proceedings, Vol. 2014, Vienna, Austria, October, 2017.
This paper describes the ontology model and published datasets of a digitized biographical person register. The applied ontology model is designed to represent people via their enduring roles and perduring lifetime events. The model is designed to support 1) prosopographical Digital Humanities research, 2) linking to resources in semantic Cultural Heritage portals, and 3) semantic data validation and enrichment by using SPARQL queries. The linked data approach makes it possible to enrich a person's biography by interlinking it with space- and time-related biographical events, persons related by social contacts or family relations, historical events, and personal achievements.
Minna Tamper, Petri Leskinen, Esko Ikkala, Arttu Oksanen, Eetu Mäkelä, Erkki Heino, Jouni Tuominen, Mikko Koho and Eero Hyvönen:
AATOS – a Configurable Tool for Automatic Annotation.
Proceedings, Language, Data and Knowledge (LDK 2017), pp. 276-289, Springer-Verlag, Galway, Ireland, June, 2017.
This paper presents an automatic annotation tool, AATOS, for providing documents with semantic annotations. The tool links entities found in the texts to ontologies defined by the user. The application is highly configurable and can be used with different natural language Finnish texts. The application was developed as a part of the WarSampo and Semantic Finlex projects and tested using Kansa Taisteli magazine articles and the consolidated Finnish legislation of Semantic Finlex. The quality of the automatic annotation was evaluated by measuring precision and recall against existing manual annotations. The results showed that the quality of the input text, as well as the selection and configuration of the ontologies, impacted the results.
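As an illustration of the evaluation described in the abstract above, the following minimal sketch computes precision and recall of automatic annotations against manual ones. It is not the AATOS evaluation code: the function name, the representation of an annotation as a (document, entity URI) pair, and the sample data are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Compare automatic annotations with manual (gold) annotations.

    Both arguments are collections of (document_id, entity_uri) pairs.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of automatic annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of manual annotations that were found automatically.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical annotations; the ex: URIs are placeholders, not real data.
manual = {
    ("doc1", "ex:Helsinki"),
    ("doc1", "ex:Mannerheim"),
    ("doc2", "ex:Viipuri"),
}
automatic = {
    ("doc1", "ex:Helsinki"),
    ("doc2", "ex:Viipuri"),
    ("doc2", "ex:Turku"),
    ("doc2", "ex:Oulu"),
}
precision, recall = precision_recall(automatic, manual)
```

With this sample data, two of the four automatic annotations are correct (precision 0.5) and two of the three manual annotations are found (recall about 0.67).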
Erkki Heino, Minna Tamper, Eetu Mäkelä, Petri Leskinen, Esko Ikkala, Jouni Tuominen, Mikko Koho and Eero Hyvönen:
Named Entity Linking in a Complex Domain: Case Second World War History.
Proceedings, Language, Data and Knowledge (LDK 2017), pp. 120-133, Springer-Verlag, Galway, Ireland, June, 2017.
This paper discusses the challenges of applying named entity linking in a rich, complex domain – specifically, the linking of 1) military units, 2) places and 3) people in the context of rich Second World War data. Multiple sub-scenarios are discussed in detail through concrete evaluations, analyzing the problems faced, and the solutions developed. A key contribution of this work is to highlight the heterogeneity of problems and approaches needed even inside a single domain, depending on both the source data as well as the target authority.
Eero Hyvönen, Petri Leskinen, Erkki Heino, Jouni Tuominen and Laura Sirola:
Reassembling and Enriching the Life Stories in Printed Biographical Registers: Norssi High School Alumni on the Semantic Web.
Proceedings, Language, Data and Knowledge (LDK 2017), pp. 113-119, Springer-Verlag, Galway, Ireland, June, 2017.
This paper presents the idea of enriching printed biographical person registers with linked data related to events that took place after the register was published. By transforming printed historical documents into structured data, semantic search over written texts can be provided for the reader. Even more importantly, life stories of historical persons can be extended based on data linking by extracting semantic structures from printed texts, and by combining this data with external datasets and data services. Such linking provides an enriched context for prosopographical research on people in the register, as well as an enhanced reading experience for anyone interested in reading the biographies. As a concrete case study, a register 1867–1992 of over 10 000 alumni of the prominent Finnish high school “Norssi” was transformed into RDF, enriched by data linking, published as a linked data service, and provided to end users via a faceted search engine and browser for studying lives of historical persons and for prosopographical research.