Enabling a Virtuous Cycle of Research in the Humanities (VISCERA)

The goal of the VISCERA project is to develop tools that support each step in the lifecycle of a digital humanities project and, crucially, to enable the results of such research to be published as data for others to use in further research.

To ensure that the tools meet the needs of humanities users, they are being developed iteratively in relation to concrete case studies in collaborating digital humanities projects. The task of the computer scientist is then to see beyond these individual studies and to identify common components that allow the tools to generalize beyond the projects under immediate scrutiny.

Context

The core project is led by D.Sc. Eetu Mäkelä, under funding from the Academy of Finland for 2015-2018. However, as described, the project closely interacts with multiple others, listed below.

Linked Open Data Science Service (LODsci), 2015-2016

This project, underway at the Semantic Computing Research Group of Aalto University, is funded by the Finnish Ministry of Education and Culture as part of the national Open Science and Research initiative. Its aims closely align with those of VISCERA: to develop tools and support for publishing and using scientific data in the context of a Linked Open Data service.

Reassembling the Republic of Letters, 2014-2018

An EU COST Action with 31 participating countries seeking to plan a state-of-the-art digital infrastructure within which to collect a pan-European pool of highly granular data on the Republic of Letters. This involves designing tools for collecting, standardizing, navigating, analysing, and visualizing unprecedented quantities of epistolary data, and for facilitating new forms of international and interdisciplinary scholarly collaboration.

Cultures of Knowledge, 2009-2017

Based at the University of Oxford and funded by the Andrew W. Mellon Foundation, the CofK project develops the Early Modern Letters Online (EMLO) database, which intends to act as a hub for collecting metadata on the Republic of Letters. Here, work has thus far focused on ensuring data quality: supporting the strong identification of people, places and letters in data to be entered, as well as discovering duplicates already in the database. In the future, work would also focus on tools to support visualizations and research being done based on the database.

Interfacing Structured and Unstructured Data in Sociolinguistic Research on Language Change (STRATAS), 2016-2019

Funded by the Academy of Finland, this project integrates and aligns VISCERA tool development with the field of historical sociolinguistics. Partners in the project come from the Research Unit for the Study of Variation, Contacts and Change in English, as well as the Department of Finnish, Finno-Ugrian and Scandinavian Studies at the University of Helsinki. Additional support on user interface development is gained from the Tampere Unit for Computer-Human Interaction (TAUCHI) at the University of Tampere.

In addition to the above-mentioned projects, collaboration is also ongoing, without dedicated funding, with the following institutions:

Humanities+Design Research Laboratory, Stanford University

In collaboration with the Humanities+Design laboratory at Stanford University, a version of the VISCERA tools is being designed that targets individual humanities scholars. Using these tools, scholars should be able to import existing rich structured data for their own research. Having cleaned, explored and expanded that data to make grounded inferences, they could then publish their interpreted data for others to use.

University of Colorado Boulder

A long-standing collaboration with the University of Colorado Boulder continues on improving access to and understanding of historical primary sources, in this instance relating to the First World War. Technically, this project deals with bridging the gap from OCRed primary source material to structured collection metadata in e.g. Europeana and the Digital Public Library of America. Concretely, a contextual reader interface has been developed, where concepts and named entities such as events, people and places are automatically extracted from the sources under study, and additional information on them, as well as other sources pertaining to them, is presented to the user.

Khepri

Previously, most of the work in the project was preparatory: requirements were gathered by going over the humanities research processes, and the functionalities to be developed were simulated through ad hoc, disconnected components, tied together and supplemented by the manual work of the computer scientist.

Through these collaborations, a prevalent common process of inquiry was identified: the need to explore, as well as contrast, differently constrained subsets of a dataset. As concrete examples, this might mean looking at the correspondence networks of different individuals and comparing them, or looking at how the possible values of a linguistic variable behave with respect to associated metadata and each other.

Now, a tool, nicknamed Khepri, is being developed to support this process. To ensure that the tool meets the varying needs of the associated projects, the intention is to develop a modular set of components that can be connected and configured to respond to the needs of a particular humanities task and data.

To support this, the Khepri tool utilizes the view-based querying paradigm, where data is presented simultaneously from different perspectives, with each perspective acting both as a visualization and as a means to constrain what is shown in all the views. A proper implementation of the paradigm also allows for quick, informed variation of query parameters, and thus dynamic exploration.

Because the views interact in a defined way, they can be developed as separate components targeting the major visualization classes such as geographical, temporal or statistical. Each individual Khepri instance can then select from these the views suitable for that particular use.
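The interaction between views described above can be sketched in a few lines of Python. This is only an illustration of the view-based querying idea, not Khepri's actual implementation: the `View` and `Workspace` classes, their methods, and the sample letter records are all hypothetical names and data invented for the example. Each view is tied to one metadata field, summarises the currently visible records (standing in for a real visualization), and can contribute a selection that constrains what every view shows.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Iterable, List, Optional, Set

Record = Dict[str, Hashable]


@dataclass
class View:
    """One perspective on the data, tied to a single metadata field (hypothetical)."""
    name: str
    field_name: str
    selected: Optional[Set[Hashable]] = None  # None means this view constrains nothing

    def accepts(self, record: Record) -> bool:
        """Return True if the record passes this view's current selection."""
        return self.selected is None or record.get(self.field_name) in self.selected

    def summarise(self, records: List[Record]) -> Dict[Hashable, int]:
        """Stand-in for a visualization: count records per value of the field."""
        counts: Dict[Hashable, int] = {}
        for record in records:
            value = record.get(self.field_name)
            counts[value] = counts.get(value, 0) + 1
        return counts


class Workspace:
    """Holds a dataset and the views configured for one instance (hypothetical)."""

    def __init__(self, records: List[Record], views: Iterable[View]) -> None:
        self.records = records
        self.views = {view.name: view for view in views}

    def select(self, view_name: str, values: Iterable[Hashable]) -> None:
        """Constrain the data through one view; all views are affected."""
        self.views[view_name].selected = set(values)

    def visible(self) -> List[Record]:
        """Records that satisfy the selections of every view at once."""
        return [r for r in self.records
                if all(view.accepts(r) for view in self.views.values())]

    def render(self) -> Dict[str, Dict[Hashable, int]]:
        """Summarise the visible records from each view's perspective."""
        rows = self.visible()
        return {name: view.summarise(rows) for name, view in self.views.items()}


# Invented sample data: four letters with a place and a decade.
letters = [
    {"sender": "A", "place": "Oxford", "decade": 1640},
    {"sender": "A", "place": "Paris", "decade": 1650},
    {"sender": "B", "place": "Oxford", "decade": 1650},
    {"sender": "B", "place": "Oxford", "decade": 1640},
]

# One instance picks the views it needs, here a geographical and a temporal one.
workspace = Workspace(letters, [View("map", "place"), View("timeline", "decade")])
workspace.select("map", ["Oxford"])
summary = workspace.render()
```

Because every view only needs to expose a selection and a summary, a new class of view could be added without changing the others, which is the kind of modularity the text describes.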

At present, a first complete iteration of the tool has been developed, configured for use in historical sociolinguistics (for details, see the Khepri paper from Digital Humanities 2016 listed under Publications).

Selected software

  • Recon, a multipurpose tool for semi-automatic matching of records against a SPARQL endpoint
  • CORE, a contextual reader based on Linked Data

Publications

2016

Kimmo Kettunen, Eetu Mäkelä, Juha Kuokkala, Teemu Ruokolainen and Jyrki Niemi: Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910. Proceedings of LWDA 2016, Potsdam, Germany, September, 2016.
Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1 960 921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 %. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text.
Eetu Mäkelä, Thea Lindquist and Eero Hyvönen: CORE - A Contextual Reader based on Linked Data. Proceedings of Digital Humanities 2016, long papers, pp. 267-269, Kraków, Poland, July, 2016.
CORE is a contextual reader application intended to improve user close reading experience, particularly with regard to material in an unfamiliar domain. CORE works by utilizing Linked Data reference vocabularies and datasets to identify entities in any PDF file or web page. For each discovered entity, pertinent information such as short descriptions, pictures, or maps are sourced and presented on a mouse-over, to allow users to familiarize themselves with any unfamiliar concepts, places, etc in the texts they are reading. If further information is needed, an entity can be clicked to open a full context pane, which supports deeper contextualization (also visually, e.g. by displaying interactive timelines or maps). Here, CORE also facilitates serendipitous discovery of further related knowledge, by being able to bring in and suggest related resources from various repositories. Clicking on any such resource loads it into the contextual reader for endless further browsing.
Eetu Mäkelä, Tanja Säily and Terttu Nevalainen: Khepri - a Modular View-Based Tool for Exploring (Historical Sociolinguistic) Data. Proceedings of Digital Humanities 2016, long papers, pp. 269-272, Kraków, Poland, July, 2016.
Digital humanities needs tools that better support the core processes of humanistic inquiry. This includes support for handling uncertainty and incompleteness in the data, for interactive exploration, and for fluidly moving between close and distant reading. The Khepri tool presented here is part of a user-centered project to develop a modular set of components that take these requirements into account, and can be connected and configured to respond to the needs of a particular humanities task and data. Here, the configuration presented is one for the field of historical sociolinguistics, developed in collaboration between computer scientists and sociolinguistic researchers.