Building the European Social Innovation Database with Natural Language Processing and Machine Learning

We identified potential social innovation projects from a variety of sources. We then retrieved the textual data (e.g., websites) of these projects in an information retrieval stage. From the textual data, we extracted and derived useful information such as topics, locations, summaries, and social innovation scores for each project. Unstructured textual data is stored in MongoDB, while structured data is stored in a MySQL database. Each of these steps is detailed below (Fig. 1).

Fig. 1: An overview of the ESID methodology.

Project identification

The first step in constructing our data was the identification of potential social innovation projects. We commenced by searching a range of relevant data sources, such as existing online human-curated social innovation databases and repositories, which served as a seed for training our machine learning models. We then initiated an additional open search phase in which we gradually moved to indirect sources, for instance the European Union Social Innovation Competition19, a prize awarded to social innovation projects; the Stanford Social Innovation Review20, a practice journal that includes articles and case studies mentioning social innovation projects; and Ashoka21, a social innovation community that maintains a registry of its members. As a next step, we plan to move to even more indirect sources, such as large crowdfunding platforms not specific to social innovation and, at the end of the spectrum, open web search.

Information retrieval

We commenced our initial engagement with the data by scraping the sources containing potential projects, using custom scripts based on the Scrapy framework. In this phase we recorded the identified potential projects in a MySQL database, together with the information available in these sources, such as project title, URL, short description, the country and city where the project is based, and other details where available. A minimal sketch of such a spider is shown below.
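
As an illustration, the following is a minimal sketch of the kind of Scrapy spider used at this stage. The listing URL, CSS selectors, and field names are assumptions for illustration only; the actual ESID scrapers are tailored to each source.

```python
# Illustrative Scrapy spider; URL, selectors, and field names are hypothetical.
import scrapy


class ProjectSpider(scrapy.Spider):
    name = "project_spider"
    # Hypothetical listing page of a social innovation source
    start_urls = ["https://example-si-source.org/projects"]

    def parse(self, response):
        # One record per potential project; an item pipeline would then
        # write these records to the MySQL projects table.
        for card in response.css("div.project-card"):
            yield {
                "title": card.css("h2::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "description": card.css("p.summary::text").get(),
                "country": card.css("span.country::text").get(),
                "city": card.css("span.city::text").get(),
            }
```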

For each potential project identified, we fed the project URLs to an Apache Nutch v1.18-based system for deep crawling. The crawled unstructured website text, indexed in Apache Solr 8.8.1, was then transferred to a collection in MongoDB together with metadata (e.g., page URL, title, timestamp). The collection in MongoDB is linked to the MySQL database through a common project ID (see the sketch below).
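
A minimal sketch of this transfer step follows, using the pysolr and pymongo client libraries. The host names, the Solr core name, and the document field names follow common Nutch/Solr defaults but are assumptions, not the actual ESID configuration.

```python
# Sketch: copy crawled pages from the Solr index into MongoDB, keyed by the
# common project ID that links back to the MySQL record.
from urllib.parse import urlparse

import pysolr
from pymongo import MongoClient

solr = pysolr.Solr("http://localhost:8983/solr/nutch", timeout=10)
pages = MongoClient("mongodb://localhost:27017")["esid"]["crawled_pages"]

def transfer_project_pages(project_id: int, site_url: str) -> None:
    """Copy all Solr documents crawled for one project website into MongoDB."""
    host = urlparse(site_url).netloc
    for doc in solr.search(f'host:"{host}"', rows=1000):
        pages.insert_one({
            "project_id": project_id,  # links back to the MySQL record
            "url": doc.get("url"),
            "title": doc.get("title"),
            "content": doc.get("content"),
            "timestamp": doc.get("tstamp"),
        })
```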

There were certain challenges associated with this step: crawling is not always straightforward, as some project websites are temporarily or permanently unavailable, are not amenable to crawling due to extensive use of JavaScript or crawl-protection technologies, or contain too little or too much text. This requires extensive human supervision and data cleaning. To address some of these crawling problems, we also implemented an additional layer of the Apache Nutch system that uses a Selenium plug-in, which allows us to combine the robustness and scalability of Apache Nutch with the JavaScript-handling capabilities of Selenium.

Information extraction

Social innovation scores

In the next stage, we derived several features/variables from the unstructured text of the social innovation project websites. The foremost feature is a set of scores measuring the social innovativeness of a project. The definition of social innovation varies in the literature, and a precise definition of the concept has proved elusive. As a result, the meaning of the term has been a matter of ongoing discussion: a substantial part of the research on the topic is devoted exclusively to the precise definition of the concept, while almost all empirical studies spend a considerable amount of effort justifying the definition they employ.

A review of several studies that surveyed different meanings of the modern concept reveals four main elements that define social innovation9,22,23,24,25,26,27,28,29. While nuances between definitions vary, these broad criteria generally apply:

  i. Objectives: Social innovations satisfy societal needs – including the needs of particular social groups (or aim at social value creation) – that are usually not met by conventional innovative activity (cf. economic innovation), either as a goal or as an end-product. As a result, social innovation does not produce conventional innovation outputs such as patents and publications.

  ii. Actors and actor interactions: Innovations created by actors who are not usually involved in economic innovation, including informal actors, are also defined as social innovation. Some authors stress that innovations must predominantly involve new types of social interactions that achieve common goals, and/or that they must rely on trust rather than mutual-benefit relationships. Similarly, some authors consider innovations that involve different action and diffusion processes but ultimately result in social progress to be social innovation.

  iii. Outputs/Outcomes: Early definitions of social innovation strongly relate it to the production of social technologies (cf. innovation employing only physical technologies) or intangible innovation. This is complemented by definitions indicating that social innovation changes the attitudes, behaviours, and perceptions of the actors involved. Other definitions stress the public good that social innovation creates. Social innovation is often associated with long-term institutional/cultural change.

  iv. Innovativeness: Innovativeness is generally used to differentiate social innovation from social entrepreneurship. It covers not only technological but also non-technological innovation.

Rather than adopting a particular definition, ESID employs a flexible conceptual structure in which we disentangle the concept of social innovation on the basis of the above four definitional components30. ESID assigns each project a score for each of the i) objectives, ii) actors and actor interactions, iii) outputs/outcomes, and iv) innovativeness criteria, so each project has four social innovation scores. The scoring system we developed enables users to filter the social innovation projects according to their preferred definition. It is also critical for verifying the quality of the identified projects, as some projects included from certain sources score very low on all criteria. A minimal filtering sketch is given below.
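
To illustrate how the four scores support definition-based filtering, here is a minimal sketch; the column names and the exported file are hypothetical assumptions about the ESID schema.

```python
# Example working definition: full indication of societal objectives and at
# least partial indication of innovativeness. Column names are hypothetical.
import pandas as pd

projects = pd.read_csv("esid_projects.csv")  # hypothetical export of ESID

filtered = projects[
    (projects["objectives_score"] == 2) & (projects["innovativeness_score"] >= 1)
]
print(f"{len(filtered)} projects match this working definition")
```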

For each of these criteria, we created supervised machine learning models trained to predict the criterion's score. A manual annotation of about 20% of the projects is used to train and evaluate our models. New annotations were added whenever we included projects from a new data source, so as to keep the training set representative of our dataset, but we always used the same annotation guidelines and moderator to ensure consistency. We experimented with various human annotation approaches in several annotation workshops, including annotating sentences that indicate each of the four social innovation criteria mentioned above, as well as assigning scores to each criterion. We observed that project-level scores were more reliable, efficient, and robust for our models than sentence annotation. For scoring, we also experimented with various scoring schemas (5-bin, 3-bin, 2-bin) and settled on a 3-bin scoring (0: no indication of the criterion, 1: partial indication of the criterion, 2: full indication of the criterion), as it ensured the best inter-annotator agreement and model performance. Annotation criteria and guidelines are presented in Tables 1 and 2.

Table 1 Annotation Criteria.
Table 2 Annotation Guidelines.

Utilising the human annotation, we experimented with a variety of supervised model types and specifications, and we obtained the best results (i.e., F1 scores around 0.90) with the Bidirectional Encoder Representations from Transformers (BERT) language representation model31 (see Table 3) using a 3-bin classification (0, 1, 2). Using BERT requires fine-tuning on a small dataset of the designated classes. To this end, we used 5,162 projects, with 80% of the dataset for training and 20% for testing. Before classification, we pre-processed the text by (i) removing HTML tags, if present, and special characters such as #, =, and &; (ii) eliminating long sequences of characters, i.e., those longer than 20 characters; and (iii) dropping duplicated sentences while maintaining the order of the text. To train the BERT classifier, we fine-tuned the BERT-base cased model using the SimpleTransformers framework32. A sketch of this pipeline is given below.
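
The following is a minimal sketch of the pre-processing and fine-tuning steps described above, using the SimpleTransformers API. The hyperparameters and the placeholder data are illustrative assumptions; in ESID, one such model is trained per criterion, with labels in {0, 1, 2}.

```python
import re

import pandas as pd
from simpletransformers.classification import ClassificationArgs, ClassificationModel

def preprocess(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)  # (i) drop HTML tags...
    text = re.sub(r"[#=&]", " ", text)    # ...and special characters
    text = re.sub(r"\S{21,}", " ", text)  # (ii) drop runs longer than 20 characters
    seen, kept = set(), []                # (iii) de-duplicate sentences, keep order
    for sent in re.split(r"(?<=[.!?])\s+", text):
        if sent and sent not in seen:
            seen.add(sent)
            kept.append(sent)
    return " ".join(kept)

# Placeholder training data; the real training set is the annotated 80% split.
train_df = pd.DataFrame({
    "text": [preprocess("<p>The project aims to reduce food waste ...</p>")],
    "labels": [2],
})

args = ClassificationArgs(num_train_epochs=3, max_seq_length=512,
                          overwrite_output_dir=True)
model = ClassificationModel("bert", "bert-base-cased", num_labels=3,
                            args=args, use_cuda=False)  # set use_cuda=True on a GPU
model.train_model(train_df)
predictions, _ = model.predict([preprocess("Another project description ...")])
```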

Table 3 Performance of Social Innovation Criteria Models.

The best results were yielded by fine-tuning BERT, followed by the LR (logistic regression) classifier (Table 3). The BERT model has achieved state-of-the-art results in several Natural Language Processing tasks33,34. BERT is a context-based transformer language model that generates a representation for each word based on the distributional semantics hypothesis from linguistics35, which states that words used in the same contexts tend to convey similar meanings. The word representations in BERT are learned with an unsupervised language model over a large amount of text, in which certain words are masked and the task of the model is to predict the masked words from the surrounding context. Contextuality, the attention mechanism36 that gives importance to certain portions of the context, and the depth of the model all contribute to BERT's superiority over previous approaches.

Summarisation

Availability of short project descriptions is important for two reasons. First, short descriptions provide users with a quick snapshot and offer the opportunity to extract further features through topic modelling. Second, the BERT model we employed to score the four social innovation criteria requires a relatively short input (at most 512 tokens). To obtain short descriptions that are representative of the text, we experimented with three different models and used a combination of them (see Fig. 2).

  • Support Vector Machine (SVM) is a machine learning algorithm that works on high-dimensional datasets by finding the hyperplane that best separates the data into classes. The central idea in SVM is the choice of a hyperplane that sits right in the middle of two classes; more mathematically, it chooses the hyperplane that maximises the minimum distance between the hyperplane and all the examples37. In the binary SVM summarisation implementation, we worked on the assumption that the summarisation task can be modelled as a classification task, in which the constituent sentences of a text are classified as either being part of the summary or not. It was hypothesised that the words in a sentence could indicate whether it described the project (e.g., "the project aims to", "the goal of the project is to", etc.). A training set was created using the descriptions obtained from the original project sources and the unstructured crawled text of each website. Cosine similarities were computed between the sentences from the description and those from the crawled text; if the similarity score was above 0.8, the sentence was labelled as part of the summary, and otherwise as not. These sentences were then used to train the SVM algorithm to predict which sentences should become part of the summaries37 (see the sketch after this list).

  • For the social innovation criteria classifier, an annotated dataset was utilised. It consisted of sentences marked as explaining why a project satisfies one of the social innovation criteria. These were used as positive training instances for the SVM classifier.

  • SummaRuNNer is an extractive summarisation method, developed by IBM Watson, that utilises recurrent neural networks (GRUs). The method visits sentences sequentially and classifies each one by whether it should be included in the summary. It uses 100-dimensional word2vec embeddings.

  • Our combined approach consisted of the SVM-based method followed by SummaRuNNer. We observed that the binary SVM model produced rather long summaries, which was unsuitable for our goal of generating concise summaries. We therefore used that approach for the initial cleaning of the text and, once this was completed, used SummaRuNNer to shorten the text and generate the final summary38. This combined approach gave the best results for our summarisation task.
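
As referenced in the first item above, here is a sketch of the binary-SVM sentence labelling: sentences from the crawled text are labelled positive when their cosine similarity to any sentence of the source description exceeds 0.8. TF-IDF features and scikit-learn's LinearSVC are assumptions standing in for the original feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

def label_sentences(description_sents, crawled_sents, threshold=0.8):
    """Label each crawled sentence 1 if it is close to the source description."""
    vec = TfidfVectorizer().fit(description_sents + crawled_sents)
    sims = cosine_similarity(vec.transform(crawled_sents),
                             vec.transform(description_sents))
    return [int(row.max() >= threshold) for row in sims]

def train_summary_classifier(sentences, labels):
    """Train the SVM to predict which sentences belong in a summary."""
    vec = TfidfVectorizer()
    clf = LinearSVC().fit(vec.fit_transform(sentences), labels)
    return vec, clf
```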

Fig. 2: Summarisation methodology.

Location detection

The location of a project is important information for studying territorial and policy dynamics. Some of the sources already include location information at a certain granularity (e.g., the country), but as we progress to identifying potential projects from more indirect sources, location information is increasingly unavailable from the sources themselves. This motivated us to extract locations from the project corpus. In this process, we encountered three problems. First, for some projects the location is not mentioned in the text at all. Second, for some projects numerous locations are mentioned, and it is not clear which of them is the location of the project. Third, when a location is mentioned, it is sometimes incomplete (e.g., only a city is mentioned with no country information, yet cities with the same name exist in various countries). To overcome these problems, we developed an algorithm that uses Named Entity Recognition (NER) to identify locations and graph theory to analyse the inter-connectivity between these location entities (Fig. 3), with the following key steps (a condensed sketch follows the list):

  • Text pre-processing: as some projects have a vast amount of text, we removed duplicates and selected the pages most likely to contain the correct project location, such as the home, "about us", or "contact us" pages (or string variations thereof in the page title or URL).

  • Location entity extraction: after pre-processing the text, we use a Named Entity Recognition (NER) model to identify entities of type Location. We used a state-of-the-art NER model called ner-english-ontonotes-large39, which is based on the Transformer architecture and trained on the OntoNotes dataset. After extracting all the entities, we aggregate them by frequency for further processing.

  • Location scoring: after extracting all the location named entities, the next step is to filter out irrelevant ones, i.e., those that do not represent the project location. For each location entity, we call an online API, Nominatim40, which relies on OpenStreetMap (OSM) data, to retrieve metadata about the entity. Nominatim is widely used in the study of innovation in a geographical context. The API returns a JSON response per request with a list of candidates, where each candidate holds the following keys: the corresponding country name, the country name in Alpha-3 encoding, an importance score based on the OSM ranking algorithm, and the location type (city, country, village, suburb, or region). We split the locations into Country and City based on the location type. At this stage, each location has an importance score, a frequency score, and a type, but we still do not know the correct project location. We used graph theory to represent the extracted locations together with their metadata as a weighted directed graph consisting of nodes and directed edges. First, we create the nodes of type Country. Next, for each location of type City, we find its potential corresponding countries from the Nominatim API response and create a directed link from the city to each of its potential countries. The edge between a given city and a country is weighted according to the following equation:

    country_city_edge_weight = alpha * city_importance * city_frequency + beta * country_frequency

    where alpha and beta are adjustable hyperparameters controlling the impact of the city and the country on the edge weight. We set them to 0.7 and 0.3, respectively, as the city provides more specific and useful information. Finally, each country receives a score equal to the sum of the weights of its incoming edges.

  • Location aggregation: a project could take place in one or more countries, and the algorithm should be able to decide which. To this end, we use the MeanShift41 algorithm to cluster countries into groups based on their scores. The group with the highest score contains the most probable location(s).

  • Location information retrieval: once the city and country are determined, we retrieve their details from the Nominatim responses already cached during the scoring step. We extract the following: city name, city type, country name, country type, country ISO Alpha-3 name, longitude and latitude of both city and country, and the Wikidata IDs of the city and country.

  • Information storage: once the information is extracted, we insert it into a table in our MySQL database.
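
As referenced before the list, the following is a condensed sketch of these steps: Flair NER for location entities, Nominatim lookups, weighted city-to-country edges (alpha = 0.7 and beta = 0.3, as in the text), and MeanShift clustering over country scores. The handling of the Nominatim response fields ("addresstype", "importance", "display_name") and the error cases are simplifying assumptions, not the exact ESID implementation.

```python
from collections import Counter

import numpy as np
import requests
from flair.data import Sentence
from flair.models import SequenceTagger
from sklearn.cluster import MeanShift

ALPHA, BETA = 0.7, 0.3
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

def extract_location_entities(text: str) -> Counter:
    """Return location-entity frequencies; OntoNotes tags places as GPE/LOC."""
    sentence = Sentence(text)
    tagger.predict(sentence)
    return Counter(span.text for span in sentence.get_spans("ner")
                   if span.tag in ("GPE", "LOC"))

def nominatim_top_candidate(name: str) -> dict | None:
    """Resolve one entity via Nominatim and keep the top candidate."""
    resp = requests.get("https://nominatim.openstreetmap.org/search",
                        params={"q": name, "format": "json"},
                        headers={"User-Agent": "esid-location-sketch"})
    candidates = resp.json()
    return candidates[0] if candidates else None

def score_countries(entity_freqs: Counter) -> Counter:
    """Sum weighted incoming edges per country, following the equation above."""
    resolved = {n: nominatim_top_candidate(n) for n in entity_freqs}
    country_freq = Counter()
    for name, cand in resolved.items():
        if cand and cand.get("addresstype") == "country":
            country_freq[cand["display_name"]] += entity_freqs[name]
    scores = Counter()
    for name, cand in resolved.items():
        if cand and cand.get("addresstype") != "country":
            # assume the last display_name component names the country
            country = cand["display_name"].split(", ")[-1]
            scores[country] += (ALPHA * float(cand["importance"]) * entity_freqs[name]
                                + BETA * country_freq.get(country, 0))
    return scores

def most_probable_countries(scores: Counter) -> list[str]:
    """Cluster country scores with MeanShift; keep the top-scoring cluster."""
    names = list(scores)
    values = np.array([[scores[n]] for n in names])
    labels = MeanShift().fit_predict(values)
    best_cluster = labels[int(np.argmax(values))]
    return [n for n, lab in zip(names, labels) if lab == best_cluster]
```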

Fig. 3: Flowchart of the location detection algorithm.

The performance of our location detection algorithm is presented in Table 4.

Table 4 Performance of Location Detection Algorithm.

Topics

We tagged projects with an ontology of topics developed as part of the KNOWMAK project42. The ontology has two main classes, Key Enabling Technologies (KETs) and Societal Grand Challenges (SGCs), based on the EU H2020 priorities. The topics are hierarchical, with two sub-levels under each main class. We used the KNOWMAK ontology API to classify topics by submitting the summaries obtained in the previous step. The API returned a number of topics associated with each project, together with assigned scores. We set a threshold on the scores and assigned the topics scoring above that threshold as the topics of the project, as sketched below.
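
A hypothetical sketch of this thresholding step follows. The endpoint URL, request payload, response shape, and threshold value are illustrative assumptions, not the documented KNOWMAK interface.

```python
import requests

KNOWMAK_URL = "https://example.org/knowmak/classify"  # placeholder endpoint
THRESHOLD = 0.5                                       # illustrative cut-off

def assign_topics(summary: str) -> list[str]:
    """Return the ontology topics whose scores clear the threshold."""
    resp = requests.post(KNOWMAK_URL, json={"text": summary})
    scored = resp.json()["topics"]  # assumed: [{"topic": ..., "score": ...}, ...]
    return [t["topic"] for t in scored if t["score"] >= THRESHOLD]
```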

Future development

The present study lays the groundwork for future work, already underway, to improve ESID in the following three areas:

Expansion

In the next stage, we are initiating additional open search phases that exploit more indirect sources, such as crowdfunding platforms. While this will bring additional information retrieval challenges, as these platforms usually do not allow programmatic access, it will expand the database by providing a more comprehensive list of social innovation projects.

Extension

We are also planning to add two main features. First, we will work with the actors (e.g., organisations related to the projects) mentioned on the project websites. These actors will be classified by their organisational type (university, public sector, third sector, etc.) and their relationship to the project (project owner, funder, partner, etc.). Second, we will extract the objectives (the social problems the projects aim to address, e.g., climate change), activities (the activities they conduct to achieve their objectives, e.g., reducing food waste), and accomplishments (the tangible accomplishments achieved, e.g., reducing food waste by a certain amount). We will extract the sentences related to each of these three features and subsequently run a topic model to classify them into categories.

Dynamic retrieval

Currently, ESID crawls the project websites and extracts the related variables only once. We plan to re-crawl the project websites at regular intervals to identify changes in these variables, which would provide us with time-series data.
