Regularly updated data sources with global coverage are essential for (near) real-time forecasting. Christian Oswald and Daniel Ohrenhofer reveal how Wikipedia's ability to capture the salience and controversy of a topic makes it a valuable new data source for conflict forecasting.
How can we improve our ability to predict conflicts? Scholars have struggled with this question for a long time. However, as a discipline, and especially over the last two decades, political science has made substantial progress. In general, what we need to improve predictions are advances in data and methodology. Data advances involve both improving the quality of existing data and developing new data sources. We propose a new data source for conflict forecasting efforts: Wikipedia.
The number of country page views indicates international salience of, or interest in, a country. Meanwhile, the number of changes to a country page indicates political controversy between opposing political views.
We took part in the Violence Early-Warning System’s friendly competition to predict changes in battle-related deaths. In our work, we evaluate our findings with out-of-sample predictions using held-out, previously unseen data, and with true forecasts into the future. We find support for the predictive power of country page views, but not for page changes.
Globally available data, updated monthly, are ideal for (near) real-time forecasting. However, many commonly used data sources are available only annually. They are updated once a year, often with considerable delay.
Some of these variables, such as democracy or GDP, tend to be relatively static over time. Furthermore, many data sources face the problem of missing values. These occur when it is not possible to find reliable data for a variable for a given country.
More recent data sources such as Twitter, images or text as data, or mobile phone data, often do not provide global coverage. What's more, collecting and manipulating data from such sources is typically computationally and/or financially costly. Wikipedia provides an alternative data source that, to some extent, overcomes many of these limitations.
Wikipedia has been among the most frequently visited websites in the world for quite some time. It is also the top-referred website from Google. Data on page views, page changes, and other information are readily and openly available, and can be extracted with relative ease. They are thus inexpensive both financially and computationally, especially compared with processing large amounts of text or images.
Data are updated in real time and they provide global coverage; a Wikipedia page exists for every country, with page views and changes available. The same data can be extracted for different language editions. We focus on the English-language version but use data from French Wikipedia, too.
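To give a sense of how little effort the extraction takes, here is a minimal Python sketch that requests monthly page views for a country page from the public Wikimedia Pageviews REST API. It is an illustration only, not the code we used in the competition. The helper names, the choice of the `user` agent filter, and the placeholder User-Agent string are assumptions made for this example.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Public Wikimedia REST endpoint for per-article page views.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(article, start, end, project="en.wikipedia"):
    """Build the request URL for monthly page views of one article.

    start and end are YYYYMMDD strings. Counting only the "user" agent
    (excluding known bots) is an assumption of this sketch.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return f"{PAGEVIEWS_BASE}/{project}/all-access/user/{title}/monthly/{start}/{end}"


def monthly_views(payload):
    """Turn a decoded API response into a {"YYYY-MM": views} mapping."""
    return {
        f"{item['timestamp'][:4]}-{item['timestamp'][4:6]}": item["views"]
        for item in payload.get("items", [])
    }


def fetch_monthly_views(article, start, end, project="en.wikipedia"):
    """Download and parse monthly page views (requires network access)."""
    request = Request(
        pageviews_url(article, start, end, project),
        # Placeholder contact string; replace it with your own details.
        headers={"User-Agent": "wikipedia-forecast-sketch/0.1 (you@example.org)"},
    )
    with urlopen(request, timeout=30) as response:
        return monthly_views(json.load(response))
```

Calling `fetch_monthly_views("Myanmar", "20210101", "20210331")` would return one entry per month for the English-language page, and passing `project="fr.wikipedia"` together with the French page title switches to the French edition.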
So how do we use Wikipedia for conflict forecasting? Country page views, we argue, indicate international interest in that country. Compared to Google Trends, which captures what individuals are searching for, Wikipedia page views capture what individuals actually read. Reports on traditional or social media generate interest, after which people search for a country on Google. Readers who want to know more about this country go to the Wikipedia page, which offers a concise summary with all the important information.
An illustrative example of our proposed mechanism is the 2021 military coup in Myanmar. Interest in traditional and social media increased immediately after news broke on 1 February. However, it took three weeks for the first reported casualty to be confirmed. By the end of March there were more than 400 confirmed casualties. Clearly, increased interest is not (necessarily) generated by reported casualties, but by events in the run-up to casualties, such as a coup or mass protests.
Page changes capture controversy between opposing political views. Scientific topics such as global warming (politically controversial) are edited more often than continental drift (less controversial). This signifies a tug of war for interpretive predominance. We applied the same logic to countries and their politically controversial topics.
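As a rough illustration of how page changes can be turned into a monthly measure, the sketch below counts edits per calendar month from a list of revision timestamps. It assumes the ISO 8601 format in which the MediaWiki API reports revision times (for example `2021-02-01T07:12:00Z`). The function name and the sample timestamps are invented for this example and are not real edit data.

```python
from collections import Counter


def monthly_edit_counts(timestamps):
    """Count revisions per calendar month.

    timestamps: ISO 8601 strings such as "2021-02-01T07:12:00Z".
    Returns a {"YYYY-MM": number_of_edits} mapping sorted by month.
    """
    counts = Counter(ts[:7] for ts in timestamps)  # "YYYY-MM" prefix
    return dict(sorted(counts.items()))


# Invented timestamps, used only to show the shape of the output.
example_revisions = [
    "2021-01-15T10:00:00Z",
    "2021-02-01T07:12:00Z",
    "2021-02-01T09:45:30Z",
    "2021-02-03T18:20:00Z",
]
edits_per_month = monthly_edit_counts(example_revisions)
```

A month with many more edits than usual would then be read as a sign of heightened controversy on that country's page.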
Results for the out-of-sample predictions on held-out data were promising, but mixed. However, there was a noticeable increase in predictive power of up to 62% for the true forecast into the future. In addition, in an ensemble of nine different models in the prediction competition, our model provided some of the most revealing insights, disclosing patterns and trends which other models detected only to a lesser extent. Our model, on aggregate, likewise did well in predicting the right amount of change and especially its variance.
Wikipedia country page views were most useful for predicting early signals of tension escalation and long-term trends like those recently seen in Egypt, Nigeria, and Cameroon. We also obtained promising results for Ethiopia and Rwanda, which experienced low-intensity flare-ups. Our model turned out to be less useful for predicting sudden and sharp changes in conflict intensity, such as those in Niger, Libya, and Mali.
In general, we found that page views improve predictive performance, but page changes do not.
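For readers unfamiliar with this kind of comparison, the sketch below shows one common way to express a gain in predictive performance: compute an error score on held-out data for a model without and a model with the new feature, then report the relative reduction in error. Mean squared error is an assumption chosen for this example, and all numbers are invented for illustration; they are not our competition results.

```python
from statistics import mean


def mse(actual, predicted):
    """Mean squared error between observed and predicted values."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def relative_improvement(baseline_error, model_error):
    """Share of the baseline error removed by the new model (0.25 = 25%)."""
    return (baseline_error - model_error) / baseline_error


# Invented held-out observations and predictions, for illustration only.
actual = [0.0, 1.0, 3.0, 2.0]
baseline_pred = [0.0, 0.0, 1.0, 2.0]  # hypothetical model without page views
views_pred = [0.0, 1.0, 2.0, 2.0]     # hypothetical model with page views

baseline_mse = mse(actual, baseline_pred)
views_mse = mse(actual, views_pred)
gain = relative_improvement(baseline_mse, views_mse)
```

A positive `gain` means the model with page views made smaller errors on the unseen data than the baseline did; a value near zero or below would mean the added feature did not help.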
The introduction of Wikipedia data is a new and promising approach to improve our ability to forecast conflicts. Country page views seem to capture international interest in a country, which can be considered a precursor to escalation. Wikipedia data also provide unique insights which other commonly used data sources or models capture only to a lesser extent.
However, this approach still has its limitations, and there are several avenues for future research. Especially with regard to page changes, we outline several ideas. These include filtering by specific keywords to capture, for example, protest or secessionist movements, or territorial disputes. Our aim was to introduce Wikipedia as a data source for conflict forecasting efforts in a way that requires as little user action, and as few decisions, as possible. For this, we received the competition’s award for transparency and replicability.
There is great potential for new and refined measures using Wikipedia data. As a result, we regard this contribution as a starting point for more ambitious approaches in the future, in conflict forecasting and beyond.