Corruption can have vast negative consequences. Because of bribery and fraud, national and local administrations overpay for goods and services or investment projects (Bosio et al. 2020), draining public resources away from pro-growth expenditures (Baum et al. 2020). Patronage in public-sector hiring might have long-lasting consequences for the provision of public goods (Colonnelli et al. 2019). On the other hand, curbing corruption has been shown to be effective in reinvigorating private sector development (Giannetti et al. 2017).
A country with a corruption problem
According to the index of Transparency International, which ranks 180 countries inversely by their perceived levels of public sector corruption as stated by experts and businesspeople, Italy was in the 51st position in 2019, far behind Germany (9th), France (23rd), and Spain (30th). While organised crime was traditionally linked to the South, corruption has more recently moved North, both because organised crime converted from illegal activities to ‘normal’ entrepreneurial business and because of the emergence of cronyism and graft in the political establishment. As stated by Raffaele Cantone, the president of the Italian Anti-corruption Authority from 2014 to 2019: “Corruption is widespread throughout Italy and represents one of the greatest obstacles to its growth, not only in civil terms but also in social and economic ones. Identifying the areas most exposed to corruption – with specific relation to different regional features – and drafting an Italian map of bribery is an essential tool to fight it.”
How could such a map be drafted? In our recent paper (de Blasio et al. 2020), we show how machine learning (ML) algorithms can be harnessed to predict the occurrence of corruption crimes at the local level. Our main aim is to document the potential of these data-driven tools to support and improve anti-corruption policy targeting. With reference to accounting fraud, a similar exercise has been proposed by Kondo et al. (2020).
Using archives from the Ministry of Interior, we apply ML to predict white-collar crimes as classified by the Italian penal code, which include, among others, corruption, fraud, and collusion. We observe (up to 2014) the number of white-collar crimes for each municipality and year, but the economic value of the crime and the number of people involved are both unknown. Figure 1 shows a map for the year 2012 of the municipalities which experienced an increase in corruption episodes.
Figure 1 Increases in corruption episodes across Italy in 2012
Notes: ∆ WC crime rate is a binary variable taking value 1 if the white-collar crime rate (i.e. the number of white-collar crimes per 1,000 inhabitants) has increased with respect to the previous year (yellow), and 0 otherwise (green).
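The outcome variable described in the notes to Figure 1 can be written down in a few lines. The following is only a minimal sketch of that definition: the function names and arguments are hypothetical and do not come from the paper, and the handling of a zero population is our own assumption.

```python
def wc_crime_rate(crimes: int, population: int) -> float:
    """White-collar crimes per 1,000 inhabitants."""
    if population <= 0:
        # Assumption: a municipality without inhabitants is treated as invalid input.
        raise ValueError("population must be positive")
    return 1000.0 * crimes / population


def delta_wc_crime_rate(crimes_prev: int, pop_prev: int,
                        crimes_curr: int, pop_curr: int) -> int:
    """Return 1 if the crime rate rose relative to the previous year, 0 otherwise."""
    previous = wc_crime_rate(crimes_prev, pop_prev)
    current = wc_crime_rate(crimes_curr, pop_curr)
    return int(current > previous)
```

Note that an unchanged rate is coded as 0, in line with the "increased ... and 0 otherwise" wording of the notes.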
Armed with a large set of municipality-level characteristics for the year 2011, we train and test our algorithms on the data referring to the period 2011-2012. We then evaluate the accuracy of the predictions by using data from 2012 to 2014. The results we present, based on a classification tree (Hastie et al. 2009), show that even a simple algorithm can achieve high out-of-sample predictive accuracy. For instance, we correctly identify roughly 80% of the municipalities that will experience an increase in corruption crimes.
As depicted in Figure 2, the prediction depends on the values of a few variables, primarily encompassing characteristics of the local labor and housing markets. For example, the algorithm predicts an increase in white-collar crimes in municipalities with more than 7,361 inhabitants, with a mobility share higher than 38%, where buildings measure less than 106 square meters on average, and where the share of buildings in disuse is larger than 1.2%.
Figure 2 Classification tree for the increase in white-collar crimes
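To make the logic of a tree path concrete, the example path quoted in the text can be expressed as a chain of threshold checks. This is a sketch of that single path only, not the full tree in Figure 2: the class and field names are hypothetical, and a municipality that fails this check may still be flagged by another branch of the tree.

```python
from dataclasses import dataclass


@dataclass
class Municipality:
    # Hypothetical field names; shares are expressed as fractions (0.38 = 38%).
    population: int
    mobility_share: float
    avg_building_sqm: float
    disused_building_share: float


def matches_quoted_path(m: Municipality) -> bool:
    """True if the municipality satisfies the one tree path quoted in the text."""
    return (
        m.population > 7361
        and m.mobility_share > 0.38
        and m.avg_building_sqm < 106
        and m.disused_building_share > 0.012
    )


# Illustrative, made-up municipalities (not taken from the data).
example_large = Municipality(10000, 0.40, 95.0, 0.02)
example_small = Municipality(5000, 0.40, 95.0, 0.02)
```

Here `matches_quoted_path(example_large)` returns `True`, while `matches_quoted_path(example_small)` returns `False` because the population condition fails.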
A benchmark to assess the potential gains in the fight against corruption stemming from the use of ML algorithms in Italy is provided by Law 190/2012, “Rules for the prevention and repression of corruption and unlawfulness in public administration”, informally known as “Legge Severino” (from the name of the then Minister of Justice). This law introduced new and more stringent criteria to fight corruption. For instance, it expanded the definition of corruption and enhanced transparency and disclosure requirements for public-sector workers. On top of these general prescriptions, which apply to the entire Italian public administration, the law also introduced a number of additional restrictions related to the possibility of assigning directive positions in public administrations to those who had held political responsibilities in the previous years. At the local level, these more restrictive rules only apply to municipalities with more than 15,000 inhabitants.
The rationale behind excluding smaller municipalities is that the costs of regulation presumably exceed the associated benefits there, as these municipalities have very few white-collar crime episodes anyway. Smaller municipalities also receive fewer public resources, which makes them, in principle, less exposed to corruption risk. According to the findings of De Angelis et al. (2020), Law 190/2012 seems to have had a positive impact, at least in the South of Italy, where municipalities with over 15,000 residents experienced fewer corruption episodes linked to EU regional transfers after 2012.
Table 1 compares the predictions of the ML algorithm with the anti-corruption threshold. The latter does an excellent job for the municipalities that did not experience increases in corruption crimes, as 94.5% of them fall below the cut-off. Conversely, only 45.6% of the municipalities with an increase in crime fall above the cut-off and are hence caught in the more severe anti-corruption net. For these municipalities, the ML predictions do a better job, capturing 80% of all municipalities with an increase.
However, under current circumstances, it is difficult to imagine that a law would delegate the identification of municipalities to an algorithm. More realistically, ML forecasts could be used to prioritize anti-corruption efforts on the ground, such as those related to police investigations, excluding municipalities above 15,000 inhabitants that are predicted not to experience an increase in corruption episodes.
Table 1 ML predictions and anti-corruption threshold
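The comparison in Table 1 amounts to computing, for each targeting rule, the share of municipalities with an increase that the rule flags (sensitivity) and the share without an increase that it leaves out (specificity). The sketch below illustrates the calculation on a made-up toy sample: the populations, outcomes, and ML flags are invented for illustration and are not the paper's data, and the function names are our own.

```python
from typing import Sequence


def sensitivity(actual: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of municipalities with an actual increase that the rule flags."""
    positives = sum(1 for a in actual if a)
    hits = sum(1 for a, f in zip(actual, flagged) if a and f)
    return hits / positives


def specificity(actual: Sequence[bool], flagged: Sequence[bool]) -> float:
    """Share of municipalities without an increase that the rule does not flag."""
    negatives = sum(1 for a in actual if not a)
    correct = sum(1 for a, f in zip(actual, flagged) if not a and not f)
    return correct / negatives


def threshold_rule(population: int, cutoff: int = 15000) -> bool:
    """Population rule: stricter provisions apply above the cut-off."""
    return population > cutoff


# Toy sample, invented for illustration only.
populations = [20000, 9000, 16000, 3000, 8000]
actual_increase = [True, True, False, False, True]
ml_flagged = [True, True, False, False, False]  # hypothetical ML predictions

threshold_flagged = [threshold_rule(p) for p in populations]

threshold_sensitivity = sensitivity(actual_increase, threshold_flagged)
threshold_specificity = specificity(actual_increase, threshold_flagged)
ml_sensitivity = sensitivity(actual_increase, ml_flagged)
ml_specificity = specificity(actual_increase, ml_flagged)
```

In the text, the corresponding figures are 45.6% (sensitivity) and 94.5% (specificity) for the population threshold, and 80% (sensitivity) for the ML predictions.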
Transparency and bias
Our classification tree is intuitive and easy to understand (see Figure 2) even without a strong statistical background. This makes it appealing for policy-targeting purposes in a hypothetical scenario in which it were possible to use an algorithm to decide on the areas in which to apply some prescriptions of law. Obviously, a mechanism based on a single threshold, like the one envisaged by Law 190/2012, is easier to understand. However, the cost of endorsing a tree rather than a single population-based threshold rule might be considered not that large: predictions based on the decision tree increase effectiveness in finding ‘corrupted’ municipalities, while the population threshold does not have such a sound foundation. The increase in complexity can hence be justified and communicated as necessary to serve a public aim.
Also related to transparency, ML methods can highlight the targeting that an authority interested in fighting corruption should adopt. Therefore, they can also provide information on whether additional objectives (so-called ‘omitted payoffs’; see Kleinberg et al. 2018) play a role in this kind of public decision. For instance, corrupt politicians might conspire to steer police investigations away from certain places. Having an ML prediction map, which can easily be compared to areas with actual police efforts, might shed light on such episodes.
An important focus in the ML literature is its potential bias. Suppose that our data are contaminated because corruption episodes are more likely to be reported in certain communities, for instance in municipalities with higher social capital (Putnam 1993). If this is the case, then the ML prediction is likely to be biased as well, and the municipalities with higher endowments of social capital are most likely to be classified as experiencing an increase in corruption. However, the fact that we have used a sample that is artificially balanced on a number of observables might imply that our results are less exposed to such a bias. In any case, contamination issues have no easy solution. Kleinberg et al. (2020) suggest that it is the use of data-driven methods, rather than non-quantitative approaches, which ensures more progress on this front.
Authors’ note: The views expressed here are those of the authors and do not necessarily reflect those of the institutions they belong to.