Trade mis-invoicing in Africa¶

Policy and media attention on illicit financial flows (IFF) has increased, with the recognition that Africa is a net creditor to the world.


New York Times (2013)	Guardian (2015)

Guardian (2017)	Economist (2019)

What is trade mis-invoicing?

The deliberate mis-statement of price or quantity of internationally traded goods in invoices presented to customs
Can occur at import or export
Can result in an inflow or outflow of money

Motivations for trade mis-invoicing include:

Evading tariffs
Exploiting subsidy regimes
Subverting forex and capital controls
Hiding transfers of wealth

Mechanisms of mis-invoicing

From the reporting country’s perspective, trade mis-invoicing can result in an inflow or an outflow of capital, and this can be achieved by misreporting the value of both imports and exports. Money can be moved out of the country by over-invoicing imports, where that country pays too much money to buy goods from its partner; or by under-invoicing exports, where that country does not charge enough money for the goods that it sells to its partner. Conversely, money can be illicitly routed in to that country by under-invoicing imports, where the country pays too little money to buy goods from its partner; or by over-invoicing exports, where the country charges too much for the goods that it sells to its partner. The direction of trade mis-invoicing in addition to its mechanisms is represented in the diagram below.

conceptual-model

Why does trade mis-invoicing matter?

Outflows undermine the fiscal base and public spending
Financing gap needed to meet the Sustainable Development Goals (SDGs)
Combating trade mis-invoicing is crucial for the mobilization of domestic resources in the continent, and can catalyze sustainable development

Governance loop (credit: William Davis)

Data source

Trade mis-invoicing data-set of the United Nations Economic Commission for Africa
Panel with $n=6,248,254$ of mis-invoiced trade between 179 reporting jurisdictions and 179 partner countries for years 2000-2016 and disaggregated commodities
Citation: Lépissier, Alice, Davis, William, & Ibrahim, Gamal. (2019). Trade Mis-Invoicing Dataset (Version 1).

The entire data-set panel_results.Rdata with ~6 million observations is available online.
The unit of observation is a reporter-partner-commodity-year tuple, where the data represent the illicit flow embedded in a transaction between a reporter $i$ and a partner $j$ for a commodity $c$ in year $t$.
The full panel is ~2GB. This notebook works with smaller summary data-sets generated from the full panel. The code to generate these summary data-sets is available on the repository https://github.com/walice/Trade-IFF by running the script file Compute IFF Estimates.R.

Nota bene: the figures in this notebook that are not generated here can be reconstructed by running the Data Visualization.R script in that same repository.

Methodology for calculating mis-invoiced trade

We locate mis-invoicing in the discrepancies between reported trade flows and their mirrored statistics
But not all observed discrepancies are due to illicit motives!
Imports tend to be recorded on the basis of Cost of Insurance and Freight (CIF), while exports are recorded Free On Board (FOB)

Our approach

Estimate the discrepancies between imports and mirror net exports as a function of both licit and illicit predictors
Perform a harmonization procedure in order to generate a reconciled value that represents our best estimate of what the legitimate value of the trade should be on a FOB basis
Calculate the IFF embedded in each transaction as the difference between the observed value and the reconciled value

What this data-set represents

The data-set presents estimates of mis-invoicing both in imports and exports. I will be working with the import data.
There is a low and a high variant of mis-invoicing estimates. Discussion of how they vary is beyond the scope of this project, but note that we will be working with the high variant exclusively, as those are the official numbers of UNECA.
A negative value represents an illicit inflow, while a positive value represents an illicit outflow.

The panel can be aggregated over several dimensions, e.g. partner country, commodity, year. There are two methods to aggregate up the illicit flow:
- Net basis: simply add up all negative and positive values, so that inflows and outflows cancel each other out
- Gross Excluding Reversals (GER) basis: ignore all inflows, and sum up the positive values only across trading partners

Aggregation strategy

It is important to note that for a given country pair $i$ and $j$ in a given year $t$, the same trade flow can be associated with either an inflow or an outflow according to what commodity is traded. While it might seem unlikely that illicit funds might be traveling in both directions for the same trade flow, there could be a variety of different actors doing this for different reasons. For example, country $i$ might have export taxes on raw materials and export subsidies for manufacturing output, which would give an incentive to under-invoice exports of raw materials (resulting in an illicit outflow) and to over-invoice exports of manufactured goods (resulting in an illicit inflow). Alternatively, a criminal syndicate that has a legitimate front company may use re-invoicing to send money to an affiliate in another country to make an investment (e.g. hiring “muscle” to fight off a competitor) and then bring funds back using exports to the same country when the investment bears fruit.

The object of analytical inquiry should guide the choice of aggregation strategy. For example, stakeholders interested in getting a picture of the total amount of funds departing a country on balance should favor a net aggregation basis. By contrast, stakeholders interested in better understanding the drivers and mechanisms of IFF should favor aggregation using GER in order to identify where money is flowing out or in.

In this project, I will be using unsupervised machine learning to extract insights on analytically relevant dimensions of variation, so I will use summary data-sets that have been aggregated on a Gross Excluding Reversals basis. In other words, I am more interested in the direction and topology of the illicit flows, rather than their magnitude.

Zoom in on Africa

During 2000-2016, the continent lost on average $83 billion a year in gross illicit outflows
Net cumulative flows during that period were $362 billion

Africa-map Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Goals¶

While generating estimates of the dollar value of illicit trade has been helpful to shed light on the severity of the problem, the next step in the analysis is to further understand the nature of the illicit activity in terms of its origins, destinations, and sectors.

Therefore, the goal of this project is to extract meaningful insights on illicit trade using unsupervised machine learning techniques. By doing so, I can identify analytically relevant categories and dimensions of variation, in order to generate hypotheses and guide further work.

This project will apply the following techniques to the data:

Dimension reduction using Principal Components Analysis (PCA)
Clustering
Graph analysis

Data wrangling¶

I will work with two summary forms of the full data-set panel_results.Rdata: one where the data is aggregated up to the reporter country, and one where the dyadic nature of the data is preserved.

In the case where the data is aggregated up to the reporter country, there are two possible views of the feature space. On the one hand, the feature space can be spanned by the sectors of the different commodities ($p=21$). On the other hand, the feature space can be spanned by the reporter African countries ($p=46$).

Mis-invoiced trade for countries (aggregated data)
- View 1: unit of observation = reporter-year; features = sectors
- View 2: unit of observation = sector-year; features = reporters
Bilateral matrix of mis-invoiced trade (dyadic data)
- Unit of observation = reporter-partner-year

Mis-invoiced trade data (for African reporting countries)¶

Let’s read in a summary data-set of the full panel which contains mis-invoicing estimates for reporter $i$ in commodity $c$ in year $t$ (aggregated over partner countries $j$ using the GER strategy).

IFF_Sector = pd.read_csv('Data/GER_Orig_Sect_Year_Africa.csv')

Let’s also extract the relevant labels for the countries in the data-set.

obs_info = IFF_Sector.reset_index().drop_duplicates(['reporter.ISO', 'year'])[['reporter.ISO', 
                                                                               'year', 
                                                                               'reporter', 
                                                                               'rIncome', 
                                                                               'rDev']]
obs_info = obs_info.replace({'LIC': 'Low income', 
                             'LMC': 'Lower-middle income',
                             'UMC': 'Upper-middle income',
                             'HIC': 'High income'})
obs_info = obs_info.rename(columns={'rIncome': 'Income group (World Bank)', 
                                    'rDev': 'Country status (UN)'})

Mis-invoiced trade for countries by sectors

The table below displays the raw data of mis-invoiced trade for 46 African countries in 21 different sectors during 2000-2016, aggregated over partner countries.

		section	Imp_IFF_hi
reporter.ISO	year
DZA	2001	Animal and Animal Products	1.914633e+08
	2001	Pulp of Wood or of Other Fibrous Material	7.436143e+07
	2001	Textiles	3.560644e+07
	2001	Footwear and Headgear	1.146507e+06
	2001	Stone, Glass, and Ceramics	2.935880e+07
...	...	...	...
ZWE	2007	Arms and Ammunition	0.000000e+00
	2009	Works of Art	0.000000e+00
	2013	Miscellaneous Manufactured Articles	0.000000e+00
	2014	Miscellaneous Manufactured Articles	0.000000e+00
	2014	Mineral Products	0.000000e+00

11133 rows × 2 columns

Mis-invoiced trade data (bilateral trade matrix)¶

Now, let’s read in another summary data-set of the full panel which contains mis-invoicing estimates for reporter $i$ with partner $j$ in year $t$ (aggregated over commodities $c$ using the GER strategy).

IFF_Dest = pd.read_csv('Data/GER_Orig_Dest_Year_Africa.csv')

Mis-invoiced trade for dyads

The table below displays the raw data of mis-invoiced trade for 46 African countries with 167 partner countries during 2000-2016, aggregated over commodity sectors.

		partner.ISO	Imp_IFF_hi
reporter.ISO	year
DZA	2001	AND	1.609561e+04
	2001	ARG	4.717459e+07
	2001	AUS	2.027641e+07
	2001	AUT	7.641706e+07
	2001	BEL	2.285729e+07
...	...	...	...
ZWE	2014	MDG	0.000000e+00
	2014	SGP	0.000000e+00
	2014	CHE	0.000000e+00
	2014	ARE	0.000000e+00
	2015	UGA	0.000000e+00

34757 rows × 2 columns

Crosswalk data¶

Finally, let’s import a table that contains labels for the countries, such as their ISO codes, geographic regions, and income groupings that they belong to.

Metadata for countries

crosswalk = pd.read_excel("Data/crosswalk.xlsx").rename(columns={'Country': 'country'})
crosswalk.head()

	ISO3166.3	ISO3166.2	country	UN_Region	UN_Region_Code	UN_Sub-region	UN_Sub-region_Code	UN_Intermediate_Region	UN_Intermediate_Region_Code	UN_M49_Code	...	WB_Income_Group_Code	WB_Region	WB_Lending_Category	WB_Other	Longitude	Latitude
0	ABW	AW	Aruba	Americas	19.0	Latin America and the Caribbean	419.0	Caribbean	29.0	533.0	...	HIC	Latin America and Caribbean	..	NaN	-69.982677	12.520880
1	AFG	AF	Afghanistan	Asia	142.0	Southern Asia	34.0	NaN	NaN	4.0	...	LIC	South Asia	IDA	HIPC	66.004734	33.835231
2	AFG	AF	Afghanistan, Islamic Republic of	Asia	142.0	Southern Asia	34.0	NaN	NaN	4.0	...	LIC	South Asia	IDA	HIPC	66.004734	33.835231
3	AGO	AO	Angola	Africa	2.0	Sub-Saharan Africa	202.0	Middle Africa	17.0	24.0	...	LMC	Sub-Saharan Africa	IBRD	NaN	17.537368	-12.293361
4	AIA	AI	Anguila	Americas	19.0	Latin America and the Caribbean	419.0	Caribbean	29.0	660.0	...	NaN	NaN	NaN	NaN	-63.064989	18.223959

5 rows × 25 columns

Auxiliary functions¶

I create three auxiliary functions to aid me in running Principal Component Analysis:

A function which creates the feature space for PCA from a data-set in a long format (which is the format of IFF_Sector and IFF_Dest)
A function which runs PCA and plots a biplot for two chosen principal components
A function which plots a cumulative scree plot

def create_features(data, values, features, obs):
    """
    Convert data-set in long format to wide and preserve information on year.
    
    data: {IFF_Sector_Imp, IFF_Dest_Imp, IFF_Dest_AFR, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    """
    features_data = data.pivot_table(values=values, 
                                     columns=features, 
                                     index=[obs, 'year'], 
                                     fill_value=0)
    return features_data

# Extra function to run the analysis on average mis-invoicing over the years
def create_features_mean(data, values, features, obs):
    """
    Convert data-set in long format to wide.
    
    data: {IFF_Sector_mean, IFF_Dest_mean, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    """
    features_data = data.pivot_table(values=values, 
                                     columns=features, 
                                     index=obs, 
                                     fill_value=0)
    return features_data

def biplot_PCA(features_data, nPC=2, firstPC=1, secondPC=2, obs='reporter.ISO', show_loadings=False):
    """
    Project the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and color observations
    by class label.

    Args:
        features_data: as Pandas dataframe, data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        obs: string denoting index of class labels (in features_data)
        show_loadings: Boolean indicating whether PCA loadings should be displayed
    Returns:
        plot (interactive)
        pca_loadings (if show_loadings=True)
    """
        
    # Run PCA (standardize data beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Extract PCA loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
                                columns=cols,
                                index=list(features_data.columns))
   
    # Extract PCA scores
    pca_scores = pd.DataFrame(princ_comp, 
                              columns=cols)
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    
    # Generate plot data
    if obs == 'reporter.ISO':
        plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
        color_obs = 'reporter'
        tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
    else:
        plot_data = pca_scores
        color_obs = 'section'
        tooltip_obs = ['section', 'year']

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)

    # Extract top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    
    # Project top 3 loadings over the space spanned by 2 principal components
    lines = alt.Chart().mark_line().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0,1], [toploadings_PC1, toploadings_PC2]):
        lines[i] = alt.Chart(dataset).mark_line(color=color).encode(
        x= PC1 +':Q',
        y= PC2 +':Q',
        detail='index'
    ).properties(
        width=400,
        height=400
    )
    
    # Add labels to the loadings
    text=alt.Chart().mark_text().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0, 1], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text[i] = alt.Chart(dataset).mark_text(
                align='left',
                baseline='bottom',
                color=color
            ).encode(
                x= PC1 +':Q',
                y= PC2 +':Q',
                text='index'
            )
    
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.X(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(color_obs, scale=alt.Scale(scheme='category20b'),
                       legend=alt.Legend(orient='right')),
        tooltip=tooltip_obs
    ).interactive()
    
    # Bind it all together
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    
    chart.display()

    if show_loadings:
        return pca_loadings

def scree_plot(features_data, show_explained_var=False):
    """
    Create a cumulative scree splot and (optional) return the explained variance by each component.
    
    features_data: as Pandas dataframe, the data-set on which to run PCA
    show_explained_var: as Boolean, flag for whether to return explained variance
    """
    
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=features_data_std.shape[1], random_state=234)
    princ_comp = pca.fit_transform(features_data_std)
    
    explained_var = pd.DataFrame({'PC': np.arange(1,features_data_std.shape[1]+1),
                                  'var': pca.explained_variance_ratio_,
                                  'cumvar': np.cumsum(pca.explained_variance_ratio_)})

    # Adapted from https://altair-viz.github.io/gallery/multiline_tooltip.html
    # Create a selection that chooses the nearest point & selects based on x-value
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['PC'], empty='none')

    # The basic line
    line = alt.Chart(explained_var).mark_line(interpolate='basis', color='#FDE725FF').encode(
        alt.X('PC:Q',
            scale=alt.Scale(domain=(1, len(explained_var))),
            axis=alt.Axis(title='Principal Component')
        ),
        alt.Y('cumvar:Q',
            scale=alt.Scale(domain=(min(explained_var['cumvar']), 1)),
            axis=alt.Axis(title='Cumulative Variance Explained')
        ),
    )

    # Transparent selectors across the chart. This is what tells us
    # the x-value of the cursor
    selectors = alt.Chart(explained_var).mark_point().encode(
        x='PC:Q',
        opacity=alt.value(0),
    ).add_selection(
        nearest
    )

    # Draw points on the line, and highlight based on selection
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))
    )

    # Draw text labels near the points, and highlight based on selection
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, 'cumvar:Q', alt.value(' '))
    )

    # Draw a rule at the location of the selection
    rules = alt.Chart(explained_var).mark_rule(color='gray').encode(
        x='PC:Q',
    ).transform_filter(
        nearest
    )

    # Put the five layers into a chart and bind the data
    out = alt.layer(
        line, selectors, points, rules, text
    ).properties(
        title='Cumulative scree plot',
        width=500, height=300
    )
    
    out.display()
    
    if show_explained_var:
        return explained_var[['PC', 'var']]

PCA on feature space (for individual reporting countries)¶

I now run PCA on the data-set IFF_Sector which includes mis-invoiced trade for a country $i$ in commodity $c$ in year $t$.

Sector features¶

One possible view of IFF_Sector is to view the sectors as the feature space where $p=21$. Therefore, I use the function create_features() to generate the data-set to be used in PCA analysis.

sector_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi', 
                                  features='section', obs='reporter.ISO')
sector_features

	section	Animal and Animal Products	Animal or Vegetable Fats and Oils	Arms and Ammunition	Base Metals	Chemicals and Allied Industries	Footwear and Headgear	Machinery and Electrical	Mineral Products	Miscellaneous Manufactured Articles	Pearls, Precious Stones and Metals	...	Precision Instruments	Prepared Foodstuffs	Pulp of Wood or of Other Fibrous Material	Raw Hides, Skins, Leather, and Furs	Stone, Glass, and Ceramics	Textiles	Transportation	Vegetable Products	Wood and Wood Products	Works of Art
reporter.ISO	year
AGO	2009	2.694946e+08	7.652455e+07	67738.261242	1.025920e+09	5.115176e+08	2.920410e+07	1.525512e+09	1.727293e+09	2.472936e+08	5.575633e+05	...	1.242156e+08	4.781184e+08	1.149992e+08	1.097769e+07	1.349018e+08	1.827787e+08	2.198838e+09	4.249722e+08	4.337043e+07	177640.929155
	2010	2.064906e+08	7.311396e+07	0.000000	9.989842e+08	2.575443e+08	7.853065e+06	1.308337e+09	2.953387e+09	8.404302e+07	3.209177e+05	...	1.115183e+08	2.342889e+08	7.332134e+07	4.013044e+06	7.601278e+07	8.285704e+07	9.611656e+08	1.959128e+08	2.310136e+07	404828.231824
	2011	2.995594e+08	8.058180e+07	89365.880480	6.689511e+08	2.931591e+08	1.172881e+07	1.389743e+09	2.286539e+09	4.413341e+07	9.645181e+05	...	1.418956e+08	3.159980e+08	9.911835e+07	9.995669e+06	6.607126e+07	7.912925e+07	7.070474e+08	3.734589e+08	1.363601e+07	586586.948365
	2012	7.103770e+08	2.746924e+08	261265.895948	1.753970e+09	8.540407e+08	3.823724e+07	2.790368e+09	9.203666e+08	1.763904e+08	1.271289e+06	...	2.073111e+08	1.025668e+09	2.641831e+08	1.394356e+07	1.630307e+08	1.683128e+08	1.975553e+09	6.541154e+08	3.980884e+07	739122.050820
	2013	4.031560e+08	1.541679e+08	188343.533148	9.230300e+08	4.793908e+08	1.212363e+07	1.428283e+09	2.009915e+09	4.871315e+07	6.250082e+05	...	1.063123e+08	4.931687e+08	1.798533e+08	8.930246e+06	7.882264e+07	7.858108e+07	4.962937e+09	3.995114e+08	1.836680e+07	418841.640492
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZWE	2011	3.098093e+07	6.060789e+07	9124.882478	1.693777e+08	2.260227e+09	2.817909e+06	3.231869e+08	2.747704e+08	9.872441e+06	3.618592e+06	...	2.136919e+07	1.144497e+08	5.855387e+07	6.259512e+05	2.652474e+07	4.726781e+07	8.963644e+08	1.874874e+08	6.016952e+06	35481.251510
	2012	2.784481e+07	5.549232e+07	856.631347	1.085453e+08	4.929549e+08	7.015919e+06	3.451931e+08	1.461844e+08	1.510668e+07	2.876014e+05	...	3.162707e+07	2.025934e+08	6.024629e+07	1.412797e+06	2.482629e+07	5.147542e+07	9.468142e+08	1.900125e+08	7.901902e+06	59677.045599
	2013	0.000000e+00	0.000000e+00	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000
	2014	0.000000e+00	0.000000e+00	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	...	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00	0.000000
	2015	1.458046e+07	7.135043e+07	19192.507302	8.703951e+07	2.299085e+08	4.468801e+06	3.437244e+08	1.401916e+08	2.432263e+07	3.672531e+07	...	3.233606e+07	3.778318e+07	3.157795e+07	1.061420e+06	1.724449e+07	2.841926e+07	2.137692e+08	1.585549e+08	4.569325e+06	6305.070602

624 rows × 21 columns

Biplots¶

The figure below displays a biplot of the first 2 principal components, in addition to reporting the loadings for all principal components in the following table.

biplot_PCA(sector_features, 10, 1, 2, obs='reporter.ISO', show_loadings=True)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10
Animal and Animal Products	0.223097	-0.285686	-0.052218	0.200337	0.280789	-0.075192	-0.159824	0.157986	-0.133621	0.067581
Animal or Vegetable Fats and Oils	0.145805	-0.041750	0.262100	-0.281551	-0.378544	-0.470151	0.409970	0.471670	0.177591	-0.031639
Arms and Ammunition	0.030559	-0.044358	0.524677	-0.654182	0.434354	0.055627	-0.201368	-0.113006	-0.079142	-0.115541
Base Metals	0.235966	-0.206534	-0.261180	-0.147777	-0.019680	-0.153272	0.017922	-0.247084	0.174715	-0.069873
Chemicals and Allied Industries	0.271238	0.107591	-0.117222	-0.089472	-0.096976	-0.130645	-0.071165	-0.219172	-0.198166	0.036517
Footwear and Headgear	0.206357	0.378112	0.015170	-0.019961	-0.001383	-0.029500	-0.042735	0.010544	0.026510	-0.252435
Machinery and Electrical	0.272101	0.160466	0.076229	0.091921	-0.067637	-0.159494	0.022549	-0.135941	-0.157103	-0.115312
Mineral Products	0.238785	0.071925	-0.098357	-0.201011	-0.166463	0.073908	-0.133348	0.121464	-0.241190	0.675042
Miscellaneous Manufactured Articles	0.246123	0.283544	0.122775	0.033592	-0.049743	-0.014818	-0.017630	0.069480	-0.247438	-0.043076
Pearls, Precious Stones and Metals	0.060193	0.251657	-0.373387	-0.153599	0.567863	0.055855	0.647708	0.088517	-0.049224	0.089860
Plastics and Rubbers	0.264093	-0.144764	0.124109	0.236111	0.112342	0.023331	0.062706	0.076398	0.119468	-0.076649
Precision Instruments	0.223754	0.362288	-0.013533	0.023957	-0.084425	-0.090796	-0.080419	-0.186415	-0.262231	-0.078130
Prepared Foodstuffs	0.252722	-0.227483	-0.012139	0.160146	0.177798	-0.024717	-0.088546	0.289203	-0.187744	-0.058613
Pulp of Wood or of Other Fibrous Material	0.272304	-0.112890	-0.010570	0.074023	0.024044	-0.016865	0.003323	-0.123810	0.135842	0.200798
Raw Hides, Skins, Leather, and Furs	0.206454	0.084348	0.178427	0.121865	-0.101317	0.619649	0.102592	0.287981	-0.057302	-0.251296
Stone, Glass, and Ceramics	0.225603	0.111290	0.264504	0.269760	0.061210	-0.018303	0.176815	-0.368090	0.400752	-0.018317
Textiles	0.209936	-0.103101	-0.054321	-0.264334	-0.273480	0.542184	0.126078	-0.076346	0.216014	0.139611
Transportation	0.238443	-0.119628	0.256225	0.081261	0.184539	-0.059237	0.015553	-0.095856	0.219874	0.334486
Vegetable Products	0.245486	-0.284943	-0.122225	-0.029458	0.071482	-0.036325	-0.074223	0.202113	-0.106308	-0.240274
Wood and Wood Products	0.201178	-0.222154	-0.377723	-0.300290	-0.132165	0.007197	-0.048891	-0.185469	0.074259	-0.351532
Works of Art	0.089055	0.388758	-0.239308	-0.078967	0.157495	-0.041382	-0.486227	0.369405	0.561157	0.018339

The first two principal components can be interpreted as recovering variation in the size of the illicit flow. The top loadings on the first principal component include Machinery and Electrical, and Chemicals and Allied Industries, which are the sectors with the highest amount of mis-invoiced trade. The treemap figure below represents the average gross yearly outflow to different sectors in Africa, and we can see that the aforementioned sectors account for a large chunk of that.

By contrast, the loadings on the second component are associated with sectors which are specialized, precision sectors that do not account for a large amount of mis-invoicing, compared to others. Indeed, Works of Art, and Footwear and Headgear are in the top-right corner of the treemap figure (and do not even appear with labels as they are so small).

Thus, it seems like the magnitude of the trade mis-invoicing (in dollar values) explains the highest variability in the data.

treemap-sectors Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

biplot_PCA(sector_features, 10, 3, 4, obs='reporter.ISO')

The sixth principal component (displayed in the figure below) captures variation with respect to the natural origin of the commodity being traded and its functional use. The top loadings include sectors where the commodity is generated from natural resources (e.g. animals or vegetables) and which is functionally used in clothing, cooking, and cosmetics (e.g. textiles, leathers, oils).

biplot_PCA(sector_features, 10, 5, 6, obs='reporter.ISO')

Explained variance¶

The first 6 principal components examined above explain 83% of the variance in the data. The first principal component, which captures magnitude of the flow, accounts of 52% of the variance alone.

scree_plot(sector_features, show_explained_var=True)

	PC	var
0	1	0.524988
1	2	0.121415
2	3	0.054159
3	4	0.049898
4	5	0.042107
5	6	0.041007
6	7	0.036014
7	8	0.028847
8	9	0.024146
9	10	0.018279
10	11	0.012558
11	12	0.009987
12	13	0.007011
13	14	0.006890
14	15	0.005213
15	16	0.004848
16	17	0.004001
17	18	0.002819
18	19	0.002489
19	20	0.001800
20	21	0.001524

Variance-stabilizing and normalizing transformations¶

The figure below displays the distribution of mis-invoicing for the 21 different sectors. It is heavily right-skewed across all sectors, with very few high-value mis-invoiced transactions.

_images/Unsupervised-learning-trade_82_0.png

I explore two possible transformations of the data: a (modified) log transformation and a Yeo–Johnson. The values of the data-set are all non-negative, with some values equal to 0. Therefore, the figure below displays the distribution of the data after taking $log(x+1)$.

sector_features_log = sector_features.apply(lambda x: np.log(x+1) if np.issubdtype(x.dtype, np.number) else x, axis=0)

_images/Unsupervised-learning-trade_85_0.png

The Yeo-Johnson transformation is similar to a Box-Cox transformation, but allows for zero values. When $x=0$ and the tuning parameter $\lambda=0$, it will apply the same modified log transformation, $log(x=1)$, as described above.

sector_features_yeo = power_transform(sector_features, method='yeo-johnson', standardize=True)

sector_features_yeo = pd.DataFrame(sector_features_yeo,
                                   index=sector_features.index,
                                   columns=sector_features.columns)
fig, axes = joypy.joyplot(sector_features_yeo, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors (Yeo–Johnson transformation)');

_images/Unsupervised-learning-trade_88_0.png

The two figures below display the results of running the PCA analysis on transformations of the data using a modified log and a Yeo-Johnson transformation, respectively.

The top loadings on the second principal component are the same and refer to Works of Art, Pearls, Precious Stones and Metals, and Arms and Ammunitions. These sectors refer to extremely high value-added sectors where the traded commodity is a finished product.

By contrast, the loadings on the first principal component (in either transformation) refer to heavy industries, where the commodity is an intermediate product (as opposed to a finished product). Even after applying normalizing transformations, the first principal component recovers sectors which were responsible for a large amount of the dollar value of mis-invoicing in the continent during 2000-2016.

biplot_PCA(sector_features_log, 10, 1, 2, obs='reporter.ISO')

biplot_PCA(sector_features_yeo, 10, 1, 2, obs='reporter.ISO')

Finally, transforming the data improves the performance of the PCA analysis, in the sense that the first two principal components now explain 90% of the variance in the data.

scree_plot(sector_features_yeo)

Country features¶

The second view of IFF_Sector is to consider the reporter countries as the feature space. Below, I use the function create_features() and specify reporter.ISO as the features in order to generate the data-set for PCA. In this case, we have $p=46$.

The interpretation in this application is to understand the variation in where the illicit outflows originate. This is because the values have been aggregated using the GER strategy which captures outflows only.

country_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi',
                                   features='reporter.ISO', obs='section')
country_features

	reporter.ISO	AGO	BDI	BEN	BFA	BWA	CAF	CIV	CMR	COG	COM	...	STP	SWZ	SYC	TGO	TUN	TZA	UGA	ZAF	ZMB	ZWE
section	year
Animal and Animal Products	2000	0.000000	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	0.000000	0.000000e+00	0.000000e+00	0.00000	0.0	...	0.000000	0.000000	0.000000e+00	4.044600e+06	1.572640e+07	1.060197e+06	2.902091e+05	2.341209e+07	0.000000e+00	0.000000e+00
	2001	0.000000	0.000000	9.464068e+06	2.013378e+06	4.018742e+06	351958.882652	7.103384e+07	2.552567e+07	0.00000	0.0	...	0.000000	0.000000	3.439935e+07	1.087398e+07	0.000000e+00	0.000000e+00	2.522664e+05	1.854458e+07	1.832385e+06	0.000000e+00
	2002	0.000000	288779.572593	1.458450e+07	7.756380e+05	3.524820e+06	213560.077651	7.767896e+07	2.907597e+07	0.00000	0.0	...	0.000000	0.000000	0.000000e+00	6.574580e+06	9.664435e+06	0.000000e+00	6.944404e+05	1.452489e+07	4.354400e+05	7.346496e+06
	2003	0.000000	217920.960357	2.297721e+07	4.533317e+05	0.000000e+00	0.000000	1.286866e+08	3.485783e+07	0.00000	0.0	...	0.000000	0.000000	0.000000e+00	6.290272e+06	1.732785e+07	1.294665e+06	1.358896e+06	2.153838e+07	1.422125e+06	0.000000e+00
	2004	0.000000	0.000000	2.742553e+07	3.429694e+05	5.275531e+04	0.000000	9.283012e+07	4.940023e+07	0.00000	0.0	...	0.000000	59632.763605	0.000000e+00	3.901862e+06	3.574647e+07	0.000000e+00	8.053124e+05	2.868016e+07	2.791570e+06	0.000000e+00
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Works of Art	2012	739122.050820	0.000000	0.000000e+00	1.054676e+04	2.623004e+05	0.000000	1.820478e+04	3.875133e+05	0.00000	0.0	...	0.000000	0.000000	0.000000e+00	1.125696e+03	0.000000e+00	7.252236e+04	1.388514e+04	1.371564e+07	3.453691e+04	5.967705e+04
	2013	418841.640492	0.000000	0.000000e+00	0.000000e+00	2.282032e+04	0.000000	1.287629e+04	2.104949e+04	0.00000	0.0	...	9584.734907	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	5.191721e+05	2.986963e+04	1.006657e+07	1.017682e+05	0.000000e+00
	2014	390140.730849	0.000000	0.000000e+00	0.000000e+00	9.331590e+04	0.000000	1.017252e+06	1.826816e+05	8003.37246	0.0	...	20941.755221	0.000000	0.000000e+00	0.000000e+00	0.000000e+00	6.731774e+04	2.129568e+04	2.543377e+07	0.000000e+00	0.000000e+00
	2015	0.000000	0.000000	0.000000e+00	0.000000e+00	5.977903e+04	0.000000	0.000000e+00	0.000000e+00	0.00000	0.0	...	4447.368931	0.000000	5.857979e+05	0.000000e+00	3.088086e+03	5.385830e+05	1.193596e+04	6.697451e+06	5.843484e+04	6.305071e+03
	2016	0.000000	0.000000	0.000000e+00	0.000000e+00	4.195233e+04	0.000000	0.000000e+00	0.000000e+00	0.00000	0.0	...	1327.652396	0.000000	4.463501e+05	0.000000e+00	0.000000e+00	3.443084e+04	2.305443e+03	1.295357e+07	0.000000e+00	0.000000e+00

357 rows × 46 columns

Biplots¶

The three figures below display biplots for the first and second principal components, for the third and fourth principal components, and for the fifth and sixth principal components, respectively. The loadings for each reporter country are also reported.

biplot_PCA(country_features, 10, 1, 2, obs='section', show_loadings=True)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10
AGO	0.172866	5.695754e-02	1.915290e-01	-6.571267e-02	-1.498597e-02	2.683983e-01	-6.131957e-02	-9.901379e-02	-1.352142e-02	1.960619e-01
BDI	0.159396	1.399172e-02	1.100982e-01	1.072287e-01	-5.780584e-02	-3.006342e-02	1.311807e-01	-6.306715e-02	8.380471e-02	-1.595603e-01
BEN	0.166261	2.839527e-01	-4.853377e-02	-1.227506e-01	-8.929173e-02	2.840864e-02	-5.069704e-02	-1.050768e-01	1.115173e-01	-1.479526e-01
BFA	0.200912	1.961349e-01	-1.479219e-02	-3.284903e-02	-7.687652e-03	1.947496e-01	-8.961797e-03	-9.812748e-02	5.298346e-02	-6.100222e-02
BWA	0.058704	5.101605e-04	1.048248e-01	1.176491e-01	1.116177e-01	4.542630e-02	5.291913e-02	1.702242e-01	1.074927e-01	6.990029e-01
CAF	0.136213	-2.941790e-01	7.144585e-02	1.466869e-01	1.005191e-01	-8.211151e-04	1.767076e-02	-9.841998e-02	4.459720e-02	-6.455230e-02
CIV	0.170157	-4.190252e-02	-1.992766e-01	-1.178971e-01	-8.139640e-02	2.099028e-01	1.481461e-01	-9.124417e-02	9.430036e-02	-2.959315e-02
CMR	0.199563	6.685309e-03	2.723196e-02	-1.591573e-01	-1.158744e-01	4.651018e-02	4.997217e-02	-4.900403e-02	6.736277e-02	-4.601119e-02
COG	0.092230	-1.314851e-01	3.146417e-01	-1.361157e-01	-2.737181e-01	2.682128e-01	-4.522047e-02	1.802042e-02	4.328099e-02	3.111778e-02
COM	0.096409	-2.063490e-01	2.921122e-01	-2.226963e-01	-7.053853e-02	-1.456883e-01	8.379586e-02	-6.576844e-02	9.280223e-02	-1.472937e-01
CPV	0.162562	-1.932140e-01	3.911200e-02	-6.076881e-02	6.992487e-02	3.542767e-02	2.890140e-01	-1.259045e-01	5.085305e-02	-4.426356e-02
DZA	0.097941	4.125012e-02	-3.694645e-02	4.185959e-01	-2.291923e-01	-2.224267e-01	-7.905464e-02	8.880404e-02	-9.011562e-02	-5.841375e-02
EGY	0.175564	1.852732e-01	9.778806e-02	1.870187e-01	8.946189e-03	-2.924499e-03	1.181851e-02	-4.328654e-02	4.339753e-03	-1.929171e-01
ETH	0.121994	-1.897909e-01	7.695418e-02	3.167105e-01	2.484336e-01	-1.802844e-01	-7.872203e-02	-9.597359e-02	-2.026803e-02	-1.094738e-01
GAB	0.042052	-1.989049e-01	-3.375917e-01	1.247060e-02	-2.588424e-01	-4.558121e-02	8.313195e-02	9.621236e-02	-7.997286e-02	-1.514322e-01
GHA	0.107784	-1.996829e-01	-6.503997e-02	6.007675e-02	-3.272539e-01	2.451643e-01	4.926298e-02	2.807364e-01	-3.340735e-01	-5.059900e-03
GIN	0.052400	-1.350973e-01	-3.351174e-01	-8.155761e-02	-4.401417e-02	-4.334381e-02	-3.444465e-01	-1.967460e-02	1.193137e-01	-2.723141e-02
GMB	0.162689	8.598690e-02	-9.929674e-02	-2.316451e-01	2.262847e-01	1.330992e-01	2.662606e-01	1.742886e-01	-9.401992e-02	-6.012329e-02
GNB	0.000016	-1.996604e-02	-8.847847e-02	-2.131943e-02	6.445089e-02	9.726163e-02	-2.861441e-01	3.631337e-01	5.592464e-01	-2.128313e-01
KEN	0.117135	-1.733487e-01	-2.826062e-01	-1.414929e-01	-4.852900e-02	1.122855e-01	-1.773947e-01	-2.061400e-01	-2.190356e-01	1.721080e-01
LBY	0.000000	0.000000e+00	-0.000000e+00	1.654361e-24	1.355253e-20	0.000000e+00	0.000000e+00	2.081668e-17	-2.220446e-16	-9.714451e-17
LSO	0.031001	-1.145362e-01	1.172217e-01	-8.783260e-02	-6.316290e-02	1.845925e-01	-1.155909e-01	4.273390e-01	1.846813e-01	-2.597209e-02
MAR	0.196450	-1.442017e-03	-1.407898e-01	-1.151253e-01	4.096951e-02	-2.501230e-01	-1.218791e-01	-1.159986e-01	-2.772756e-02	9.286997e-02
MDG	0.187159	1.768800e-01	-7.603241e-02	1.954600e-02	3.882583e-02	-6.632947e-02	1.303028e-02	1.697959e-01	3.997142e-02	-1.389174e-01
MLI	0.160911	-1.212446e-01	-4.690084e-02	-1.486856e-01	8.742975e-02	-2.762972e-01	8.729569e-02	2.434701e-01	-8.316575e-02	3.368757e-02
MOZ	0.147230	1.608981e-01	9.473738e-02	-1.983947e-02	-2.406356e-01	-2.654582e-01	-2.507807e-03	5.991142e-02	9.261240e-02	1.184466e-01
MRT	0.142351	2.001775e-01	7.014463e-02	-1.332407e-01	-2.101042e-01	-3.003770e-01	-1.237907e-01	-9.258926e-02	7.149717e-02	1.249696e-01
MUS	0.183529	7.441746e-02	-1.253785e-01	7.372575e-02	1.275615e-01	4.308231e-02	8.698864e-02	1.594759e-01	3.117765e-02	8.570793e-02
MWI	0.196370	1.208545e-01	-1.804584e-02	-1.462830e-01	2.430288e-02	7.000297e-02	-7.395843e-02	-3.780468e-02	-6.553461e-02	-1.841761e-02
NAM	0.173960	1.201903e-01	5.978034e-02	9.556719e-02	2.108371e-01	-2.454492e-02	2.251590e-01	1.968563e-01	-5.019903e-04	1.281223e-01
NER	0.191899	-1.071062e-01	1.036822e-01	-3.490165e-02	8.799988e-03	4.101665e-02	-1.036213e-01	-1.501394e-01	5.717845e-02	7.629436e-02
NGA	0.153000	-8.916565e-02	2.197113e-01	2.082393e-02	-2.395585e-01	-1.659439e-01	-1.019044e-01	1.527359e-01	-3.809505e-02	8.557721e-02
RWA	0.186734	-3.747262e-02	1.608489e-01	2.275355e-01	3.285765e-02	9.220145e-02	-1.415166e-01	-1.482959e-01	-3.582179e-03	2.383508e-02
SDN	0.122997	-2.804786e-01	1.221158e-01	-1.345090e-01	-5.996290e-02	-1.354197e-01	-2.219003e-02	1.378054e-01	-7.715869e-02	7.151525e-04
SEN	0.198048	-3.415201e-02	-1.017232e-01	-1.030800e-01	-1.082653e-01	-2.190300e-02	-1.642827e-02	-9.561341e-02	-9.388535e-02	5.110459e-02
SLE	0.000000	-2.281917e-34	3.682813e-32	6.034292e-30	-1.730867e-25	-3.631998e-25	-1.524001e-23	1.682666e-21	-1.953635e-20	-7.252720e-21
STP	0.093915	2.692090e-01	7.292042e-02	2.717052e-01	-2.160666e-01	2.404257e-01	-1.858177e-02	-4.529530e-03	-4.636809e-02	-6.325914e-02
SWZ	0.004459	-9.037762e-02	-2.172397e-01	4.529566e-02	-1.475804e-01	-4.240062e-03	1.559841e-01	-2.357850e-01	5.362103e-01	2.615336e-01
SYC	0.012375	-7.886680e-02	-1.900444e-01	2.392704e-01	-2.750048e-01	-4.851200e-02	5.049505e-01	2.850039e-02	1.579691e-01	1.255474e-02
TGO	0.184941	8.829184e-02	-4.195564e-02	1.516059e-02	-3.407124e-02	-7.070676e-02	-2.815878e-02	-5.328573e-03	-7.180032e-03	-5.552878e-02
TUN	0.189500	-1.490447e-01	-5.294429e-02	1.368356e-01	1.828756e-01	1.829014e-02	-1.057055e-01	1.539954e-01	1.181978e-02	-9.518775e-04
TZA	0.198557	1.129045e-01	-8.978178e-02	-1.479353e-01	1.914884e-01	-8.737537e-02	1.348190e-01	8.895631e-02	-7.496216e-02	2.430973e-02
UGA	0.220482	1.310309e-01	-7.584076e-02	3.261941e-02	7.776106e-02	2.816737e-02	-2.272271e-02	3.592921e-03	-2.836216e-02	-7.275351e-02
ZAF	0.190412	-7.411430e-02	-1.422615e-01	1.624323e-01	5.822959e-02	-5.136024e-02	-2.069728e-01	5.163535e-03	-1.084788e-02	1.159689e-01
ZMB	0.150023	-1.782886e-01	-7.122649e-02	2.088026e-01	1.466777e-01	2.720530e-01	-5.921551e-02	-1.225385e-01	6.453899e-03	-3.512622e-02
ZWE	0.098964	-1.998928e-01	2.092424e-01	-4.305728e-02	1.206815e-01	-8.504574e-02	1.333520e-01	-1.496703e-01	1.545886e-01	-1.980936e-01

biplot_PCA(country_features, 10, 3, 4, obs='section')

biplot_PCA(country_features, 10, 5, 6, obs='section')

The interpretation should be carried out cautiously. The top loadings on the first principal components are Uganda, followed by Burkina Faso and Cameroon. Within Africa, Uganda is the country which experienced the highest gross outflows as a proportion of its total trade over the period 2000-2016, and the second highest as a proportion of its GDP over that same period (see figure below).

percent-trade Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF) percent-GDP Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF)

It is interesting to note that the PCA algorithm does not identify the countries which had the highest average dollar value of illicit outflows during 2000-2016, which were South Africa and Angola. This is because the data has been standardized to have mean 0 and unit variance, as is common when conducting PCA, since PCA is sensitive to outliers.

Rather, the PCA algorithm recovers in the top loadings for the most important principal components many of the countries that were among the top origins for outflows relative to trade or GDP, such as Burkina Faso, Ethiopia, Malawi, Seychelles, Mali, and Zimbabwe, even though the algorithm was run on (standardized) dollar values of outflows.

Explained variance¶

The first 6 principal components explain 63% of the variance in the data, with the first principal component explaining 38% of the variance alone.

scree_plot(country_features, show_explained_var=True)

	PC	var
0	1	3.763510e-01
1	2	7.215794e-02
2	3	5.939846e-02
3	4	5.123519e-02
4	5	3.787572e-02
5	6	3.396778e-02
6	7	3.106831e-02
7	8	2.638919e-02
8	9	2.585997e-02
9	10	2.268292e-02
10	11	2.200332e-02
11	12	2.037173e-02
12	13	1.925219e-02
13	14	1.910017e-02
14	15	1.719724e-02
15	16	1.561816e-02
16	17	1.346239e-02
17	18	1.260948e-02
18	19	1.150043e-02
19	20	1.118470e-02
20	21	1.060775e-02
21	22	8.985424e-03
22	23	8.762526e-03
23	24	8.330893e-03
24	25	6.857884e-03
25	26	6.636890e-03
26	27	5.660594e-03
27	28	5.216669e-03
28	29	4.582342e-03
29	30	4.137391e-03
30	31	3.673426e-03
31	32	3.514407e-03
32	33	3.274164e-03
33	34	3.004834e-03
34	35	2.840875e-03
35	36	2.611974e-03
36	37	2.263592e-03
37	38	2.120142e-03
38	39	1.688935e-03
39	40	1.536694e-03
40	41	1.461075e-03
41	42	1.319987e-03
42	43	9.527948e-04
43	44	6.724485e-04
44	45	1.776402e-33
45	46	1.776402e-33

Variance-stabilizing and normalizing transformations¶

The figures below display the distribution of the mis-invoiced trade in each of the features of the feature space (after dropping those countries with 0 outflows in all sectors).

# Figure out which columns are composed of all 0s
country_features.apply(lambda x: (x == 0).all(), axis=0)

reporter.ISO
AGO    False
BDI    False
BEN    False
BFA    False
BWA    False
CAF    False
CIV    False
CMR    False
COG    False
COM    False
CPV    False
DZA    False
EGY    False
ETH    False
GAB    False
GHA    False
GIN    False
GMB    False
GNB    False
KEN    False
LBY     True
LSO    False
MAR    False
MDG    False
MLI    False
MOZ    False
MRT    False
MUS    False
MWI    False
NAM    False
NER    False
NGA    False
RWA    False
SDN    False
SEN    False
SLE     True
STP    False
SWZ    False
SYC    False
TGO    False
TUN    False
TZA    False
UGA    False
ZAF    False
ZMB    False
ZWE    False
dtype: bool

# Drop the corresponding features
country_features = country_features.drop(columns=['LBY', 'SLE'])

_images/Unsupervised-learning-trade_112_0.png

_images/Unsupervised-learning-trade_112_1.png

Again, I explore the consequences of applying a modified log transformation and a Yeo-Johnson transformation, respectively.

country_features_log = country_features.apply(lambda x: np.log(x+1) if np.issubdtype(x.dtype, np.number) else x, axis=0)

_images/Unsupervised-learning-trade_115_0.png

_images/Unsupervised-learning-trade_115_1.png

country_features_yeo = power_transform(country_features, method='yeo-johnson', standardize=True)

_images/Unsupervised-learning-trade_117_0.png

_images/Unsupervised-learning-trade_117_1.png

The top loadings on the first principal component, in both transformations, capture an origin hub located in the north-west of the continent with Benin, Niger, and Senegal featuring heavily.

The top loadings on the second principal component capture a south-southeast axis which runs from Gabon to Angola to Swaziland (now called Eswatini).

biplot_PCA(country_features_log, 10, 1, 2, obs='section')

biplot_PCA(country_features_yeo, 10, 1, 2, obs='section')

PCA on bilateral trade matrix¶

The previous sections explored relevant dimensions of variation by considering the feature space spanned by the origin of the illicit outflows, whether it be the source countries or the source sectors.

Now, I turn to exploring the relevant dimensions of illicit flows by analyzing the feature space spanned by the destination of those illicit outflows.

Let’s apply the create_features() function to the summary data-set IFF_Dest, where we have $p=167$.

partner_features = create_features(IFF_Dest_Imp, 'Imp_IFF_hi', 
                                   features='partner.ISO', obs='reporter.ISO')
partner_features

	partner.ISO	AGO	ALB	AND	ARE	ARG	ARM	ATG	AUS	AUT	AZE	...	URY	USA	VCT	VEN	VNM	VUT	YEM	ZAF	ZMB	ZWE
reporter.ISO	year
AGO	2009	0.0	0.0	0.0	0.000000e+00	6.014820e+07	0.0	0	6.980492e+06	3.119279e+05	0.000000	...	2.361993e+06	9.270820e+08	0	0.0	4.734647e+07	0	0.000000	1.016827e+09	1.060946e+05	309894.487818
	2010	0.0	0.0	0.0	0.000000e+00	2.963158e+07	0.0	0	2.971282e+06	5.626929e+06	18935.636990	...	1.644130e+06	8.268511e+08	0	0.0	2.970677e+07	0	0.000000	3.315930e+08	2.665205e+04	420951.156359
	2011	0.0	0.0	0.0	0.000000e+00	5.044258e+07	0.0	0	9.077825e+06	2.409418e+06	20266.842796	...	1.016561e+06	8.601995e+08	0	0.0	2.451965e+07	0	0.000000	3.281534e+08	4.525653e+05	0.000000
	2012	0.0	0.0	0.0	0.000000e+00	2.055669e+08	0.0	0	7.483480e+06	1.354856e+07	22457.788787	...	1.582490e+06	1.441162e+09	0	0.0	5.942382e+07	0	354866.063494	7.944924e+08	1.180964e+03	0.000000
	2013	0.0	0.0	0.0	0.000000e+00	5.839522e+07	0.0	0	1.569576e+07	8.770795e+06	1734.698687	...	3.447835e+06	7.139003e+08	0	0.0	3.931132e+07	0	0.000000	4.272753e+08	2.855892e+05	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZWE	2011	0.0	0.0	0.0	5.290029e+07	2.310119e+07	0.0	0	2.325259e+06	3.171896e+05	0.000000	...	2.721346e+04	6.462989e+08	0	0.0	5.288219e+06	0	0.000000	3.088036e+09	5.016118e+07	0.000000
	2012	0.0	0.0	0.0	5.923175e+07	1.011059e+07	0.0	0	1.664879e+06	1.220841e+06	0.000000	...	0.000000e+00	6.069423e+08	0	0.0	4.745996e+06	0	0.000000	1.242790e+09	1.704507e+08	0.000000
	2013	0.0	0.0	0.0	0.000000e+00	0.000000e+00	0.0	0	0.000000e+00	0.000000e+00	0.000000	...	0.000000e+00	0.000000e+00	0	0.0	0.000000e+00	0	0.000000	0.000000e+00	0.000000e+00	0.000000
	2014	0.0	0.0	0.0	0.000000e+00	0.000000e+00	0.0	0	0.000000e+00	0.000000e+00	0.000000	...	0.000000e+00	0.000000e+00	0	0.0	0.000000e+00	0	0.000000	0.000000e+00	0.000000e+00	0.000000
	2015	0.0	0.0	0.0	5.042948e+07	0.000000e+00	0.0	0	5.764013e+05	6.721992e+05	0.000000	...	9.012925e+04	4.109564e+07	0	0.0	3.093477e+03	0	0.000000	6.474374e+08	0.000000e+00	0.000000

624 rows × 167 columns

Biplots¶

The figure below displays the first and second principal components. It appears that the first principal component recovers the top destinations of illicit outflows, as the first loading corresponds to China which was the top average destination for illicit outflows from Africa during 2000-2016, and the second loading corresponds to the United States which was the second largest destination.

biplot_PCA(partner_features, 10, 1, 2, show_loadings=True)

	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10
AGO	0.064538	-0.127775	-0.044675	-0.081190	0.015199	-0.030508	-0.020684	-0.075254	0.026087	0.018444
ALB	0.044858	0.100240	-0.085332	0.010378	0.222166	-0.056160	0.016405	-0.068677	0.098802	-0.006931
AND	0.002434	0.006593	-0.006567	0.005445	-0.015707	0.021064	-0.006253	-0.004508	0.001468	-0.016877
ARE	0.091433	-0.038099	0.067933	-0.046062	0.123878	-0.061712	-0.030800	0.190913	-0.147428	0.060691
ARG	0.129513	0.139207	-0.082968	0.019367	-0.043149	0.085106	-0.014389	0.038525	0.024086	0.000942
...	...	...	...	...	...	...	...	...	...	...
VUT	0.000000	0.000000	0.000000	0.000000	-0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
YEM	0.040005	-0.060918	0.007451	0.143530	0.037673	0.037506	0.085598	0.261330	0.309931	-0.017295
ZAF	0.008847	0.007834	0.091261	-0.022447	0.026785	0.001632	-0.043753	0.104236	-0.037778	0.124183
ZMB	0.040152	0.020620	-0.037492	0.009481	0.093282	-0.176393	-0.033166	0.069316	-0.109787	0.018559
ZWE	0.058102	-0.112442	0.020694	0.215291	0.042842	0.010566	0.044425	0.149634	0.182897	0.017658

167 rows × 10 columns

top-destinations Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

The top loadings on the other principal components reveal some unexpected sources of variation for destination countries.

The top loadings are:

Second principal component:
- Top loading 1: El Salvador
- Top loading 2: Dominican Republic
- Top loading 3: Nepal
Third principal component:
- Top loading 1: Belgium
- Top loading 2: Norway
- Top loading 3: Panama
Fourth principal component:
- Top loading 1: Iran
- Top loading 2: Suriname
- Top loading 3: Papua New Guinea
Fifth principal component:
- Top loading 1: Central African Republic
- Top loading 2: Kuwait
- Top loading 3: Albania
Sixth principal component:
- Top loading 1: Ecuador
- Top loading 2: Slovenia
- Top loading 3: Algeria

These dimensions of variation are much harder to interpret. Nevertheless, a few points stand out.

First, apart from the first principal component, the loadings do not appear to be driven by the magnitude of the illicit outflows. Second, the loadings recovered are surprising. One “usual suspect” which appears is Panama (on the third principal component) which is one of the oldest and best-known tax haven in the Americas. Third, this suggests that the analysis of the destinations of illicit outflows from Africa merits much deeper analysis to “follow the money” (which is outside the scope of this project).

biplot_PCA(partner_features, 10, 3, 4)

biplot_PCA(partner_features, 10, 5, 6)

Explained variance¶

With a feature space of $p=167$, 12 principal components are required to explain at least 50% of the variance in the data.

scree_plot(partner_features)

Variance-stabilizing and normalizing transformations¶

The sections below explore the effect of variance-stabilizing transformations on the partner_features data-set.

Korea and Nicaragua appear as top loadings on the first and second principal components, respectively.

When running PCA on the transformed data, the projection of the principal component scores on the first two principal components takes on a half-moon shape.

Expand the code chunks to see the details.

# Figure out which columns are composed of all 0s
partner_features.apply(lambda x: (x == 0).all(), axis=0)
noIFFpartner = partner_features.loc[:,partner_features.apply(lambda x: (x == 0).all(), axis=0)].columns.tolist()
noIFFpartner

['ATG',
 'BLZ',
 'BMU',
 'DMA',
 'GNB',
 'GRD',
 'KGZ',
 'KNA',
 'LCA',
 'SLB',
 'TON',
 'VCT',
 'VUT']

# Drop the corresponding features
partner_features = partner_features.drop(columns=noIFFpartner)

# Plot the distribution of illicit flows for a random sample of partner countries
fig, axes = joypy.joyplot(partner_features.sample(n=15, axis=1, random_state=234), 
                          colormap=plt.cm.rainbow, figsize=(8,8),
                          title='Distribution of mis-invoicing across random sample of partner countries');

_images/Unsupervised-learning-trade_139_0.png

# Apply a modified log transformation
partner_features_log = partner_features.apply(lambda x: np.log(x+1) if np.issubdtype(x.dtype, np.number) else x, axis=1)

# Plot the distribution of illicit flows for the transformed data
fig, axes = joypy.joyplot(partner_features_log.sample(n=15, axis=1, random_state=234), 
                          colormap=plt.cm.rainbow, figsize=(8,8),
                          title='Distribution of mis-invoicing across random sample of partner countries (log transformation)');

_images/Unsupervised-learning-trade_141_0.png

# Biplot for the first 2 principal components for the transformed data (modified log)
biplot_PCA(partner_features_log, 10, 1, 2)

# Apply a modified Yeo-Johnson transformation
partner_features_yeo = power_transform(partner_features, method='yeo-johnson', standardize=True)

# Plot the distribution of illicit flows for the transformed data
partner_features_yeo = pd.DataFrame(partner_features_yeo,
                                    index=partner_features.index,
                                    columns=partner_features.columns)
random.seed(234)
fig, axes = joypy.joyplot(partner_features_yeo.sample(n=15, axis=1, random_state=234), 
                          colormap=plt.cm.rainbow, figsize=(8,8),
                          title='Distribution of mis-invoicing across random sample of partner countries (Yeo-Johnson transformation)');

_images/Unsupervised-learning-trade_144_0.png

# Biplot for the first 2 principal components for the transformed data (Yeo-Johnson)
biplot_PCA(partner_features_yeo, 10, 1, 2)

Intra-African illicit financial flows¶

The analysis will now be restricted to intra-African illicit financial flows, that is, illicit capital that originates from African countries and which flows to African countries too.

Data wrangling¶

# Import bilateral trade mis-invoicing data
IFF_Dest = pd.read_csv('Data/GER_Orig_Dest_Year_Africa.csv')

# Restrict to destinations in Africa only
IFF_Dest_AFR = IFF_Dest[IFF_Dest['pRegion'] == 'Africa']
IFF_Dest_AFR = IFF_Dest_AFR.fillna(0).drop(columns=['reporter', 'rIncome', 'rDev', 
                                                    'partner', 'pRegion', 'pIncome', 'pDev',
                                                    'Imp_IFF_lo', 'Exp_IFF_lo']) \
            .set_index(['reporter.ISO', 'year'])

# Extract trade mis-invoicing in imports and exports, respectively
IFF_Dest_Imp_AFR = IFF_Dest_AFR[['partner.ISO', 'Imp_IFF_hi']]
IFF_Dest_Exp_AFR = IFF_Dest_AFR[['partner.ISO', 'Exp_IFF_hi']]

Let’s create the feature space from the African partner countries which are the destinations of illicit ouflows originating from the countries listed in reporter.ISO.

partner_features_AFR = create_features(IFF_Dest_Imp_AFR, 'Imp_IFF_hi', 
                                       features='partner.ISO', obs='reporter.ISO')
partner_features_AFR

	partner.ISO	AGO	BDI	BEN	BFA	BWA	CAF	CIV	CMR	COG	COM	...	STP	SWZ	SYC	TGO	TUN	TZA	UGA	ZAF	ZMB	ZWE
reporter.ISO	year
AGO	2009	0.0	0.0	0.0	0.0	0.000000e+00	0.0	7.356525e+05	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	1.628157e+06	0.000000e+00	0.0	1.016827e+09	1.060946e+05	309894.487818
	2010	0.0	0.0	0.0	0.0	2.025235e+05	0.0	5.577699e+07	53088.378146	1.402793e+08	0.0	...	0.000000	0.0	0.0	252999.300576	1.137066e+08	5.529461e+03	0.0	3.315930e+08	2.665205e+04	420951.156359
	2011	0.0	0.0	0.0	0.0	5.501246e+03	0.0	6.262938e+06	257786.746458	3.466061e+07	0.0	...	0.000000	0.0	0.0	0.000000	4.623718e+07	1.745372e+06	0.0	3.281534e+08	4.525653e+05	0.000000
	2012	0.0	0.0	0.0	0.0	0.000000e+00	0.0	1.425377e+07	222663.224270	2.587614e+07	0.0	...	8435.069683	0.0	0.0	0.000000	1.262796e+07	1.253616e+05	0.0	7.944924e+08	1.180964e+03	0.000000
	2013	0.0	0.0	0.0	0.0	2.945193e+05	0.0	2.125395e+06	0.000000	4.357466e+06	0.0	...	957.605075	0.0	0.0	0.000000	4.438123e+06	1.035603e+05	0.0	4.272753e+08	2.855892e+05	0.000000
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ZWE	2011	0.0	0.0	0.0	0.0	8.741233e+07	0.0	0.000000e+00	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	0.000000e+00	1.295280e+06	0.0	3.088036e+09	5.016118e+07	0.000000
	2012	0.0	0.0	0.0	0.0	3.187848e+07	0.0	0.000000e+00	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	0.000000e+00	2.729399e+06	0.0	1.242790e+09	1.704507e+08	0.000000
	2013	0.0	0.0	0.0	0.0	0.000000e+00	0.0	0.000000e+00	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	0.000000e+00	0.000000e+00	0.0	0.000000e+00	0.000000e+00	0.000000
	2014	0.0	0.0	0.0	0.0	0.000000e+00	0.0	0.000000e+00	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	0.000000e+00	0.000000e+00	0.0	0.000000e+00	0.000000e+00	0.000000
	2015	0.0	0.0	0.0	0.0	8.142369e+06	0.0	0.000000e+00	0.000000	0.000000e+00	0.0	...	0.000000	0.0	0.0	0.000000	0.000000e+00	8.238550e+06	0.0	6.474374e+08	0.000000e+00	0.000000

604 rows × 46 columns

There are strong regional dynamics in international trade, so it is useful to extract information on which regions the countries belong to. The crosswalk data-set includes information on the geographical groupings of each country. The codes are taken from the UN Statistics Division M49 standard (https://unstats.un.org/unsd/methodology/m49/).

The UN_Intermediate_Region code is nested within the UN_Sub-region code. In the M49 standard grouping for Africa, there are two sub-regions: Northern Africa and Sub-Saharan Africa. The latter contains further geographical groupings: Eastern Africa, Middle Africa, Southern Africa, Western Africa. Therefore, we need to reconcile both so that the regional labels include Northern Africa and the 4 groupings within Sub-Saharan Africa. The code below achieves this, and this is automatized by the region_labels() function further down.

# Merge in UN Intermediate Region groups
partner_features_AFR = pd.merge(left=partner_features_AFR.reset_index(), 
                                right=crosswalk[['ISO3166.3', 'UN_Intermediate_Region']].drop_duplicates('ISO3166.3'), 
                                how='left', 
                                left_on='reporter.ISO', right_on='ISO3166.3')

# Which African countries do not have a UN Intermediate region?
noregion_mask = partner_features_AFR['UN_Intermediate_Region'].isnull()
countries_noregion = partner_features_AFR[noregion_mask]['reporter.ISO'].unique().tolist()
countries_noregion

['DZA', 'EGY', 'LBY', 'MAR', 'SDN', 'TUN']

# Which UN Sub-region do those countries belong to?
crosswalk[crosswalk['ISO3166.3'].isin(countries_noregion)]['UN_Sub-region'].unique()

array(['Northern Africa'], dtype=object)

# Assign Northern Africa as a group for those countries
partner_features_AFR.loc[partner_features_AFR['reporter.ISO'].isin(countries_noregion), 'UN_Intermediate_Region'] = 'Northern Africa'
partner_features_AFR = partner_features_AFR.set_index(['reporter.ISO', 'year']).drop(columns=['ISO3166.3', 'UN_Intermediate_Region'])

The figure below displays the biplot for the first two principal components estimated using the intra-African bilateral matrix of illicit flows.

biplot_PCA(partner_features_AFR, partner_features_AFR.shape[1], 1, 2)

The countries with the top 3 loadings on the first principal component are Mozambique (MOZ), Madagascar (MDG) and Comoros (COM). Madagascar and the Comoros are islands in the Mozambique channel, where two well-known conduits of illicit financial flows are located: the Seychelles and Mauritius.

The second principal component seems to capture Northern Africa, with Algeria and Tunisia scoring the second and third largest loading, respectively.

Removing outliers¶

There are vibrant debates in the field on estimating trade mis-invoicing based on mismatches in official trade data, much of which is outside the scope of this project. However, suffice to say that there are some outliers that might have large values of mis-invoicing due to benign reasons. For example, South Africa does not report gold exports to UN Comtrade (the underlying source of official trade statistis), and so any discrepancies found by looking at its mirrored trade statistics would result in spurious mis-invoicing estimates.

I remove outliers based on extreme values on the first two principal components. This effectively removes most of the observations for South Africa. Further work is needed, though, to remove outliers based on domain knowledge, especially if using this data to do a deep dive on certain dimensions (rather than looking at aggregate patterns, as is the case here).

features_data_std = StandardScaler().fit_transform(partner_features_AFR)
pca = PCA(n_components=partner_features_AFR.shape[1])
princ_comp = pca.fit_transform(features_data_std)
outlying = (princ_comp[:,0] > 6) | (princ_comp[:,1] > 4)
partner_features_AFR[outlying].index

MultiIndex([('AGO', 2010),
            ('CIV', 2013),
            ('EGY', 2012),
            ('GHA', 2002),
            ('MAR', 2005),
            ('MAR', 2006),
            ('MAR', 2007),
            ('MAR', 2008),
            ('MAR', 2009),
            ('MAR', 2010),
            ('MAR', 2011),
            ('MAR', 2012),
            ('NGA', 2011),
            ('ZAF', 2005),
            ('ZAF', 2006),
            ('ZAF', 2008),
            ('ZAF', 2010),
            ('ZAF', 2011),
            ('ZAF', 2012),
            ('ZAF', 2013),
            ('ZAF', 2014),
            ('ZAF', 2015),
            ('ZAF', 2016)],
           names=['reporter.ISO', 'year'])

partner_features_AFR_noout = partner_features_AFR[~outlying]

The figure below displays the projection of the data without outliers on the first 2 principal components. Some interesting regional patterns appear.

Top loadings on the first principal component seem to capture a western African hub, with Cote d’Ivoire and Togo scoring highly. By contrast, the second principal component seems to recover a southern African hub, with top loadings going to South Africa, Malawi, and Botswana.

biplot_PCA(partner_features_AFR_noout, 46, 1, 2)

In order to discern better any groupings, the next section will color the observations according to an aggregate class label.

Coloring by class label¶

The function region_labels() extracts the regional labels from the crosswalk and further divides the “Sub-Saharan Africa” group into its 4 corresponding subgroups, as discussed above.

def region_labels(features_data):
    """
    Return regional labels for countries in the data
    
    features_data: as Pandas dataframe, data-set of features
    """
    # Extract observation labels from features data
    obs_labels = pd.DataFrame(features_data.reset_index().set_index('reporter.ISO').index)

    # Merge with UN intermediate regions from crosswalk
    obs_labels = pd.merge(left=obs_labels,
                          right=crosswalk[['ISO3166.3', 'UN_Intermediate_Region']].drop_duplicates('ISO3166.3'), 
                          how='left', 
                          left_on='reporter.ISO', right_on='ISO3166.3') \
    .drop(columns='ISO3166.3')

    # Create a mask for the Northern African regions and replace the missing intermediate region
    noregion_mask = obs_labels['UN_Intermediate_Region'].isnull()
    countries_noregion = obs_labels[noregion_mask]['reporter.ISO'].unique().tolist()
    obs_labels.loc[obs_labels['reporter.ISO'].isin(countries_noregion), 'UN_Intermediate_Region'] = 'Northern Africa'

    return obs_labels

The function biplot_PCA_classes() projects the data in the two principal components chosen by the user, and further colors the observations either according to their UN regional grouping (by specifying classes='UN_Intermediate_Region') or according to their World Bank income group classification (by specifying classes='Income group (World Bank)').

def biplot_PCA_classes(features_data, nPC=2, firstPC=1, secondPC=2, classes='UN_Intermediate_Region'):
    """
    Projects the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and colors
    by class label.

    Args:
        features_data: data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        classes: {'UN_Intermediate_Region', 'Income group (World Bank)'}, as string, class label
            by which to color observations
    Returns:
        plot (interactive)
        pca_loadings (if show_loadings=True)
    """
    
    # Unit of observation is the reporting country
    obs='reporter.ISO'
    
    # Run PCA (standardize beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
#     features_data_std = power_transform(features_data, method='yeo-johnson', standardize=True)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
                                columns=cols,
                                index=list(features_data.columns))
   
    # Scores
    pca_scores = pd.DataFrame(princ_comp, 
                              columns=cols)
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    
    # Plot data
    plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
    tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
    
    obs_labels = region_labels(features_data)
    plot_data = pd.merge(left=plot_data, right=obs_labels.drop_duplicates('reporter.ISO'), on='reporter.ISO')
    
    if classes == 'UN_Intermediate_Region':
        color_scheme = 'paired'
    else:
        color_scheme = 'dark2'

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)
    
    # Top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    
    # Project top 3 loadings over the space spanned by 2 principal components
    lines = alt.Chart().mark_line().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0,1], [toploadings_PC1, toploadings_PC2]):
        lines[i] = alt.Chart(dataset).mark_line(color=color).encode(
        x= PC1 +':Q',
        y= PC2 +':Q',
        detail='index'
    ).properties(
        width=400,
        height=400
    )
    
    # Add labels to the loadings
    text=alt.Chart().mark_text().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0, 1], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text[i] = alt.Chart(dataset).mark_text(
                align='left',
                baseline='bottom',
                color=color
            ).encode(
                x= PC1 +':Q',
                y= PC2 +':Q',
                text='index'
            )
    
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.X(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(classes, scale=alt.Scale(scheme=color_scheme),
               legend=alt.Legend(orient='right')),
        tooltip=tooltip_obs
    ).interactive()
    
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    
    chart.display()

The following graphs suggest that the first two components of PCA seem to partition the feature space remarkably well according to regional groupings, and to a lesser extent according to income grouping. It should be noted that there is less variation in the income groupings in Africa as a whole, with few countries belong to the high income group.

The biplot which uses African partner countries as the feature space (after removing outliers) and which colors by regional grouping is the most striking (scroll down below to see it). Recall that the loadings refer to the partner countries (i.e. the destinations of the illicit outflows) which have the heaviest loadings on the first/second principal components, while the observations are colored according to the reporter country (i.e. the origin of the illicit flow). This biplot suggests that the partner features which have the highest weight on the first principal components correspond to a western African hub (Cote d’Ivoire and Togo), and that those observations with the highest principal component scores in that rotated space correspond to western African reporting countries.

Similarly, the largest weights on the second principal component refer to a southern African partner hub (South Africa, Malawi, and Botswana), while the observations with the highest scores correspond to northern and southern African countries.

Therefore, it appears that there are strong sub-regional dynamics in intra-African illicit financial flows. The policy implication of this finding is that existing sub-regional policy initiatives could be leveraged to combat illicit financial flows within the continent. For example, the Economic Community of West African States (ECOWAS) could anchor policy efforts in western Africa while the East African Community (EAC) could attend to the east.

For sector features

biplot_PCA_classes(sector_features, 21, 1, 2, classes='UN_Intermediate_Region')

biplot_PCA_classes(sector_features, 21, 1, 2, classes='Income group (World Bank)')

For partner features

biplot_PCA_classes(partner_features, 46, 1, 2)

biplot_PCA_classes(partner_features, 46, 1, 2, classes='Income group (World Bank)')

For partner features, restricted to African partners

biplot_PCA_classes(partner_features_AFR, 46, 1, 2)

biplot_PCA_classes(partner_features_AFR, 46, 1, 2, classes='Income group (World Bank)')

For partner features, restricted to African partners, without outliers

biplot_PCA_classes(partner_features_AFR_noout, 46, 1, 2)

biplot_PCA_classes(partner_features_AFR_noout, 46, 1, 2, classes='Income group (World Bank)')

Clustering¶

Agglomerative clustering¶

This section investigates clustering techniques to see if they recover some of the groupings observed above.

# Run PCA on first 2 principal components on intra-African partners, without outliers
features_data_std = StandardScaler().fit_transform(partner_features_AFR_noout)
pca = PCA(n_components=10)
princ_comp = pca.fit_transform(features_data_std)
X = princ_comp[:,[0,1]]

# Perform agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters=5, linkage='ward').fit(X)
plt.scatter(X[:, 0], X[:, 1], c=clustering.labels_);
plt.title('Agglomerative clustering using Ward linkage');

_images/Unsupervised-learning-trade_191_0.png

Agglomerative clustering with Ward linkage results in better identified clusters on the first two principal components (compared to single, complete, or average linkage; code omitted).

However, the groupings do not correspond to the ground truth labels of either region or income grouping. Let’s briefly investigate the observation labels in each cluster to see if discernable patterns appear. The labels in the first cluster are given by:

# Show labels in first cluster
partner_features_AFR_noout.index[np.where(clustering.labels_ == 0)[0]]

MultiIndex([('AGO', 2011),
            ('AGO', 2012),
            ('BEN', 2013),
            ('BEN', 2014),
            ('BFA', 2014),
            ('CIV', 2008),
            ('CIV', 2009),
            ('CIV', 2010),
            ('CIV', 2011),
            ('CIV', 2012),
            ('CIV', 2014),
            ('CMR', 2012),
            ('CMR', 2014),
            ('DZA', 2014),
            ('DZA', 2015),
            ('DZA', 2016),
            ('EGY', 2015),
            ('ETH', 2012),
            ('GHA', 2000),
            ('GHA', 2001),
            ('GHA', 2010),
            ('GHA', 2013),
            ('MLI', 2007),
            ('MLI', 2010),
            ('MLI', 2011),
            ('MLI', 2012),
            ('NGA', 2013)],
           names=['reporter.ISO', 'year'])

Let’s extract the unique country labels per cluster.

# Countries in first cluster
cluster_labels = partner_features_AFR_noout.index[np.where(clustering.labels_ == 0)[0]].to_list()
country_labels = [i[0] for i in cluster_labels]
set(country_labels)

{'AGO', 'BEN', 'BFA', 'CIV', 'CMR', 'DZA', 'EGY', 'ETH', 'GHA', 'MLI', 'NGA'}

No immediately discernable pattern appears in the first cluster. This is the case for the remaining clusters too (code hidden).

# Countries in second cluster
cluster_labels = partner_features_AFR_noout.index[np.where(clustering.labels_ == 1)[0]].to_list()
country_labels = [i[0] for i in cluster_labels]
set(country_labels)

{'AGO',
 'BEN',
 'BFA',
 'CIV',
 'CMR',
 'COG',
 'DZA',
 'EGY',
 'ETH',
 'GAB',
 'GHA',
 'GIN',
 'GMB',
 'MAR',
 'MLI',
 'MRT',
 'NER',
 'NGA',
 'SDN',
 'SEN',
 'TGO',
 'TUN',
 'ZAF'}

# Countries in third cluster
cluster_labels = partner_features_AFR_noout.index[np.where(clustering.labels_ == 2)[0]].to_list()
country_labels = [i[0] for i in cluster_labels]
set(country_labels)

{'AGO',
 'BDI',
 'EGY',
 'MWI',
 'NGA',
 'RWA',
 'SDN',
 'TZA',
 'UGA',
 'ZAF',
 'ZMB',
 'ZWE'}

# Countries in fourth cluster
cluster_labels = partner_features_AFR_noout.index[np.where(clustering.labels_ == 3)[0]].to_list()
country_labels = [i[0] for i in cluster_labels]
set(country_labels)

{'BDI',
 'BWA',
 'EGY',
 'KEN',
 'MDG',
 'MOZ',
 'MWI',
 'NAM',
 'RWA',
 'TZA',
 'UGA',
 'ZAF',
 'ZMB',
 'ZWE'}

# Countries in fifth cluster
cluster_labels = partner_features_AFR_noout.index[np.where(clustering.labels_ == 4)[0]].to_list()
country_labels = [i[0] for i in cluster_labels]
set(country_labels)

{'BDI',
 'BEN',
 'BFA',
 'BWA',
 'CAF',
 'CIV',
 'CMR',
 'COG',
 'COM',
 'CPV',
 'DZA',
 'EGY',
 'ETH',
 'GAB',
 'GHA',
 'GIN',
 'GMB',
 'GNB',
 'KEN',
 'LBY',
 'LSO',
 'MAR',
 'MDG',
 'MLI',
 'MOZ',
 'MRT',
 'MUS',
 'MWI',
 'NAM',
 'NER',
 'NGA',
 'RWA',
 'SDN',
 'SEN',
 'SLE',
 'STP',
 'SWZ',
 'SYC',
 'TGO',
 'TUN',
 'TZA',
 'UGA',
 'ZAF',
 'ZMB',
 'ZWE'}

Hierarchical clustering and corresponding heatmap¶

Rather than specifying ex ante the number of clusters, I now perform hierarchical clustering, restricting myself to illicit trade in 2016. I have explored various transformations, scalings, and types of linkages. I find that a (modified) log transformation on standardized data (with mean 0 and unit variance) with Ward linkage (which minimizes the sum of squared differences within all clusters) performs rather well.

# Create data for African bilateral trade matrix in 2016
data = partner_features_AFR_noout.reset_index() \
    .query('year == "2016"').drop(columns='year').set_index('reporter.ISO')

# Select the columns which have at least 1 non-zero row 
data = data.loc[:,(data != 0).any(axis=0)]

# Log the data
data = data.apply(lambda x: np.log10(x+1) if np.issubdtype(x.dtype, np.number) else x, axis=0)

# Scale the data
scaler = StandardScaler()
# scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
data_scaled = pd.DataFrame(data_scaled, index=data.index, columns=data.columns)

# Extract region labels for the observations
obs_labels = region_labels(data).set_index('reporter.ISO')

# Extract unique regions
color_labels = obs_labels['UN_Intermediate_Region'].unique()
color_labels = np.sort(color_labels)

# List of RGB triplets for the region labels
rgb_values = sns.color_palette('Paired', 5)

# Map region labels to RGB triplet
color_map = dict(zip(color_labels, rgb_values))
region_color = pd.DataFrame(obs_labels)['UN_Intermediate_Region'].map(color_map)

The reporting countries are plotted in the rows of the heatmap below, while the partner countries (restricted to non-outlying African countries) are displayed in the columns. Dendograms for each dimension are displayed. In addition, the first column colors the observations according to their ground truth regional label.

On the origins dimension (i.e. reporters), the two highest-level clusters correspond broadly to western African (in pink) countries on the one hand, and eastern (in light blue) and middle African (in dark blue) countries on the other.

# Plot heatmap and corresponding dendograms
g = sns.clustermap(data_scaled,
                   method='ward',
                   cmap='bone_r',
                   row_colors=region_color,
                  );
g.fig.suptitle('Hierarchical clustering with Ward linkage', y=1);

_images/Unsupervised-learning-trade_206_0.png

Network analysis¶

The matrix of bilateral mis-invoiced imports between African reporter countries and partner countries can be conceptualized as a weighted directed graph. Recall that the data-set IFF_Dest contains the mis-invoiced imports and exports for a reporter-partner-year tuple. The underlying data has been aggregated across commodity sectors using the Gross Excluding Reversals (GER) method, which ignores all inflows of capital (i.e. negative values), and simply sums up the positive outflows across trading partners.

The table below shows the IFF_Dest data-set that we are working with:

	reporter	reporter.ISO	rIncome	rDev	year	partner	partner.ISO	pRegion	pIncome	pDev	Imp_IFF_lo	Imp_IFF_hi	Exp_IFF_lo	Exp_IFF_hi
0	Algeria	DZA	UMC	Developing	2001	Andorra	AND	Europe	HIC	Developed	7.925637e+03	1.609561e+04	NaN	NaN
1	Algeria	DZA	UMC	Developing	2001	Argentina	ARG	Americas	HIC	Developing	1.914238e+07	4.717459e+07	NaN	NaN
2	Algeria	DZA	UMC	Developing	2001	Australia	AUS	Oceania	HIC	Developed	1.589172e+06	2.027641e+07	NaN	1412.374399
3	Algeria	DZA	UMC	Developing	2001	Austria	AUT	Europe	HIC	Developed	1.511117e+07	7.641706e+07	NaN	NaN
4	Algeria	DZA	UMC	Developing	2001	Belgium	BEL	Europe	HIC	Developed	9.357981e+06	2.285729e+07	NaN	NaN

Thus, the values in the import column of IFF_Dest correspond to import over-invoicing which result in an illicit outflow from reporter $i$ to partner $j$. Therefore, we can consider each entry in IFF_Dest to represent an edge going out from the reporter into the partner country.

Data wrangling¶

Let’s download some indicators from the World Bank that we can use as node attributes.

The covariates control-corruption and rule-of-law refer to the percentile rank that each country achieves on these governance dimensions. The control of corruption measure captures perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as “capture” of the state by elites and private interests. The rule of law variable captures perceptions of the extent to which agents have confidence in and abide by the rules of society, and in particular the quality of contract enforcement, property rights, the police, and the courts, as well as the likelihood of crime and violence. The underlying source for these data is the World Bank’s Worldwide Governance Indicators (WGI) database.

Nota bene: these will be used in further work and are not exploited in this project.

The variable gdp-pc measures each country’s GDP per capita in current US $ and is sourced from the World Bank’s World Development Indicators (WDI) database.

# Specify indicators of interest to import from World Bank
indicators = {'CC.PER.RNK': 'control-corruption', 
              'RL.PER.RNK': 'rule-of-law',
              'NY.GDP.PCAP.CD': 'gdp-pc'}

# Import covariates
covariates = wb.get_dataframe(indicators,  
                              data_date=(datetime.datetime(2016, 1, 1)))

# Merge in ISO codes from the crosswalk
covariates = pd.merge(left=covariates.reset_index().rename(columns={'index': 'country'}), 
                      right=crosswalk[['country', 'ISO3166.3']], 
                      how='left', on='country').rename(columns={'date': 'year'})
covariates = covariates.dropna(subset=['ISO3166.3']).set_index('ISO3166.3')
covariates

	country	control-corruption	rule-of-law	gdp-pc
ISO3166.3
AFG	Afghanistan	3.365385	5.769231	547.228110
ALB	Albania	40.865380	41.826920	4124.055726
DZA	Algeria	27.884610	18.750000	3946.443977
ASM	American Samoa	87.500000	87.980770	11696.955562
AND	Andorra	87.500000	90.865390	37474.665406
...	...	...	...	...
VIR	Virgin Islands (U.S.)	72.115390	73.076920	35931.541252
PSE	West Bank and Gaza	51.923080	40.865380	3074.291152
YEM	Yemen, Rep.	1.442308	3.365385	1033.733043
ZMB	Zambia	41.346150	43.269230	1280.578898
ZWE	Zimbabwe	9.615385	8.173077	1464.583529

223 rows × 4 columns

Now, let’s create an auxiliary function that transforms IFF_Dest into an appropriate format for network analysis, and applies some thresholding.

The function create_graph_data() accomplishes the following:

Extract data from a specific year, in order to perform network analysis at a specific snapshot in time.
Ignore illicit flows that are less than a certain value specifid by flow_threshold. The default is to ignore flows that are less than $10,000. This is necessary because the mis-invoicing estimation procedure, by construction, will locate IFF as discrepancies between a country’s reported imports and the mirror exports declared by its partner country (after adjustment). But any discrepancy, no matter how small, would result in the creation of an edge between a given dyad. If no thresholding on the value of the mis-invoiced flow were applied, then the directed graph would simply recover the topology of international trade networks.
Further restrict the graph, if threshold is True, to reporting countries which are considered “conduits” because they have the highest value of IFF in relation to either their GDP or total trade value.

def create_graph_data(year='2016', threshold=True, threshold_var='GDP', flow_threshold=10000):
    """
    Returns data-set to be used to generate a directed graph.
    
    year: as string, specify which year to use for the network analysis (between 2000-2016)
    threshold: as boolean, indicate whether to restrict reporting countries to conduits
    threshold_var: as string, indicate which variable to use when determining which 
        reporting countries are conduits ('GDP' or 'trade')
    flow_threshold: as numeric, specify the cut-off under which to ignore the dollar values
        of mis-invoiced imports
    """
    flow_data = IFF_Dest.reset_index().query('year == @year')
    flow_data = flow_data.loc[flow_data['Imp_IFF_hi'].notnull(), :]
    flow_data = flow_data.query('Imp_IFF_hi >= @flow_threshold')
    
    if threshold:
        conduits = 'conduits_' + threshold_var
        flow_data = flow_data[flow_data['reporter.ISO'].isin(eval(conduits).index)]
    return flow_data

Thresholding¶

The figure below displays the distribution of over-invoiced imports by African countries in 2016, as a proportion of their GDP and of their total trade value, respectively. The distribution of outflows relative to trade value has a higher variance than that of outflows relative to GDP, with some countries’ experiencing up to 30% of mis-invoicing in their total trade with the rest of the world.

# Import data on mis-invoiced trade aggregated by destination and sector for each African country
IFF_Year = pd.read_csv('Data/GER_Orig_Year_Africa.csv')

# Restrict data to 2016 as an illustrative example
IFF_Year = IFF_Year.query('year == "2016"')

# Plot distribution of proportional mis-invoiced imports for each African country
sns.distplot(IFF_Year['Tot_IFF_hi_GDP'].apply(lambda x: x*100), 
             kde=True, label='% GDP', bins=10);
sns.distplot(IFF_Year['Tot_IFF_hi_trade'].apply(lambda x: x*100),
             kde=True, label='% trade', bins=10);
plt.legend()
plt.title('Distribution of IFF in Africa in 2016')
plt.xlabel('Mis-invoiced trade for African countries');

_images/Unsupervised-learning-trade_221_0.png

Mean outflow as proportion of GDP: 0.06371070220761896 
Mean outflow as proportion of trade 0.006736362224603777

Those countries that experience disproportionately high outflows can be considered as “conduits” on which we can perform the network analysis.

We can use the figure above to inform our choice of thresholds for the proportion of IFF relative to GDP and trade, respectively, above which countries are considered conduits. Those thresholds are assigned to the variables thresh_GDP and thresh_trade, respectively.

thresh_GDP = 0.1
thresh_trade = 0.17
conduits_GDP = IFF_Year.query('Tot_IFF_hi_GDP >= @thresh_GDP').set_index('reporter.ISO')
conduits_trade = IFF_Year.query('Tot_IFF_hi_trade >= @thresh_trade').set_index('reporter.ISO')

The conduit countries respective to GDP are the Seychelles, where illicit outflows represented 22% of GDP in 2016, followed by Mozambique, Mali, and Tunisia.

	reporter	year	Tot_IFF_hi_GDP	GDP
reporter.ISO
MLI	Mali	2016	0.102023	1.401079e+10
MOZ	Mozambique	2016	0.141781	1.098136e+10
SYC	Seychelles	2016	0.216881	1.427525e+09
TUN	Tunisia	2016	0.100490	4.180838e+10

The countries which experienced the highest level of mis-invoicing relative to their trading activity in 2016 were Burundi, where up to 31% of trade was illicit, followed by Uganda and Mali. Mali and Mozambique are conduit countries both when considering outflows relative to GDP and relative to trade openness.

	reporter	year	Tot_IFF_hi_trade	Total_value
reporter.ISO
DZA	Algeria	2016	0.175611	77082785056
BDI	Burundi	2016	0.312707	754378325
MLI	Mali	2016	0.210566	6788467564
MOZ	Mozambique	2016	0.180047	8647421818
NGA	Nigeria	2016	0.179441	68077346474
STP	São Tomé and Príncipe	2016	0.204593	149825956
UGA	Uganda	2016	0.233074	7802352503

Finally, in addition to restricting our graph to certain conduit countries where the capital flows out of, we should also threshold on the partner countries to which the mis-invoiced capital flows in to. The figure below illustrates the distribution of illicit inflows as a proportion of their GDP for each partner country included in our bilateral flow data. That is, the weight on the edge from reporter $i$ to partner $j$ is relative to each partner $j$’s GDP in the chosen year. The magnitudes are smaller than the previous figure which considers the distribution of illicit outflows for each reporter $i$ aggregated over all partners $j$.

# Import data where value of illicit flow is standardized for partners
IFF_std = pd.read_csv('Data/GER_Orig_Dest_Year_std.csv')
IFF_std = IFF_std.query('year == "2016"')

# Merge in with flow data
flow_data = pd.merge(left=create_graph_data('2016', threshold_var='GDP'),
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_hi_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])

# Plot distribution of IFF in partner countries
sns.distplot(flow_data['pImp_IFF_hi_GDP'], kde=True);
plt.title('Distribution of IFF in partner countries in 2016 (dyad-level)')
plt.xlabel('Mis-invoiced trade as proportion of GDP in partner countries');

_images/Unsupervised-learning-trade_230_0.png

The distribution of outflows is heavily right-skewed, where the yearly outflow in 2016 from a single country to their partner resulted in an inflow to said partner which sometimes represented up to 2% of their GDP.

The helper function threshold_partner_IFF() thus helps us further threshold our bilateral flow data from the partner country perspective. By default, it restricts those bilateral flows to transactions where the inflow to the partner represented at least 0.01% of their GDP.

def threshold_partner_IFF(flow_data, year='2016', partner_threshold=0.0001):
    """
    Filters bilateral flow data-set to minimum level of illicit flow relative to partner GDP.
    
    flow_data: as Pandas dataframe, name of data-set which contains bilateral flow data
        (in wide format)
    year: as string, specify which year to use for the network analysis (between 2000-2016)
    partner_threshold: as numeric, specify the cut-off for the minimum proportion of partner
        GDP that a bilateral flow must represent in order to be included
    """  
    IFF_std = pd.read_csv('Data/GER_Orig_Dest_Year_std.csv')
    IFF_std = IFF_std.query('year == @year')
    flow_data = pd.merge(left=flow_data,
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_hi_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])
    flow_data = flow_data.query('pImp_IFF_hi_GDP >= @partner_threshold')
    return flow_data

Let’s now create the directed graph data for countries which are conduits for illicit outflows relative to GDP and relative to trade value, respectively, and further restrict the data to edges where the mis-invoiced trade represents at least 0.01% of partner GDP.

Directed graph for GDP¶

Let’s create a directed graph for the flow data restricted to the top origins of illicit outflows relative to GDP, and applying the standard thresholding described above.

# Graph data for conduits relative to GDP
flow_data_GDP = create_graph_data('2016', threshold_var='GDP')
flow_data_GDP = threshold_partner_IFF(flow_data_GDP)

# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_GDP,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF_hi',
                                create_using = nx.DiGraph())

This graph contains 33 vertices (or nodes) and 36 edges.

# Check the number of nodes and edges in the graph
len(graph.nodes()), len(graph.edges())

(33, 36)

# Print some of the edges in the network
list(graph.edges(data=True))[0:10]

[('MLI', 'GHA', {'Imp_IFF_hi': 8081515.07373014}),
 ('MLI', 'SEN', {'Imp_IFF_hi': 449650318.36780304}),
 ('MLI', 'ZAF', {'Imp_IFF_hi': 66862258.2240592}),
 ('MLI', 'TGO', {'Imp_IFF_hi': 9367585.98900624}),
 ('MLI', 'MRT', {'Imp_IFF_hi': 1267279.60791437}),
 ('MOZ', 'NAM', {'Imp_IFF_hi': 1165110.23690534}),
 ('MOZ', 'NLD', {'Imp_IFF_hi': 128444043.062896}),
 ('MOZ', 'PRT', {'Imp_IFF_hi': 121508025.10009299}),
 ('MOZ', 'SGP', {'Imp_IFF_hi': 466068356.77873796}),
 ('MOZ', 'ZAF', {'Imp_IFF_hi': 246975704.837192})]

In a directed graph, the degree of a node is the sum of its indegree (the number of edges going in to the node) and its outdegree (the number of edges going out of the node). In this case, the GDP conduit countries of Mali, Mozambique, Seychelles, and Tunisia will have a degree of a least 1 (corresponding to the outdegree of at least one edge going out).

# Get the degree of the nodes
graph.degree

DiDegreeView({'MLI': 5, 'GHA': 1, 'SEN': 1, 'ZAF': 3, 'TGO': 1, 'MOZ': 7, 'NAM': 1, 'NLD': 2, 'PRT': 2, 'SGP': 2, 'SYC': 3, 'MUS': 2, 'TUN': 21, 'DZA': 1, 'AZE': 1, 'BEL': 1, 'BGR': 1, 'HRV': 1, 'DNK': 1, 'EGY': 1, 'FRA': 1, 'HUN': 1, 'ITA': 1, 'JOR': 1, 'MYS': 1, 'MRT': 2, 'PRY': 1, 'RUS': 1, 'SVK': 1, 'SVN': 1, 'TUR': 1, 'UKR': 1, 'FIN': 1})

The function below achieves the following:

color the country nodes according to their GDP per capita;
color the edges according to the size of the logged illicit flow;
size the nodes proportionately to their degree (for use in the map case);
only label nodes with an outdegree of at least 1, that is, only the origin conduit countries selected by the thresholding above will have labels (for use in the map case).

def set_graph_attributes(graph):
    """
    Sets node attributes and create auxiliary variables to be used in graph visualization.
    
    Returns:
    col: list of values to color nodes according to (node GDP per capita)
    edge_col: array of values to color edges according to (logged flow between nodes)
    sizes: list of values to size nodes (proportional to degree)
    labels: dict of node labels (where a node is labelled if outdegree is at least 1)
    """
    # Create dictionary of GDP per capita for each country
    GDP_attr = covariates.loc[:, ['gdp-pc']]
    GDP_attr = GDP_attr.to_dict('index')
    # Set GDP per capita as a node attribute
    nx.set_node_attributes(graph, GDP_attr)

    # Create list of colors for nodes in the graph
    col = [nx.get_node_attributes(graph, 'gdp-pc')[n] for n in graph.nodes]

    # Create list of colors for edges in the graph
    edge_col = [nx.get_edge_attributes(graph, 'Imp_IFF_hi')[e] for e in graph.edges]
    edge_col = np.log(edge_col)

    # Extract outdegree (the number of edges coming out of nodes) and degree
    outdeg = graph.out_degree
    deg = graph.degree
    # Size of nodes will be proportional to their degree
    sizes = [10 * deg[c] for c in graph.nodes]
    # Label the countries if their outdegree is at least 1, i.e. if they are the reporting African countries
    labels = {c: c if outdeg[c] >= 1 else ''
              for c in graph.nodes}
    return col, edge_col, outdeg, deg, sizes, labels

# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)

We can now draw the network of illicit flows from the top 4 origins of illicit outflows with respect to GDP.

# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');

_images/Unsupervised-learning-trade_248_0.png

While the network visualization above is informative, the position of the nodes is generated using a force-directed algorithm. However, the nodes have an intuitive positioning in cartographical coordinates. Therefore, I exploit this implicit positioning and project the nodes above in a space spanned by longitude and latitude.

Thus, let us extract the longitude and latitude of the countries in order to position the nodes on the map.

# Create dictionary of longitude and latitude for countries
geo_pos = {country: (v['Longitude'], v['Latitude'])
           for country, v in
           crosswalk.drop_duplicates('ISO3166.3').set_index('ISO3166.3').to_dict('index').items()
}
list(geo_pos.items())[0:10]

[('ABW', (-69.98267711, 12.52088038)),
 ('AFG', (66.00473366, 33.83523073)),
 ('AGO', (17.53736768, -12.29336054)),
 ('AIA', (-63.06498927, 18.2239595)),
 ('ALA', (19.95328768, 60.21488688)),
 ('ALB', (20.04983396, 41.14244989)),
 ('AND', (1.56054378, 42.54229102)),
 ('ANT', (nan, nan)),
 ('ARE', (54.300167099999996, 23.90528188)),
 ('ARG', (-65.17980692, -35.3813488))]

# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

_images/Unsupervised-learning-trade_252_0.png

Directed graph for trade¶

Now, let’s create a directed graph for the flow data restricted to the top origins of illicit outflows relative to trade value, and applying the standard thresholding described above.

# Graph data for conduits relative to trade
flow_data_trade = create_graph_data('2016', threshold_var='trade')
flow_data_trade = threshold_partner_IFF(flow_data_trade)

# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_trade,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF_hi',
                                create_using = nx.DiGraph())

This graph contains 70 vertices and 99 edges.

# Check the number of nodes and edges in the graph
len(graph.nodes()), len(graph.edges())

(70, 99)

# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)

The network of illicit flows from the top origins of illicit outflows relative to trade value is displayed below.

# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');

_images/Unsupervised-learning-trade_261_0.png

Finally, the figure below visualizes the directed graph on a map.

# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

_images/Unsupervised-learning-trade_263_0.png

Evolution of the network during 2000-2016¶

The gif below displays the evolution of the network of illicit trade originating from Africa during 2000-2016 (i.e. the period for which we have data). The darkness of the edges (reflecting the logged dollar value of illicit trade) increased markedly during 2008-2009, perhaps reflecting the commodity supercycle at the time. If prices of internationally traded commodities increased, then it stands to reason that the dollar value of mis-invoiced trade increased as well.

Europe is a destination hub for illicit outflows originating in Africa. The size of the nodes is proportional to their degree. The size of European nodes increased during 2008-2012 and decreased in 2015-2016.

# Create network maps of illicit flows for 2000-2016
for y in range(2000, 2017, 1):

    # Entire network of illicit flows in 2016
    flow_data_whole = create_graph_data(str(y), threshold=False, flow_threshold=1000000)

    # Create directed graph
    graph = nx.from_pandas_edgelist(flow_data_whole,
                                    'reporter.ISO',
                                    'partner.ISO', 
                                    'Imp_IFF_hi',
                                    create_using = nx.DiGraph())

    # Set map attributes
    col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)

    # Map projection
    crs = ccrs.PlateCarree()
    fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
    ax.coastlines(color='lightgray')
    ax.add_feature(cfeature.LAND, facecolor='lightgray')
    ax.add_feature(cfeature.BORDERS, edgecolor='white')
    # Uncomment for the map to span the whole world
    ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

    # Overlay network graph
    nx.draw_networkx(graph, ax=ax,
                     pos=geo_pos,
                     font_size=14,
                     alpha=0.7,
                     node_color=col, cmap=plt.cm.spring_r,
                     edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                     node_size=sizes,
                     labels=labels);
    fig.suptitle('Network of illicit outflows originating in Africa in ' + str(y), fontsize=20)
    plt.savefig('Images/Network_Maps/network_map_' + str(y) + '.png')
    plt.clf();

# Create gif from network map images
filenames = ['Images/Network_Maps/network_map_' + str(y) + '.png' for y in range(2000, 2017, 1)]
images = []
for filename in filenames:
    images.append(imageio.imread(filename))
imageio.mimsave('Images/Network_Maps/Maps.gif', images, duration=2)

Conclusion¶

This project has utilized the newest estimates of trade mis-invoicing of the United Nations Economic Commission for Africa (http://doi.org/10.5281/zenodo.3610558). The data contains estimates of the dollar value of illicit trade in both imports and exports for all reporting jurisdictions in the UN COMTRADE database, for 21 commodity sectors during the period 2000-2016.

Following the establishment of the High Level Panel on Illicit Financial Flows from Africa in 2015, the scale and severity of the problem has been well-recognized by policy-makers and NGOs. The next steps now are to move beyond scale estimates and to start understanding the nature, mechanisms, and directions of illicit trade on the continent. This project takes the first steps towards that goal.

I have utilized unsupervised machine learning techniques such as dimension reduction, clustering, and network analysis to extract meaningful insights from the data. Given that the feature space is large, especially when analyzing the data in its dyadic form, dimension reduction techniques are necessary to project the data in a lower-dimensional space.

Following that, I explored some clustering techniques to ascertain whether they were able to recover some of the ground truth labels of the observations such as regional groups or income classification. Cluster validation will require further work.

Finally, I exploited the bilateral nature of the data and perfomed a preliminary network analysis for illicit trade originating from African countries. Given the methodology used to estimate mis-invoicing, an edge would be created between a reporter-partner pair for any discrepancy in mirrored trade statistics. Therefore, it is necessary to apply thresholding before conducting network analysis. Further work will need to examine network statistics such as centrality measures in order to yield further insights.

Policy recommendations

Target 16.4 of the SDGs: “By 2030, significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organized crime”
Estimates of the magnitude and severity of the problem have shed a spotlight on the need to combat illicit financial flows
Deeper understanding of the origins and destinations is now crucial to guiding targeted interventions to combat illicit flows
Strong subregional effects are apparent in illicit trade within the continent. Initiatives to combat illicit flows could be anchored within the Regional Economic Communities of the African Union, such as ECOWAS.

Next steps

I plan to take this project further. Notably, I plan to conduct analysis on the most disaggregated view of the data, that is, for a reporter-partner-year-commodity tuple. By filtering by commodity sector, I will be able to identify the relevant sinks and sources.

Moreover, I am currently exploring spectral clustering. However, spectral clustering algorithms are currently implemented for undirected rather than directed graphs. There are several possible approaches to dealing with clustering on directed graphs. One of them includes a naive approach where direction is ignored, and where the graph is treated as an undirected network.

Finally, I am considering non-negative matrix factorization as an alternative form of dimension reduction.

Illicit Financial Flows

Table of Contents

Trade mis-invoicing in Africa¶

Goals¶

Data wrangling¶

Mis-invoiced trade data (for African reporting countries)¶

Mis-invoiced trade data (bilateral trade matrix)¶

Crosswalk data¶

Auxiliary functions¶

PCA on feature space (for individual reporting countries)¶

Sector features¶

Biplots¶

Explained variance¶

Variance-stabilizing and normalizing transformations¶

Country features¶

Biplots¶

Explained variance¶

Variance-stabilizing and normalizing transformations¶

PCA on bilateral trade matrix¶

Biplots¶

Explained variance¶

Variance-stabilizing and normalizing transformations¶

Intra-African illicit financial flows¶

Data wrangling¶

Removing outliers¶

Coloring by class label¶

Clustering¶

Agglomerative clustering¶

Hierarchical clustering and corresponding heatmap¶

Network analysis¶

Data wrangling¶

Thresholding¶

Directed graph for GDP¶

Directed graph for trade¶

Evolution of the network during 2000-2016¶

Conclusion¶