Table of Contents

Trade mis-invoicing in Africa

Policy and media attention on illicit financial flows (IFF) has increased, with the recognition that Africa is a net creditor to the world.

NYT-2013 Guardian-2015
New York Times (2013) Guardian (2015)
Guardian-2017 Economist-2019
Guardian (2017) Economist (2019)

What is trade mis-invoicing?

  • The deliberate mis-statement of price or quantity of internationally traded goods in invoices presented to customs

  • Can occur at import or export

  • Can result in an inflow or outflow of money

Motivations for trade mis-invoicing include:

  • Evading tariffs

  • Exploiting subsidy regimes

  • Subverting forex and capital controls

  • Hiding transfers of wealth

Mechanisms of mis-invoicing

From the reporting country’s perspective, trade mis-invoicing can result in an inflow or an outflow of capital, and this can be achieved by misreporting the value of both imports and exports. Money can be moved out of the country by over-invoicing imports, where that country pays too much money to buy goods from its partner; or by under-invoicing exports, where that country does not charge enough money for the goods that it sells to its partner. Conversely, money can be illicitly routed in to that country by under-invoicing imports, where the country pays too little money to buy goods from its partner; or by over-invoicing exports, where the country charges too much for the goods that it sells to its partner. The direction of trade mis-invoicing in addition to its mechanisms is represented in the diagram below.


Why does trade mis-invoicing matter?

  • Outflows undermine the fiscal base and public spending

  • Financing gap needed to meet the Sustainable Development Goals (SDGs)

  • Combating trade mis-invoicing is crucial for the mobilization of domestic resources in the continent, and can catalyze sustainable development

Governance loop (credit: William Davis)

Data source

  • Trade mis-invoicing data-set of the United Nations Economic Commission for Africa

  • Panel with \(n=6,248,254\) of mis-invoiced trade between 179 reporting jurisdictions and 179 partner countries for years 2000-2016 and disaggregated commodities

  • Citation: Lépissier, Alice, Davis, William, & Ibrahim, Gamal. (2019). Trade Mis-Invoicing Dataset (Version 1). DOI

  • The entire data-set panel_results.Rdata with ~6 million observations is available online.

  • The unit of observation is a reporter-partner-commodity-year tuple, where the data represent the illicit flow embedded in a transaction between a reporter \(i\) and a partner \(j\) for a commodity \(c\) in year \(t\).

  • The full panel is ~2GB. This notebook works with smaller summary data-sets generated from the full panel. The code to generate these summary data-sets is available on the repository by running the script file Compute IFF Estimates.R.

Nota bene: the figures in this notebook that are not generated here can be reconstructed by running the Data Visualization.R script in that same repository.

Methodology for calculating mis-invoiced trade

  • We locate mis-invoicing in the discrepancies between reported trade flows and their mirrored statistics

  • But not all observed discrepancies are due to illicit motives!

  • Imports tend to be recorded on the basis of Cost of Insurance and Freight (CIF), while exports are recorded Free On Board (FOB)

Our approach

  1. Estimate the discrepancies between imports and mirror net exports as a function of both licit and illicit predictors

  2. Perform a harmonization procedure in order to generate a reconciled value that represents our best estimate of what the legitimate value of the trade should be on a FOB basis

  3. Calculate the IFF embedded in each transaction as the difference between the observed value and the reconciled value

What this data-set represents

  • The data-set presents estimates of mis-invoicing both in imports and exports. I will be working with the import data.

  • There is a low and a high variant of mis-invoicing estimates. Discussion of how they vary is beyond the scope of this project, but note that we will be working with the high variant exclusively, as those are the official numbers of UNECA.

  • A negative value represents an illicit inflow, while a positive value represents an illicit outflow.

  • The panel can be aggregated over several dimensions, e.g. partner country, commodity, year. There are two methods to aggregate up the illicit flow:

    • Net basis: simply add up all negative and positive values, so that inflows and outflows cancel each other out

    • Gross Excluding Reversals (GER) basis: ignore all inflows, and sum up the positive values only across trading partners

Aggregation strategy

It is important to note that for a given country pair \(i\) and \(j\) in a given year \(t\), the same trade flow can be associated with either an inflow or an outflow according to what commodity is traded. While it might seem unlikely that illicit funds might be traveling in both directions for the same trade flow, there could be a variety of different actors doing this for different reasons. For example, country \(i\) might have export taxes on raw materials and export subsidies for manufacturing output, which would give an incentive to under-invoice exports of raw materials (resulting in an illicit outflow) and to over-invoice exports of manufactured goods (resulting in an illicit inflow). Alternatively, a criminal syndicate that has a legitimate front company may use re-invoicing to send money to an affiliate in another country to make an investment (e.g. hiring “muscle” to fight off a competitor) and then bring funds back using exports to the same country when the investment bears fruit.

The object of analytical inquiry should guide the choice of aggregation strategy. For example, stakeholders interested in getting a picture of the total amount of funds departing a country on balance should favor a net aggregation basis. By contrast, stakeholders interested in better understanding the drivers and mechanisms of IFF should favor aggregation using GER in order to identify where money is flowing out or in.

In this project, I will be using unsupervised machine learning to extract insights on analytically relevant dimensions of variation, so I will use summary data-sets that have been aggregated on a Gross Excluding Reversals basis. In other words, I am more interested in the direction and topology of the illicit flows, rather than their magnitude.

Zoom in on Africa

  • During 2000-2016, the continent lost on average $83 billion a year in gross illicit outflows

  • Net cumulative flows during that period were $362 billion

Africa-map Source: generated by Data Visualization.R in


While generating estimates of the dollar value of illicit trade has been helpful to shed light on the severity of the problem, the next step in the analysis is to further understand the nature of the illicit activity in terms of its origins, destinations, and sectors.

Therefore, the goal of this project is to extract meaningful insights on illicit trade using unsupervised machine learning techniques. By doing so, I can identify analytically relevant categories and dimensions of variation, in order to generate hypotheses and guide further work.

This project will apply the following techniques to the data:

  1. Dimension reduction using Principal Components Analysis (PCA)

  2. Clustering

  3. Graph analysis

Data wrangling

I will work with two summary forms of the full data-set panel_results.Rdata: one where the data is aggregated up to the reporter country, and one where the dyadic nature of the data is preserved.

In the case where the data is aggregated up to the reporter country, there are two possible views of the feature space. On the one hand, the feature space can be spanned by the sectors of the different commodities (\(p=21\)). On the other hand, the feature space can be spanned by the reporter African countries (\(p=46\)).

  1. Mis-invoiced trade for countries (aggregated data)

    • View 1: unit of observation = reporter-year; features = sectors

    • View 2: unit of observation = sector-year; features = reporters

  2. Bilateral matrix of mis-invoiced trade (dyadic data)

    • Unit of observation = reporter-partner-year

Mis-invoiced trade data (for African reporting countries)

Let’s read in a summary data-set of the full panel which contains mis-invoicing estimates for reporter \(i\) in commodity \(c\) in year \(t\) (aggregated over partner countries \(j\) using the GER strategy).

IFF_Sector = pd.read_csv('Data/GER_Orig_Sect_Year_Africa.csv')

Let’s also extract the relevant labels for the countries in the data-set.

obs_info = IFF_Sector.reset_index().drop_duplicates(['reporter.ISO', 'year'])[['reporter.ISO', 
obs_info = obs_info.replace({'LIC': 'Low income', 
                             'LMC': 'Lower-middle income',
                             'UMC': 'Upper-middle income',
                             'HIC': 'High income'})
obs_info = obs_info.rename(columns={'rIncome': 'Income group (World Bank)', 
                                    'rDev': 'Country status (UN)'})

Mis-invoiced trade for countries by sectors

The table below displays the raw data of mis-invoiced trade for 46 African countries in 21 different sectors during 2000-2016, aggregated over partner countries.

section Imp_IFF_hi
reporter.ISO year
DZA 2001 Animal and Animal Products 1.914633e+08
2001 Pulp of Wood or of Other Fibrous Material 7.436143e+07
2001 Textiles 3.560644e+07
2001 Footwear and Headgear 1.146507e+06
2001 Stone, Glass, and Ceramics 2.935880e+07
... ... ... ...
ZWE 2007 Arms and Ammunition 0.000000e+00
2009 Works of Art 0.000000e+00
2013 Miscellaneous Manufactured Articles 0.000000e+00
2014 Miscellaneous Manufactured Articles 0.000000e+00
2014 Mineral Products 0.000000e+00

11133 rows × 2 columns

Mis-invoiced trade data (bilateral trade matrix)

Now, let’s read in another summary data-set of the full panel which contains mis-invoicing estimates for reporter \(i\) with partner \(j\) in year \(t\) (aggregated over commodities \(c\) using the GER strategy).

IFF_Dest = pd.read_csv('Data/GER_Orig_Dest_Year_Africa.csv')

Mis-invoiced trade for dyads

The table below displays the raw data of mis-invoiced trade for 46 African countries with 167 partner countries during 2000-2016, aggregated over commodity sectors.

partner.ISO Imp_IFF_hi
reporter.ISO year
DZA 2001 AND 1.609561e+04
2001 ARG 4.717459e+07
2001 AUS 2.027641e+07
2001 AUT 7.641706e+07
2001 BEL 2.285729e+07
... ... ... ...
ZWE 2014 MDG 0.000000e+00
2014 SGP 0.000000e+00
2014 CHE 0.000000e+00
2014 ARE 0.000000e+00
2015 UGA 0.000000e+00

34757 rows × 2 columns

Crosswalk data

Finally, let’s import a table that contains labels for the countries, such as their ISO codes, geographic regions, and income groupings that they belong to.

Metadata for countries

crosswalk = pd.read_excel("Data/crosswalk.xlsx").rename(columns={'Country': 'country'})
ISO3166.3 ISO3166.2 country UN_Region UN_Region_Code UN_Sub-region UN_Sub-region_Code UN_Intermediate_Region UN_Intermediate_Region_Code UN_M49_Code ... WB_Income_Group_Code WB_Region WB_Lending_Category WB_Other OECD EU28 Arab League Commonwealth Longitude Latitude
0 ABW AW Aruba Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 533.0 ... HIC Latin America and Caribbean .. NaN 0 0 0 0 -69.982677 12.520880
1 AFG AF Afghanistan Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
2 AFG AF Afghanistan, Islamic Republic of Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
3 AGO AO Angola Africa 2.0 Sub-Saharan Africa 202.0 Middle Africa 17.0 24.0 ... LMC Sub-Saharan Africa IBRD NaN 0 0 0 0 17.537368 -12.293361
4 AIA AI Anguila Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 660.0 ... NaN NaN NaN NaN 0 0 0 0 -63.064989 18.223959

5 rows × 25 columns

Auxiliary functions

I create three auxiliary functions to aid me in running Principal Component Analysis:

  1. A function which creates the feature space for PCA from a data-set in a long format (which is the format of IFF_Sector and IFF_Dest)

  2. A function which runs PCA and plots a biplot for two chosen principal components

  3. A function which plots a cumulative scree plot

def create_features(data, values, features, obs):
    Convert data-set in long format to wide and preserve information on year.
    data: {IFF_Sector_Imp, IFF_Dest_Imp, IFF_Dest_AFR, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    features_data = data.pivot_table(values=values, 
                                     index=[obs, 'year'], 
    return features_data
# Extra function to run the analysis on average mis-invoicing over the years
def create_features_mean(data, values, features, obs):
    Convert data-set in long format to wide.
    data: {IFF_Sector_mean, IFF_Dest_mean, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    features_data = data.pivot_table(values=values, 
    return features_data
def biplot_PCA(features_data, nPC=2, firstPC=1, secondPC=2, obs='reporter.ISO', show_loadings=False):
    Project the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and color observations
    by class label.

        features_data: as Pandas dataframe, data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        obs: string denoting index of class labels (in features_data)
        show_loadings: Boolean indicating whether PCA loadings should be displayed
        plot (interactive)
        pca_loadings (if show_loadings=True)
    # Run PCA (standardize data beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Extract PCA loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
    # Extract PCA scores
    pca_scores = pd.DataFrame(princ_comp, 
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    # Generate plot data
    if obs == 'reporter.ISO':
        plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
        color_obs = 'reporter'
        tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
        plot_data = pca_scores
        color_obs = 'section'
        tooltip_obs = ['section', 'year']

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)

    # Extract top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    # Project top 3 loadings over the space spanned by 2 principal components
    lines = alt.Chart().mark_line().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0,1], [toploadings_PC1, toploadings_PC2]):
        lines[i] = alt.Chart(dataset).mark_line(color=color).encode(
        x= PC1 +':Q',
        y= PC2 +':Q',
    # Add labels to the loadings
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0, 1], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text[i] = alt.Chart(dataset).mark_text(
                x= PC1 +':Q',
                y= PC2 +':Q',
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.X(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(color_obs, scale=alt.Scale(scheme='category20b'),
    # Bind it all together
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    

    if show_loadings:
        return pca_loadings
def scree_plot(features_data, show_explained_var=False):
    Create a cumulative scree splot and (optional) return the explained variance by each component.
    features_data: as Pandas dataframe, the data-set on which to run PCA
    show_explained_var: as Boolean, flag for whether to return explained variance
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=features_data_std.shape[1], random_state=234)
    princ_comp = pca.fit_transform(features_data_std)
    explained_var = pd.DataFrame({'PC': np.arange(1,features_data_std.shape[1]+1),
                                  'var': pca.explained_variance_ratio_,
                                  'cumvar': np.cumsum(pca.explained_variance_ratio_)})

    # Adapted from
    # Create a selection that chooses the nearest point & selects based on x-value
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['PC'], empty='none')

    # The basic line
    line = alt.Chart(explained_var).mark_line(interpolate='basis', color='#FDE725FF').encode(
            scale=alt.Scale(domain=(1, len(explained_var))),
            axis=alt.Axis(title='Principal Component')
            scale=alt.Scale(domain=(min(explained_var['cumvar']), 1)),
            axis=alt.Axis(title='Cumulative Variance Explained')

    # Transparent selectors across the chart. This is what tells us
    # the x-value of the cursor
    selectors = alt.Chart(explained_var).mark_point().encode(

    # Draw points on the line, and highlight based on selection
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))

    # Draw text labels near the points, and highlight based on selection
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, 'cumvar:Q', alt.value(' '))

    # Draw a rule at the location of the selection
    rules = alt.Chart(explained_var).mark_rule(color='gray').encode(

    # Put the five layers into a chart and bind the data
    out = alt.layer(
        line, selectors, points, rules, text
        title='Cumulative scree plot',
        width=500, height=300
    if show_explained_var:
        return explained_var[['PC', 'var']]

PCA on feature space (for individual reporting countries)

I now run PCA on the data-set IFF_Sector which includes mis-invoiced trade for a country \(i\) in commodity \(c\) in year \(t\).

Sector features

One possible view of IFF_Sector is to view the sectors as the feature space where \(p=21\). Therefore, I use the function create_features() to generate the data-set to be used in PCA analysis.

sector_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi', 
                                  features='section', obs='reporter.ISO')
section Animal and Animal Products Animal or Vegetable Fats and Oils Arms and Ammunition Base Metals Chemicals and Allied Industries Footwear and Headgear Machinery and Electrical Mineral Products Miscellaneous Manufactured Articles Pearls, Precious Stones and Metals ... Precision Instruments Prepared Foodstuffs Pulp of Wood or of Other Fibrous Material Raw Hides, Skins, Leather, and Furs Stone, Glass, and Ceramics Textiles Transportation Vegetable Products Wood and Wood Products Works of Art
reporter.ISO year
AGO 2009 2.694946e+08 7.652455e+07 67738.261242 1.025920e+09 5.115176e+08 2.920410e+07 1.525512e+09 1.727293e+09 2.472936e+08 5.575633e+05 ... 1.242156e+08 4.781184e+08 1.149992e+08 1.097769e+07 1.349018e+08 1.827787e+08 2.198838e+09 4.249722e+08 4.337043e+07 177640.929155
2010 2.064906e+08 7.311396e+07 0.000000 9.989842e+08 2.575443e+08 7.853065e+06 1.308337e+09 2.953387e+09 8.404302e+07 3.209177e+05 ... 1.115183e+08 2.342889e+08 7.332134e+07 4.013044e+06 7.601278e+07 8.285704e+07 9.611656e+08 1.959128e+08 2.310136e+07 404828.231824
2011 2.995594e+08 8.058180e+07 89365.880480 6.689511e+08 2.931591e+08 1.172881e+07 1.389743e+09 2.286539e+09 4.413341e+07 9.645181e+05 ... 1.418956e+08 3.159980e+08 9.911835e+07 9.995669e+06 6.607126e+07 7.912925e+07 7.070474e+08 3.734589e+08 1.363601e+07 586586.948365
2012 7.103770e+08 2.746924e+08 261265.895948 1.753970e+09 8.540407e+08 3.823724e+07 2.790368e+09 9.203666e+08 1.763904e+08 1.271289e+06 ... 2.073111e+08 1.025668e+09 2.641831e+08 1.394356e+07 1.630307e+08 1.683128e+08 1.975553e+09 6.541154e+08 3.980884e+07 739122.050820
2013 4.031560e+08 1.541679e+08 188343.533148 9.230300e+08 4.793908e+08 1.212363e+07 1.428283e+09 2.009915e+09 4.871315e+07 6.250082e+05 ... 1.063123e+08 4.931687e+08 1.798533e+08 8.930246e+06 7.882264e+07 7.858108e+07 4.962937e+09 3.995114e+08 1.836680e+07 418841.640492
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2011 3.098093e+07 6.060789e+07 9124.882478 1.693777e+08 2.260227e+09 2.817909e+06 3.231869e+08 2.747704e+08 9.872441e+06 3.618592e+06 ... 2.136919e+07 1.144497e+08 5.855387e+07 6.259512e+05 2.652474e+07 4.726781e+07 8.963644e+08 1.874874e+08 6.016952e+06 35481.251510
2012 2.784481e+07 5.549232e+07 856.631347 1.085453e+08 4.929549e+08 7.015919e+06 3.451931e+08 1.461844e+08 1.510668e+07 2.876014e+05 ... 3.162707e+07 2.025934e+08 6.024629e+07 1.412797e+06 2.482629e+07 5.147542e+07 9.468142e+08 1.900125e+08 7.901902e+06 59677.045599
2013 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2014 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2015 1.458046e+07 7.135043e+07 19192.507302 8.703951e+07 2.299085e+08 4.468801e+06 3.437244e+08 1.401916e+08 2.432263e+07 3.672531e+07 ... 3.233606e+07 3.778318e+07 3.157795e+07 1.061420e+06 1.724449e+07 2.841926e+07 2.137692e+08 1.585549e+08 4.569325e+06 6305.070602

624 rows × 21 columns


The figure below displays a biplot of the first 2 principal components, in addition to reporting the loadings for all principal components in the following table.

biplot_PCA(sector_features, 10, 1, 2, obs='reporter.ISO', show_loadings=True)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Animal and Animal Products 0.223097 -0.285686 -0.052218 0.200337 0.280789 -0.075192 -0.159824 0.157986 -0.133621 0.067581
Animal or Vegetable Fats and Oils 0.145805 -0.041750 0.262100 -0.281551 -0.378544 -0.470151 0.409970 0.471670 0.177591 -0.031639
Arms and Ammunition 0.030559 -0.044358 0.524677 -0.654182 0.434354 0.055627 -0.201368 -0.113006 -0.079142 -0.115541
Base Metals 0.235966 -0.206534 -0.261180 -0.147777 -0.019680 -0.153272 0.017922 -0.247084 0.174715 -0.069873
Chemicals and Allied Industries 0.271238 0.107591 -0.117222 -0.089472 -0.096976 -0.130645 -0.071165 -0.219172 -0.198166 0.036517
Footwear and Headgear 0.206357 0.378112 0.015170 -0.019961 -0.001383 -0.029500 -0.042735 0.010544 0.026510 -0.252435
Machinery and Electrical 0.272101 0.160466 0.076229 0.091921 -0.067637 -0.159494 0.022549 -0.135941 -0.157103 -0.115312
Mineral Products 0.238785 0.071925 -0.098357 -0.201011 -0.166463 0.073908 -0.133348 0.121464 -0.241190 0.675042
Miscellaneous Manufactured Articles 0.246123 0.283544 0.122775 0.033592 -0.049743 -0.014818 -0.017630 0.069480 -0.247438 -0.043076
Pearls, Precious Stones and Metals 0.060193 0.251657 -0.373387 -0.153599 0.567863 0.055855 0.647708 0.088517 -0.049224 0.089860
Plastics and Rubbers 0.264093 -0.144764 0.124109 0.236111 0.112342 0.023331 0.062706 0.076398 0.119468 -0.076649
Precision Instruments 0.223754 0.362288 -0.013533 0.023957 -0.084425 -0.090796 -0.080419 -0.186415 -0.262231 -0.078130
Prepared Foodstuffs 0.252722 -0.227483 -0.012139 0.160146 0.177798 -0.024717 -0.088546 0.289203 -0.187744 -0.058613
Pulp of Wood or of Other Fibrous Material 0.272304 -0.112890 -0.010570 0.074023 0.024044 -0.016865 0.003323 -0.123810 0.135842 0.200798
Raw Hides, Skins, Leather, and Furs 0.206454 0.084348 0.178427 0.121865 -0.101317 0.619649 0.102592 0.287981 -0.057302 -0.251296
Stone, Glass, and Ceramics 0.225603 0.111290 0.264504 0.269760 0.061210 -0.018303 0.176815 -0.368090 0.400752 -0.018317
Textiles 0.209936 -0.103101 -0.054321 -0.264334 -0.273480 0.542184 0.126078 -0.076346 0.216014 0.139611
Transportation 0.238443 -0.119628 0.256225 0.081261 0.184539 -0.059237 0.015553 -0.095856 0.219874 0.334486
Vegetable Products 0.245486 -0.284943 -0.122225 -0.029458 0.071482 -0.036325 -0.074223 0.202113 -0.106308 -0.240274
Wood and Wood Products 0.201178 -0.222154 -0.377723 -0.300290 -0.132165 0.007197 -0.048891 -0.185469 0.074259 -0.351532
Works of Art 0.089055 0.388758 -0.239308 -0.078967 0.157495 -0.041382 -0.486227 0.369405 0.561157 0.018339

The first two principal components can be interpreted as recovering variation in the size of the illicit flow. The top loadings on the first principal component include Machinery and Electrical, and Chemicals and Allied Industries, which are the sectors with the highest amount of mis-invoiced trade. The treemap figure below represents the average gross yearly outflow to different sectors in Africa, and we can see that the aforementioned sectors account for a large chunk of that.

By contrast, the loadings on the second component are associated with sectors which are specialized, precision sectors that do not account for a large amount of mis-invoicing, compared to others. Indeed, Works of Art, and Footwear and Headgear are in the top-right corner of the treemap figure (and do not even appear with labels as they are so small).

Thus, it seems like the magnitude of the trade mis-invoicing (in dollar values) explains the highest variability in the data.

treemap-sectors Source: generated by Data Visualization.R in

biplot_PCA(sector_features, 10, 3, 4, obs='reporter.ISO')

The sixth principal component (displayed in the figure below) captures variation with respect to the natural origin of the commodity being traded and its functional use. The top loadings include sectors where the commodity is generated from natural resources (e.g. animals or vegetables) and which is functionally used in clothing, cooking, and cosmetics (e.g. textiles, leathers, oils).

biplot_PCA(sector_features, 10, 5, 6, obs='reporter.ISO')

Explained variance

The first 6 principal components examined above explain 83% of the variance in the data. The first principal component, which captures magnitude of the flow, accounts of 52% of the variance alone.

scree_plot(sector_features, show_explained_var=True)
PC var
0 1 0.524988
1 2 0.121415
2 3 0.054159
3 4 0.049898
4 5 0.042107
5 6 0.041007
6 7 0.036014
7 8 0.028847
8 9 0.024146
9 10 0.018279
10 11 0.012558
11 12 0.009987
12 13 0.007011
13 14 0.006890
14 15 0.005213
15 16 0.004848
16 17 0.004001
17 18 0.002819
18 19 0.002489
19 20 0.001800
20 21 0.001524

Variance-stabilizing and normalizing transformations

The figure below displays the distribution of mis-invoicing for the 21 different sectors. It is heavily right-skewed across all sectors, with very few high-value mis-invoiced transactions.


I explore two possible transformations of the data: a (modified) log transformation and a Yeo–Johnson. The values of the data-set are all non-negative, with some values equal to 0. Therefore, the figure below displays the distribution of the data after taking \(log(x+1)\).

sector_features_log = sector_features.apply(lambda x: np.log(x+1) if np.issubdtype(x.dtype, np.number) else x, axis=0)

The Yeo-Johnson transformation is similar to a Box-Cox transformation, but allows for zero values. When \(x=0\) and the tuning parameter \(\lambda=0\), it will apply the same modified log transformation, \(log(x=1)\), as described above.

sector_features_yeo = power_transform(sector_features, method='yeo-johnson', standardize=True)
sector_features_yeo = pd.DataFrame(sector_features_yeo,
fig, axes = joypy.joyplot(sector_features_yeo,, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors (Yeo–Johnson transformation)');

The two figures below display the results of running the PCA analysis on transformations of the data using a modified log and a Yeo-Johnson transformation, respectively.

The top loadings on the second principal component are the same and refer to Works of Art, Pearls, Precious Stones and Metals, and Arms and Ammunitions. These sectors refer to extremely high value-added sectors where the traded commodity is a finished product.

By contrast, the loadings on the first principal component (in either transformation) refer to heavy industries, where the commodity is an intermediate product (as opposed to a finished product). Even after applying normalizing transformations, the first principal component recovers sectors which were responsible for a large amount of the dollar value of mis-invoicing in the continent during 2000-2016.

biplot_PCA(sector_features_log, 10, 1, 2, obs='reporter.ISO')