Unsupervised learning on illicit trade data

Alice Lépissier

📧 alepissier@bren.ucsb.edu

Bren School of Environmental Science and Management

Policy and media attention on illicit financial flows (IFF) has increased, with the recognition that Africa is a net creditor to the world.

NYT-2013 Guardian-2015
New York Times (2013) Guardian (2015)
Guardian-2017 Economist-2019
Guardian (2017) Economist (2019)

What is trade mis-invoicing?

  • The deliberate mis-statement of price or quantity of internationally traded goods in invoices presented to customs
  • Can occur at import or export
  • Can result in an inflow or outflow of money

Motivations for trade mis-invoicing include:

  • Evading tariffs
  • Exploiting subsidy regimes
  • Subverting forex and capital controls
  • Hiding transfers of wealth

Mechanisms of mis-invoicing

conceptual-model

Why does trade mis-invoicing matter?

  • Outflows undermine the fiscal base and public spending
  • Financing gap needed to meet the Sustainable Development Goals (SDGs)
  • Combating trade mis-invoicing is crucial for the mobilization of domestic resources in the continent, and can catalyze sustainable development
governance-loop
Governance loop (credit: William Davis)

Data source

  • Trade mis-invoicing data-set of the United Nations Economic Commission for Africa
  • Panel with $n=6,248,254$ of mis-invoiced trade between 179 reporting jurisdictions and 179 partner countries for years 2000-2016 and disaggregated commodities
  • Citation: Lépissier, Alice, Davis, William, & Ibrahim, Gamal. (2019). Trade Mis-Invoicing Dataset (Version 1). DOI
  • The entire data-set panel_results.Rdata with ~6 million observations is available online.
  • The unit of observation is a reporter-partner-commodity-year tuple, where the data represent the illicit flow embedded in a transaction between a reporter $i$ and a partner $j$ for a commodity $c$ in year $t$.
  • The full panel is ~2GB. This notebook works with smaller summary data-sets generated from the full panel. The code to generate these summary data-sets is available on the repository https://github.com/walice/Trade-IFF by running the script file Compute IFF Estimates.R.

Methodology for calculating mis-invoiced trade

  • We locate mis-invoicing in the discrepancies between reported trade flows and their mirrored statistics
  • But not all observed discrepancies are due to illicit motives!
  • Imports tend to be recorded on the basis of Cost of Insurance and Freight (CIF), while exports are recorded Free On Board (FOB)

Our approach

  1. Estimate the discrepancies between imports and mirror net exports as a function of both licit and illicit predictors
  2. Perform a harmonization procedure in order to generate a reconciled value that represents our best estimate of what the legitimate value of the trade should be on a FOB basis
  3. Calculate the IFF embedded in each transaction as the difference between the observed value and the reconciled value
  • The panel can be aggregated over several dimensions, e.g. partner country, commodity, year. There are two methods to aggregate up the illicit flow:
    • Net basis: simply add up all negative and positive values, so that inflows and outflows cancel each other out
    • Gross Excluding Reversals (GER) basis: ignore all inflows, and sum up the positive values only across trading partners

Zoom in on Africa

  • During 2000-2016, the continent lost on average \$83 billion a year in gross illicit outflows
  • Net cumulative flows during that period were \$362 billion

Africa-map Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Goals

This project will apply the following techniques to the data:

  1. Dimension reduction using Principal Components Analysis (PCA)
  2. Clustering
  3. Graph analysis

Data wrangling

  1. Mis-invoiced trade for countries (aggregated data)
    • View 1: unit of observation = reporter-year; features = sectors
    • View 2: unit of observation = sector-year; features = reporters
  2. Bilateral matrix of mis-invoiced trade (dyadic data)
    • Unit of observation = reporter-partner-year

Mis-invoiced trade for countries by sectors

In [7]:
# Extract mis-invoicing in imports
IFF_Sector_Imp = IFF_Sector[['section', 'Imp_IFF_hi']]
IFF_Sector_Imp
Out[7]:
section Imp_IFF_hi
reporter.ISO year
DZA 2001 Animal and Animal Products 1.914633e+08
2001 Pulp of Wood or of Other Fibrous Material 7.436143e+07
2001 Textiles 3.560644e+07
2001 Footwear and Headgear 1.146507e+06
2001 Stone, Glass, and Ceramics 2.935880e+07
... ... ... ...
ZWE 2007 Arms and Ammunition 0.000000e+00
2009 Works of Art 0.000000e+00
2013 Miscellaneous Manufactured Articles 0.000000e+00
2014 Miscellaneous Manufactured Articles 0.000000e+00
2014 Mineral Products 0.000000e+00

11133 rows × 2 columns

Mis-invoiced trade for dyads

In [12]:
# Extract mis-invoicing in imports
IFF_Dest_Imp = IFF_Dest[['partner.ISO', 'Imp_IFF_hi']]
IFF_Dest_Imp
Out[12]:
partner.ISO Imp_IFF_hi
reporter.ISO year
DZA 2001 AND 1.609561e+04
2001 ARG 4.717459e+07
2001 AUS 2.027641e+07
2001 AUT 7.641706e+07
2001 BEL 2.285729e+07
... ... ... ...
ZWE 2014 MDG 0.000000e+00
2014 SGP 0.000000e+00
2014 CHE 0.000000e+00
2014 ARE 0.000000e+00
2015 UGA 0.000000e+00

34757 rows × 2 columns

Metadata for countries

In [14]:
crosswalk = pd.read_excel("Data/crosswalk.xlsx").rename(columns={'Country': 'country'})
crosswalk.head()
Out[14]:
ISO3166.3 ISO3166.2 country UN_Region UN_Region_Code UN_Sub-region UN_Sub-region_Code UN_Intermediate_Region UN_Intermediate_Region_Code UN_M49_Code ... WB_Income_Group_Code WB_Region WB_Lending_Category WB_Other OECD EU28 Arab League Commonwealth Longitude Latitude
0 ABW AW Aruba Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 533.0 ... HIC Latin America and Caribbean .. NaN 0 0 0 0 -69.982677 12.520880
1 AFG AF Afghanistan Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
2 AFG AF Afghanistan, Islamic Republic of Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
3 AGO AO Angola Africa 2.0 Sub-Saharan Africa 202.0 Middle Africa 17.0 24.0 ... LMC Sub-Saharan Africa IBRD NaN 0 0 0 0 17.537368 -12.293361
4 AIA AI Anguila Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 660.0 ... NaN NaN NaN NaN 0 0 0 0 -63.064989 18.223959

5 rows × 25 columns

Auxiliary functions

In [15]:
def create_features(data, values, features, obs):
    """
    Convert data-set in long format to wide and preserve information on year.
    
    data: {IFF_Sector_Imp, IFF_Dest_Imp, IFF_Dest_AFR, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    """
    features_data = data.pivot_table(values=values, 
                                     columns=features, 
                                     index=[obs, 'year'], 
                                     fill_value=0)
    return features_data
In [17]:
def biplot_PCA(features_data, nPC=2, firstPC=1, secondPC=2, obs='reporter.ISO', show_loadings=False):
    """
    Project the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and color observations
    by class label.

    Args:
        features_data: as Pandas dataframe, data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        obs: string denoting index of class labels (in features_data)
        show_loadings: Boolean indicating whether PCA loadings should be displayed
    Returns:
        plot (interactive)
        pca_loadings (if show_loadings=True)
    """
        
    # Run PCA (standardize data beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Extract PCA loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
                                columns=cols,
                                index=list(features_data.columns))
   
    # Extract PCA scores
    pca_scores = pd.DataFrame(princ_comp, 
                              columns=cols)
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    
    # Generate plot data
    if obs == 'reporter.ISO':
        plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
        color_obs = 'reporter'
        tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
    else:
        plot_data = pca_scores
        color_obs = 'section'
        tooltip_obs = ['section', 'year']

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)

    # Extract top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    
    # Project top 3 loadings over the space spanned by 2 principal components
    lines = alt.Chart().mark_line().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0,1], [toploadings_PC1, toploadings_PC2]):
        lines[i] = alt.Chart(dataset).mark_line(color=color).encode(
        x= PC1 +':Q',
        y= PC2 +':Q',
        detail='index'
    ).properties(
        width=400,
        height=400
    )
    
    # Add labels to the loadings
    text=alt.Chart().mark_text().encode()
    for color, i, dataset in zip(['#440154FF', '#21908CFF'], [0, 1], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text[i] = alt.Chart(dataset).mark_text(
                align='left',
                baseline='bottom',
                color=color
            ).encode(
                x= PC1 +':Q',
                y= PC2 +':Q',
                text='index'
            )
    
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.X(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(color_obs, scale=alt.Scale(scheme='category20b'),
                       legend=alt.Legend(orient='right')),
        tooltip=tooltip_obs
    ).interactive()
    
    # Bind it all together
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    
    chart.display()

    if show_loadings:
        return pca_loadings
In [18]:
def scree_plot(features_data, show_explained_var=False):
    """
    Create a cumulative scree splot and (optional) return the explained variance by each component.
    
    features_data: as Pandas dataframe, the data-set on which to run PCA
    show_explained_var: as Boolean, flag for whether to return explained variance
    """
    
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=features_data_std.shape[1], random_state=234)
    princ_comp = pca.fit_transform(features_data_std)
    
    explained_var = pd.DataFrame({'PC': np.arange(1,features_data_std.shape[1]+1),
                                  'var': pca.explained_variance_ratio_,
                                  'cumvar': np.cumsum(pca.explained_variance_ratio_)})

    # Adapted from https://altair-viz.github.io/gallery/multiline_tooltip.html
    # Create a selection that chooses the nearest point & selects based on x-value
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['PC'], empty='none')

    # The basic line
    line = alt.Chart(explained_var).mark_line(interpolate='basis', color='#FDE725FF').encode(
        alt.X('PC:Q',
            scale=alt.Scale(domain=(1, len(explained_var))),
            axis=alt.Axis(title='Principal Component')
        ),
        alt.Y('cumvar:Q',
            scale=alt.Scale(domain=(min(explained_var['cumvar']), 1)),
            axis=alt.Axis(title='Cumulative Variance Explained')
        ),
    )

    # Transparent selectors across the chart. This is what tells us
    # the x-value of the cursor
    selectors = alt.Chart(explained_var).mark_point().encode(
        x='PC:Q',
        opacity=alt.value(0),
    ).add_selection(
        nearest
    )

    # Draw points on the line, and highlight based on selection
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))
    )

    # Draw text labels near the points, and highlight based on selection
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, 'cumvar:Q', alt.value(' '))
    )

    # Draw a rule at the location of the selection
    rules = alt.Chart(explained_var).mark_rule(color='gray').encode(
        x='PC:Q',
    ).transform_filter(
        nearest
    )

    # Put the five layers into a chart and bind the data
    out = alt.layer(
        line, selectors, points, rules, text
    ).properties(
        title='Cumulative scree plot',
        width=500, height=300
    )
    
    out.display()
    
    if show_explained_var:
        return explained_var[['PC', 'var']]

PCA on feature space (for individual reporting countries)

Sector features

In [19]:
sector_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi', 
                                  features='section', obs='reporter.ISO')
sector_features
Out[19]:
section Animal and Animal Products Animal or Vegetable Fats and Oils Arms and Ammunition Base Metals Chemicals and Allied Industries Footwear and Headgear Machinery and Electrical Mineral Products Miscellaneous Manufactured Articles Pearls, Precious Stones and Metals ... Precision Instruments Prepared Foodstuffs Pulp of Wood or of Other Fibrous Material Raw Hides, Skins, Leather, and Furs Stone, Glass, and Ceramics Textiles Transportation Vegetable Products Wood and Wood Products Works of Art
reporter.ISO year
AGO 2009 2.694946e+08 7.652455e+07 67738.261242 1.025920e+09 5.115176e+08 2.920410e+07 1.525512e+09 1.727293e+09 2.472936e+08 5.575633e+05 ... 1.242156e+08 4.781184e+08 1.149992e+08 1.097769e+07 1.349018e+08 1.827787e+08 2.198838e+09 4.249722e+08 4.337043e+07 177640.929155
2010 2.064906e+08 7.311396e+07 0.000000 9.989842e+08 2.575443e+08 7.853065e+06 1.308337e+09 2.953387e+09 8.404302e+07 3.209177e+05 ... 1.115183e+08 2.342889e+08 7.332134e+07 4.013044e+06 7.601278e+07 8.285704e+07 9.611656e+08 1.959128e+08 2.310136e+07 404828.231824
2011 2.995594e+08 8.058180e+07 89365.880480 6.689511e+08 2.931591e+08 1.172881e+07 1.389743e+09 2.286539e+09 4.413341e+07 9.645181e+05 ... 1.418956e+08 3.159980e+08 9.911835e+07 9.995669e+06 6.607126e+07 7.912925e+07 7.070474e+08 3.734589e+08 1.363601e+07 586586.948365
2012 7.103770e+08 2.746924e+08 261265.895948 1.753970e+09 8.540407e+08 3.823724e+07 2.790368e+09 9.203666e+08 1.763904e+08 1.271289e+06 ... 2.073111e+08 1.025668e+09 2.641831e+08 1.394356e+07 1.630307e+08 1.683128e+08 1.975553e+09 6.541154e+08 3.980884e+07 739122.050820
2013 4.031560e+08 1.541679e+08 188343.533148 9.230300e+08 4.793908e+08 1.212363e+07 1.428283e+09 2.009915e+09 4.871315e+07 6.250082e+05 ... 1.063123e+08 4.931687e+08 1.798533e+08 8.930246e+06 7.882264e+07 7.858108e+07 4.962937e+09 3.995114e+08 1.836680e+07 418841.640492
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2011 3.098093e+07 6.060789e+07 9124.882478 1.693777e+08 2.260227e+09 2.817909e+06 3.231869e+08 2.747704e+08 9.872441e+06 3.618592e+06 ... 2.136919e+07 1.144497e+08 5.855387e+07 6.259512e+05 2.652474e+07 4.726781e+07 8.963644e+08 1.874874e+08 6.016952e+06 35481.251510
2012 2.784481e+07 5.549232e+07 856.631347 1.085453e+08 4.929549e+08 7.015919e+06 3.451931e+08 1.461844e+08 1.510668e+07 2.876014e+05 ... 3.162707e+07 2.025934e+08 6.024629e+07 1.412797e+06 2.482629e+07 5.147542e+07 9.468142e+08 1.900125e+08 7.901902e+06 59677.045599
2013 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2014 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2015 1.458046e+07 7.135043e+07 19192.507302 8.703951e+07 2.299085e+08 4.468801e+06 3.437244e+08 1.401916e+08 2.432263e+07 3.672531e+07 ... 3.233606e+07 3.778318e+07 3.157795e+07 1.061420e+06 1.724449e+07 2.841926e+07 2.137692e+08 1.585549e+08 4.569325e+06 6305.070602

624 rows × 21 columns

Biplots

In [20]:
biplot_PCA(sector_features, 10, 1, 2, obs='reporter.ISO', show_loadings=True)
Out[20]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Animal and Animal Products 0.223097 -0.285686 -0.052218 0.200337 0.280789 -0.075192 -0.159824 0.157986 -0.133621 0.067581
Animal or Vegetable Fats and Oils 0.145805 -0.041750 0.262100 -0.281551 -0.378544 -0.470151 0.409970 0.471670 0.177591 -0.031639
Arms and Ammunition 0.030559 -0.044358 0.524677 -0.654182 0.434354 0.055627 -0.201368 -0.113006 -0.079142 -0.115541
Base Metals 0.235966 -0.206534 -0.261180 -0.147777 -0.019680 -0.153272 0.017922 -0.247084 0.174715 -0.069873
Chemicals and Allied Industries 0.271238 0.107591 -0.117222 -0.089472 -0.096976 -0.130645 -0.071165 -0.219172 -0.198166 0.036517
Footwear and Headgear 0.206357 0.378112 0.015170 -0.019961 -0.001383 -0.029500 -0.042735 0.010544 0.026510 -0.252435
Machinery and Electrical 0.272101 0.160466 0.076229 0.091921 -0.067637 -0.159494 0.022549 -0.135941 -0.157103 -0.115312
Mineral Products 0.238785 0.071925 -0.098357 -0.201011 -0.166463 0.073908 -0.133348 0.121464 -0.241190 0.675042
Miscellaneous Manufactured Articles 0.246123 0.283544 0.122775 0.033592 -0.049743 -0.014818 -0.017630 0.069480 -0.247438 -0.043076
Pearls, Precious Stones and Metals 0.060193 0.251657 -0.373387 -0.153599 0.567863 0.055855 0.647708 0.088517 -0.049224 0.089860
Plastics and Rubbers 0.264093 -0.144764 0.124109 0.236111 0.112342 0.023331 0.062706 0.076398 0.119468 -0.076649
Precision Instruments 0.223754 0.362288 -0.013533 0.023957 -0.084425 -0.090796 -0.080419 -0.186415 -0.262231 -0.078130
Prepared Foodstuffs 0.252722 -0.227483 -0.012139 0.160146 0.177798 -0.024717 -0.088546 0.289203 -0.187744 -0.058613
Pulp of Wood or of Other Fibrous Material 0.272304 -0.112890 -0.010570 0.074023 0.024044 -0.016865 0.003323 -0.123810 0.135842 0.200798
Raw Hides, Skins, Leather, and Furs 0.206454 0.084348 0.178427 0.121865 -0.101317 0.619649 0.102592 0.287981 -0.057302 -0.251296
Stone, Glass, and Ceramics 0.225603 0.111290 0.264504 0.269760 0.061210 -0.018303 0.176815 -0.368090 0.400752 -0.018317
Textiles 0.209936 -0.103101 -0.054321 -0.264334 -0.273480 0.542184 0.126078 -0.076346 0.216014 0.139611
Transportation 0.238443 -0.119628 0.256225 0.081261 0.184539 -0.059237 0.015553 -0.095856 0.219874 0.334486
Vegetable Products 0.245486 -0.284943 -0.122225 -0.029458 0.071482 -0.036325 -0.074223 0.202113 -0.106308 -0.240274
Wood and Wood Products 0.201178 -0.222154 -0.377723 -0.300290 -0.132165 0.007197 -0.048891 -0.185469 0.074259 -0.351532
Works of Art 0.089055 0.388758 -0.239308 -0.078967 0.157495 -0.041382 -0.486227 0.369405 0.561157 0.018339

treemap-sectors Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

In [22]:
biplot_PCA(sector_features, 10, 5, 6, obs='reporter.ISO')

Explained variance

In [23]:
scree_plot(sector_features, show_explained_var=True)
Out[23]:
PC var
0 1 0.524988
1 2 0.121415
2 3 0.054159
3 4 0.049898
4 5 0.042107
5 6 0.041007
6 7 0.036014
7 8 0.028847
8 9 0.024146
9 10 0.018279
10 11 0.012558
11 12 0.009987
12 13 0.007011
13 14 0.006890
14 15 0.005213
15 16 0.004848
16 17 0.004001
17 18 0.002819
18 19 0.002489
19 20 0.001800
20 21 0.001524

Variance-stabilizing and normalizing transformations

In [24]:
# Plot distribution of illicit flow in each feature (i.e. sector)
fig, axes = joypy.joyplot(sector_features, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors');
In [28]:
sector_features_yeo = pd.DataFrame(sector_features_yeo,
                                   index=sector_features.index,
                                   columns=sector_features.columns)
fig, axes = joypy.joyplot(sector_features_yeo, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors (Yeo–Johnson transformation)');
In [30]:
biplot_PCA(sector_features_yeo, 10, 1, 2, obs='reporter.ISO')

Country features

In [32]:
country_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi',
                                   features='reporter.ISO', obs='section')
country_features
Out[32]:
reporter.ISO AGO BDI BEN BFA BWA CAF CIV CMR COG COM ... STP SWZ SYC TGO TUN TZA UGA ZAF ZMB ZWE
section year
Animal and Animal Products 2000 0.000000 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.00000 0.0 ... 0.000000 0.000000 0.000000e+00 4.044600e+06 1.572640e+07 1.060197e+06 2.902091e+05 2.341209e+07 0.000000e+00 0.000000e+00
2001 0.000000 0.000000 9.464068e+06 2.013378e+06 4.018742e+06 351958.882652 7.103384e+07 2.552567e+07 0.00000 0.0 ... 0.000000 0.000000 3.439935e+07 1.087398e+07 0.000000e+00 0.000000e+00 2.522664e+05 1.854458e+07 1.832385e+06 0.000000e+00
2002 0.000000 288779.572593 1.458450e+07 7.756380e+05 3.524820e+06 213560.077651 7.767896e+07 2.907597e+07 0.00000 0.0 ... 0.000000 0.000000 0.000000e+00 6.574580e+06 9.664435e+06 0.000000e+00 6.944404e+05 1.452489e+07 4.354400e+05 7.346496e+06
2003 0.000000 217920.960357 2.297721e+07 4.533317e+05 0.000000e+00 0.000000 1.286866e+08 3.485783e+07 0.00000 0.0 ... 0.000000 0.000000 0.000000e+00 6.290272e+06 1.732785e+07 1.294665e+06 1.358896e+06 2.153838e+07 1.422125e+06 0.000000e+00
2004 0.000000 0.000000 2.742553e+07 3.429694e+05 5.275531e+04 0.000000 9.283012e+07 4.940023e+07 0.00000 0.0 ... 0.000000 59632.763605 0.000000e+00 3.901862e+06 3.574647e+07 0.000000e+00 8.053124e+05 2.868016e+07 2.791570e+06 0.000000e+00
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Works of Art 2012 739122.050820 0.000000 0.000000e+00 1.054676e+04 2.623004e+05 0.000000 1.820478e+04 3.875133e+05 0.00000 0.0 ... 0.000000 0.000000 0.000000e+00 1.125696e+03 0.000000e+00 7.252236e+04 1.388514e+04 1.371564e+07 3.453691e+04 5.967705e+04
2013 418841.640492 0.000000 0.000000e+00 0.000000e+00 2.282032e+04 0.000000 1.287629e+04 2.104949e+04 0.00000 0.0 ... 9584.734907 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 5.191721e+05 2.986963e+04 1.006657e+07 1.017682e+05 0.000000e+00
2014 390140.730849 0.000000 0.000000e+00 0.000000e+00 9.331590e+04 0.000000 1.017252e+06 1.826816e+05 8003.37246 0.0 ... 20941.755221 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 6.731774e+04 2.129568e+04 2.543377e+07 0.000000e+00 0.000000e+00
2015 0.000000 0.000000 0.000000e+00 0.000000e+00 5.977903e+04 0.000000 0.000000e+00 0.000000e+00 0.00000 0.0 ... 4447.368931 0.000000 5.857979e+05 0.000000e+00 3.088086e+03 5.385830e+05 1.193596e+04 6.697451e+06 5.843484e+04 6.305071e+03
2016 0.000000 0.000000 0.000000e+00 0.000000e+00 4.195233e+04 0.000000 0.000000e+00 0.000000e+00 0.00000 0.0 ... 1327.652396 0.000000 4.463501e+05 0.000000e+00 0.000000e+00 3.443084e+04 2.305443e+03 1.295357e+07 0.000000e+00 0.000000e+00

357 rows × 46 columns

Biplots

In [33]:
biplot_PCA(country_features, 10, 1, 2, obs='section', show_loadings=True)
Out[33]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
AGO 0.172866 5.695754e-02 1.915290e-01 -6.571267e-02 -1.498597e-02 2.683983e-01 -6.131957e-02 -9.901379e-02 -1.352142e-02 1.960619e-01
BDI 0.159396 1.399172e-02 1.100982e-01 1.072287e-01 -5.780584e-02 -3.006342e-02 1.311807e-01 -6.306715e-02 8.380471e-02 -1.595603e-01
BEN 0.166261 2.839527e-01 -4.853377e-02 -1.227506e-01 -8.929173e-02 2.840864e-02 -5.069704e-02 -1.050768e-01 1.115173e-01 -1.479526e-01
BFA 0.200912 1.961349e-01 -1.479219e-02 -3.284903e-02 -7.687652e-03 1.947496e-01 -8.961797e-03 -9.812748e-02 5.298346e-02 -6.100222e-02
BWA 0.058704 5.101605e-04 1.048248e-01 1.176491e-01 1.116177e-01 4.542630e-02 5.291913e-02 1.702242e-01 1.074927e-01 6.990029e-01
CAF 0.136213 -2.941790e-01 7.144585e-02 1.466869e-01 1.005191e-01 -8.211151e-04 1.767076e-02 -9.841998e-02 4.459720e-02 -6.455230e-02
CIV 0.170157 -4.190252e-02 -1.992766e-01 -1.178971e-01 -8.139640e-02 2.099028e-01 1.481461e-01 -9.124417e-02 9.430036e-02 -2.959315e-02
CMR 0.199563 6.685309e-03 2.723196e-02 -1.591573e-01 -1.158744e-01 4.651018e-02 4.997217e-02 -4.900403e-02 6.736277e-02 -4.601119e-02
COG 0.092230 -1.314851e-01 3.146417e-01 -1.361157e-01 -2.737181e-01 2.682128e-01 -4.522047e-02 1.802042e-02 4.328099e-02 3.111778e-02
COM 0.096409 -2.063490e-01 2.921122e-01 -2.226963e-01 -7.053853e-02 -1.456883e-01 8.379586e-02 -6.576844e-02 9.280223e-02 -1.472937e-01
CPV 0.162562 -1.932140e-01 3.911200e-02 -6.076881e-02 6.992487e-02 3.542767e-02 2.890140e-01 -1.259045e-01 5.085305e-02 -4.426356e-02
DZA 0.097941 4.125012e-02 -3.694645e-02 4.185959e-01 -2.291923e-01 -2.224267e-01 -7.905464e-02 8.880404e-02 -9.011562e-02 -5.841375e-02
EGY 0.175564 1.852732e-01 9.778806e-02 1.870187e-01 8.946189e-03 -2.924499e-03 1.181851e-02 -4.328654e-02 4.339753e-03 -1.929171e-01
ETH 0.121994 -1.897909e-01 7.695418e-02 3.167105e-01 2.484336e-01 -1.802844e-01 -7.872203e-02 -9.597359e-02 -2.026803e-02 -1.094738e-01
GAB 0.042052 -1.989049e-01 -3.375917e-01 1.247060e-02 -2.588424e-01 -4.558121e-02 8.313195e-02 9.621236e-02 -7.997286e-02 -1.514322e-01
GHA 0.107784 -1.996829e-01 -6.503997e-02 6.007675e-02 -3.272539e-01 2.451643e-01 4.926298e-02 2.807364e-01 -3.340735e-01 -5.059900e-03
GIN 0.052400 -1.350973e-01 -3.351174e-01 -8.155761e-02 -4.401417e-02 -4.334381e-02 -3.444465e-01 -1.967460e-02 1.193137e-01 -2.723141e-02
GMB 0.162689 8.598690e-02 -9.929674e-02 -2.316451e-01 2.262847e-01 1.330992e-01 2.662606e-01 1.742886e-01 -9.401992e-02 -6.012329e-02
GNB 0.000016 -1.996604e-02 -8.847847e-02 -2.131943e-02 6.445089e-02 9.726163e-02 -2.861441e-01 3.631337e-01 5.592464e-01 -2.128313e-01
KEN 0.117135 -1.733487e-01 -2.826062e-01 -1.414929e-01 -4.852900e-02 1.122855e-01 -1.773947e-01 -2.061400e-01 -2.190356e-01 1.721080e-01
LBY 0.000000 0.000000e+00 -0.000000e+00 1.654361e-24 1.355253e-20 0.000000e+00 0.000000e+00 2.081668e-17 -2.220446e-16 -9.714451e-17
LSO 0.031001 -1.145362e-01 1.172217e-01 -8.783260e-02 -6.316290e-02 1.845925e-01 -1.155909e-01 4.273390e-01 1.846813e-01 -2.597209e-02
MAR 0.196450 -1.442017e-03 -1.407898e-01 -1.151253e-01 4.096951e-02 -2.501230e-01 -1.218791e-01 -1.159986e-01 -2.772756e-02 9.286997e-02
MDG 0.187159 1.768800e-01 -7.603241e-02 1.954600e-02 3.882583e-02 -6.632947e-02 1.303028e-02 1.697959e-01 3.997142e-02 -1.389174e-01
MLI 0.160911 -1.212446e-01 -4.690084e-02 -1.486856e-01 8.742975e-02 -2.762972e-01 8.729569e-02 2.434701e-01 -8.316575e-02 3.368757e-02
MOZ 0.147230 1.608981e-01 9.473738e-02 -1.983947e-02 -2.406356e-01 -2.654582e-01 -2.507807e-03 5.991142e-02 9.261240e-02 1.184466e-01
MRT 0.142351 2.001775e-01 7.014463e-02 -1.332407e-01 -2.101042e-01 -3.003770e-01 -1.237907e-01 -9.258926e-02 7.149717e-02 1.249696e-01
MUS 0.183529 7.441746e-02 -1.253785e-01 7.372575e-02 1.275615e-01 4.308231e-02 8.698864e-02 1.594759e-01 3.117765e-02 8.570793e-02
MWI 0.196370 1.208545e-01 -1.804584e-02 -1.462830e-01 2.430288e-02 7.000297e-02 -7.395843e-02 -3.780468e-02 -6.553461e-02 -1.841761e-02
NAM 0.173960 1.201903e-01 5.978034e-02 9.556719e-02 2.108371e-01 -2.454492e-02 2.251590e-01 1.968563e-01 -5.019903e-04 1.281223e-01
NER 0.191899 -1.071062e-01 1.036822e-01 -3.490165e-02 8.799988e-03 4.101665e-02 -1.036213e-01 -1.501394e-01 5.717845e-02 7.629436e-02
NGA 0.153000 -8.916565e-02 2.197113e-01 2.082393e-02 -2.395585e-01 -1.659439e-01 -1.019044e-01 1.527359e-01 -3.809505e-02 8.557721e-02
RWA 0.186734 -3.747262e-02 1.608489e-01 2.275355e-01 3.285765e-02 9.220145e-02 -1.415166e-01 -1.482959e-01 -3.582179e-03 2.383508e-02
SDN 0.122997 -2.804786e-01 1.221158e-01 -1.345090e-01 -5.996290e-02 -1.354197e-01 -2.219003e-02 1.378054e-01 -7.715869e-02 7.151525e-04
SEN 0.198048 -3.415201e-02 -1.017232e-01 -1.030800e-01 -1.082653e-01 -2.190300e-02 -1.642827e-02 -9.561341e-02 -9.388535e-02 5.110459e-02
SLE 0.000000 -2.281917e-34 3.682813e-32 6.034292e-30 -1.730867e-25 -3.631998e-25 -1.524001e-23 1.682666e-21 -1.953635e-20 -7.252720e-21
STP 0.093915 2.692090e-01 7.292042e-02 2.717052e-01 -2.160666e-01 2.404257e-01 -1.858177e-02 -4.529530e-03 -4.636809e-02 -6.325914e-02
SWZ 0.004459 -9.037762e-02 -2.172397e-01 4.529566e-02 -1.475804e-01 -4.240062e-03 1.559841e-01 -2.357850e-01 5.362103e-01 2.615336e-01
SYC 0.012375 -7.886680e-02 -1.900444e-01 2.392704e-01 -2.750048e-01 -4.851200e-02 5.049505e-01 2.850039e-02 1.579691e-01 1.255474e-02
TGO 0.184941 8.829184e-02 -4.195564e-02 1.516059e-02 -3.407124e-02 -7.070676e-02 -2.815878e-02 -5.328573e-03 -7.180032e-03 -5.552878e-02
TUN 0.189500 -1.490447e-01 -5.294429e-02 1.368356e-01 1.828756e-01 1.829014e-02 -1.057055e-01 1.539954e-01 1.181978e-02 -9.518775e-04
TZA 0.198557 1.129045e-01 -8.978178e-02 -1.479353e-01 1.914884e-01 -8.737537e-02 1.348190e-01 8.895631e-02 -7.496216e-02 2.430973e-02
UGA 0.220482 1.310309e-01 -7.584076e-02 3.261941e-02 7.776106e-02 2.816737e-02 -2.272271e-02 3.592921e-03 -2.836216e-02 -7.275351e-02
ZAF 0.190412 -7.411430e-02 -1.422615e-01 1.624323e-01 5.822959e-02 -5.136024e-02 -2.069728e-01 5.163535e-03 -1.084788e-02 1.159689e-01
ZMB 0.150023 -1.782886e-01 -7.122649e-02 2.088026e-01 1.466777e-01 2.720530e-01 -5.921551e-02 -1.225385e-01 6.453899e-03 -3.512622e-02
ZWE 0.098964 -1.998928e-01 2.092424e-01 -4.305728e-02 1.206815e-01 -8.504574e-02 1.333520e-01 -1.496703e-01 1.545886e-01 -1.980936e-01

percent-trade Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF) percent-GDP Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF)

Variance-stabilizing and normalizing transformations

In [44]:
biplot_PCA(country_features_log, 10, 1, 2, obs='section')
In [45]:
biplot_PCA(country_features_yeo, 10, 1, 2, obs='section')

PCA on bilateral trade matrix

In [46]:
partner_features = create_features(IFF_Dest_Imp, 'Imp_IFF_hi', 
                                   features='partner.ISO', obs='reporter.ISO')
partner_features
Out[46]:
partner.ISO AGO ALB AND ARE ARG ARM ATG AUS AUT AZE ... URY USA VCT VEN VNM VUT YEM ZAF ZMB ZWE
reporter.ISO year
AGO 2009 0.0 0.0 0.0 0.000000e+00 6.014820e+07 0.0 0 6.980492e+06 3.119279e+05 0.000000 ... 2.361993e+06 9.270820e+08 0 0.0 4.734647e+07 0 0.000000 1.016827e+09 1.060946e+05 309894.487818
2010 0.0 0.0 0.0 0.000000e+00 2.963158e+07 0.0 0 2.971282e+06 5.626929e+06 18935.636990 ... 1.644130e+06 8.268511e+08 0 0.0 2.970677e+07 0 0.000000 3.315930e+08 2.665205e+04 420951.156359
2011 0.0 0.0 0.0 0.000000e+00 5.044258e+07 0.0 0 9.077825e+06 2.409418e+06 20266.842796 ... 1.016561e+06 8.601995e+08 0 0.0 2.451965e+07 0 0.000000 3.281534e+08 4.525653e+05 0.000000
2012 0.0 0.0 0.0 0.000000e+00 2.055669e+08 0.0 0 7.483480e+06 1.354856e+07 22457.788787 ... 1.582490e+06 1.441162e+09 0 0.0 5.942382e+07 0 354866.063494 7.944924e+08 1.180964e+03 0.000000
2013 0.0 0.0 0.0 0.000000e+00 5.839522e+07 0.0 0 1.569576e+07 8.770795e+06 1734.698687 ... 3.447835e+06 7.139003e+08 0 0.0 3.931132e+07 0 0.000000 4.272753e+08 2.855892e+05 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2011 0.0 0.0 0.0 5.290029e+07 2.310119e+07 0.0 0 2.325259e+06 3.171896e+05 0.000000 ... 2.721346e+04 6.462989e+08 0 0.0 5.288219e+06 0 0.000000 3.088036e+09 5.016118e+07 0.000000
2012 0.0 0.0 0.0 5.923175e+07 1.011059e+07 0.0 0 1.664879e+06 1.220841e+06 0.000000 ... 0.000000e+00 6.069423e+08 0 0.0 4.745996e+06 0 0.000000 1.242790e+09 1.704507e+08 0.000000
2013 0.0 0.0 0.0 0.000000e+00 0.000000e+00 0.0 0 0.000000e+00 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0 0.0 0.000000e+00 0 0.000000 0.000000e+00 0.000000e+00 0.000000
2014 0.0 0.0 0.0 0.000000e+00 0.000000e+00 0.0 0 0.000000e+00 0.000000e+00 0.000000 ... 0.000000e+00 0.000000e+00 0 0.0 0.000000e+00 0 0.000000 0.000000e+00 0.000000e+00 0.000000
2015 0.0 0.0 0.0 5.042948e+07 0.000000e+00 0.0 0 5.764013e+05 6.721992e+05 0.000000 ... 9.012925e+04 4.109564e+07 0 0.0 3.093477e+03 0 0.000000 6.474374e+08 0.000000e+00 0.000000

624 rows × 167 columns

Biplots

In [47]:
biplot_PCA(partner_features, 10, 1, 2, show_loadings=True)
Out[47]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
AGO 0.064538 -0.127775 -0.044675 -0.081190 0.015199 -0.030508 -0.020684 -0.075254 0.026087 0.018444
ALB 0.044858 0.100240 -0.085332 0.010378 0.222166 -0.056160 0.016405 -0.068677 0.098802 -0.006931
AND 0.002434 0.006593 -0.006567 0.005445 -0.015707 0.021064 -0.006253 -0.004508 0.001468 -0.016877
ARE 0.091433 -0.038099 0.067933 -0.046062 0.123878 -0.061712 -0.030800 0.190913 -0.147428 0.060691
ARG 0.129513 0.139207 -0.082968 0.019367 -0.043149 0.085106 -0.014389 0.038525 0.024086 0.000942
... ... ... ... ... ... ... ... ... ... ...
VUT 0.000000 0.000000 0.000000 0.000000 -0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
YEM 0.040005 -0.060918 0.007451 0.143530 0.037673 0.037506 0.085598 0.261330 0.309931 -0.017295
ZAF 0.008847 0.007834 0.091261 -0.022447 0.026785 0.001632 -0.043753 0.104236 -0.037778 0.124183
ZMB 0.040152 0.020620 -0.037492 0.009481 0.093282 -0.176393 -0.033166 0.069316 -0.109787 0.018559
ZWE 0.058102 -0.112442 0.020694 0.215291 0.042842 0.010566 0.044425 0.149634 0.182897 0.017658

167 rows × 10 columns

top-destinations Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Intra-African illicit financial flows

In [61]:
partner_features_AFR = create_features(IFF_Dest_Imp_AFR, 'Imp_IFF_hi', 
                                       features='partner.ISO', obs='reporter.ISO')
partner_features_AFR
Out[61]:
partner.ISO AGO BDI BEN BFA BWA CAF CIV CMR COG COM ... STP SWZ SYC TGO TUN TZA UGA ZAF ZMB ZWE
reporter.ISO year
AGO 2009 0.0 0.0 0.0 0.0 0.000000e+00 0.0 7.356525e+05 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 1.628157e+06 0.000000e+00 0.0 1.016827e+09 1.060946e+05 309894.487818
2010 0.0 0.0 0.0 0.0 2.025235e+05 0.0 5.577699e+07 53088.378146 1.402793e+08 0.0 ... 0.000000 0.0 0.0 252999.300576 1.137066e+08 5.529461e+03 0.0 3.315930e+08 2.665205e+04 420951.156359
2011 0.0 0.0 0.0 0.0 5.501246e+03 0.0 6.262938e+06 257786.746458 3.466061e+07 0.0 ... 0.000000 0.0 0.0 0.000000 4.623718e+07 1.745372e+06 0.0 3.281534e+08 4.525653e+05 0.000000
2012 0.0 0.0 0.0 0.0 0.000000e+00 0.0 1.425377e+07 222663.224270 2.587614e+07 0.0 ... 8435.069683 0.0 0.0 0.000000 1.262796e+07 1.253616e+05 0.0 7.944924e+08 1.180964e+03 0.000000
2013 0.0 0.0 0.0 0.0 2.945193e+05 0.0 2.125395e+06 0.000000 4.357466e+06 0.0 ... 957.605075 0.0 0.0 0.000000 4.438123e+06 1.035603e+05 0.0 4.272753e+08 2.855892e+05 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2011 0.0 0.0 0.0 0.0 8.741233e+07 0.0 0.000000e+00 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000e+00 1.295280e+06 0.0 3.088036e+09 5.016118e+07 0.000000
2012 0.0 0.0 0.0 0.0 3.187848e+07 0.0 0.000000e+00 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000e+00 2.729399e+06 0.0 1.242790e+09 1.704507e+08 0.000000
2013 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.000000e+00 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000e+00 0.000000e+00 0.0 0.000000e+00 0.000000e+00 0.000000
2014 0.0 0.0 0.0 0.0 0.000000e+00 0.0 0.000000e+00 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000e+00 0.000000e+00 0.0 0.000000e+00 0.000000e+00 0.000000
2015 0.0 0.0 0.0 0.0 8.142369e+06 0.0 0.000000e+00 0.000000 0.000000e+00 0.0 ... 0.000000 0.0 0.0 0.000000 0.000000e+00 8.238550e+06 0.0 6.474374e+08 0.000000e+00 0.000000

604 rows × 46 columns

In [65]:
biplot_PCA(partner_features_AFR, partner_features_AFR.shape[1], 1, 2)

Removing outliers

In [67]:
partner_features_AFR_noout = partner_features_AFR[~outlying]
In [68]:
biplot_PCA(partner_features_AFR_noout, 46, 1, 2)

Coloring by class label

In [72]:
biplot_PCA_classes(sector_features, 21, 1, 2, classes='Income group (World Bank)')
In [73]:
biplot_PCA_classes(partner_features, 46, 1, 2)
In [77]:
biplot_PCA_classes(partner_features_AFR_noout, 46, 1, 2)

Clustering

Agglomerative clustering

In [80]:
# Perform agglomerative clustering
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(n_clusters=5, linkage='ward').fit(X)
plt.scatter(X[:, 0], X[:, 1], c=clustering.labels_);
plt.title('Agglomerative clustering using Ward linkage');

Hierarchical clustering and corresponding heatmap

In [88]:
# Plot heatmap and corresponding dendograms
g = sns.clustermap(data_scaled,
                   method='ward',
                   cmap='bone_r',
                   row_colors=region_color,
                  );
g.fig.suptitle('Hierarchical clustering with Ward linkage', y=1);

Network analysis

In [91]:
def create_graph_data(year='2016', threshold=True, threshold_var='GDP', flow_threshold=10000):
    """
    Returns data-set to be used to generate a directed graph.
    
    year: as string, specify which year to use for the network analysis (between 2000-2016)
    threshold: as boolean, indicate whether to restrict reporting countries to conduits
    threshold_var: as string, indicate which variable to use when determining which 
        reporting countries are conduits ('GDP' or 'trade')
    flow_threshold: as numeric, specify the cut-off under which to ignore the dollar values
        of mis-invoiced imports
    """
    flow_data = IFF_Dest.reset_index().query('year == @year')
    flow_data = flow_data.loc[flow_data['Imp_IFF_hi'].notnull(), :]
    flow_data = flow_data.query('Imp_IFF_hi >= @flow_threshold')
    
    if threshold:
        conduits = 'conduits_' + threshold_var
        flow_data = flow_data[flow_data['reporter.ISO'].isin(eval(conduits).index)]
    return flow_data

Thresholding

In [92]:
# Import data on mis-invoiced trade aggregated by destination and sector for each African country
IFF_Year = pd.read_csv('Data/GER_Orig_Year_Africa.csv')

# Restrict data to 2016 as an illustrative example
IFF_Year = IFF_Year.query('year == "2016"')

# Plot distribution of proportional mis-invoiced imports for each African country
sns.distplot(IFF_Year['Tot_IFF_hi_GDP'].apply(lambda x: x*100), 
             kde=True, label='% GDP', bins=10);
sns.distplot(IFF_Year['Tot_IFF_hi_trade'].apply(lambda x: x*100),
             kde=True, label='% trade', bins=10);
plt.legend()
plt.title('Distribution of IFF in Africa in 2016')
plt.xlabel('Mis-invoiced trade for African countries');
In [93]:
print('Mean outflow as proportion of GDP:',IFF_Year['Tot_IFF_hi_GDP'].mean(),
      '\nMean outflow as proportion of trade', IFF_Year['Tot_IFF_hi_trade'].var())
Mean outflow as proportion of GDP: 0.06371070220761896 
Mean outflow as proportion of trade 0.006736362224603777
In [94]:
thresh_GDP = 0.1
thresh_trade = 0.17
conduits_GDP = IFF_Year.query('Tot_IFF_hi_GDP >= @thresh_GDP').set_index('reporter.ISO')
conduits_trade = IFF_Year.query('Tot_IFF_hi_trade >= @thresh_trade').set_index('reporter.ISO')
In [95]:
conduits_GDP[['reporter', 'year', 'Tot_IFF_hi_GDP', 'GDP']]
Out[95]:
reporter year Tot_IFF_hi_GDP GDP
reporter.ISO
MLI Mali 2016 0.102023 1.401079e+10
MOZ Mozambique 2016 0.141781 1.098136e+10
SYC Seychelles 2016 0.216881 1.427525e+09
TUN Tunisia 2016 0.100490 4.180838e+10
In [96]:
conduits_trade[['reporter', 'year', 'Tot_IFF_hi_trade', 'Total_value']]
Out[96]:
reporter year Tot_IFF_hi_trade Total_value
reporter.ISO
DZA Algeria 2016 0.175611 77082785056
BDI Burundi 2016 0.312707 754378325
MLI Mali 2016 0.210566 6788467564
MOZ Mozambique 2016 0.180047 8647421818
NGA Nigeria 2016 0.179441 68077346474
STP São Tomé and Príncipe 2016 0.204593 149825956
UGA Uganda 2016 0.233074 7802352503
In [97]:
# Import data where value of illicit flow is standardized for partners
IFF_std = pd.read_csv('Data/GER_Orig_Dest_Year_std.csv')
IFF_std = IFF_std.query('year == "2016"')

# Merge in with flow data
flow_data = pd.merge(left=create_graph_data('2016', threshold_var='GDP'),
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_hi_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])

# Plot distribution of IFF in partner countries
sns.distplot(flow_data['pImp_IFF_hi_GDP'], kde=True);
plt.title('Distribution of IFF in partner countries in 2016 (dyad-level)')
plt.xlabel('Mis-invoiced trade as proportion of GDP in partner countries');
In [98]:
def threshold_partner_IFF(flow_data, year='2016', partner_threshold=0.0001):
    """
    Filters bilateral flow data-set to minimum level of illicit flow relative to partner GDP.
    
    flow_data: as Pandas dataframe, name of data-set which contains bilateral flow data
        (in wide format)
    year: as string, specify which year to use for the network analysis (between 2000-2016)
    partner_threshold: as numeric, specify the cut-off for the minimum proportion of partner
        GDP that a bilateral flow must represent in order to be included
    """  
    IFF_std = pd.read_csv('Data/GER_Orig_Dest_Year_std.csv')
    IFF_std = IFF_std.query('year == @year')
    flow_data = pd.merge(left=flow_data,
                     right=IFF_std[['reporter.ISO', 'partner.ISO', 'pImp_IFF_hi_GDP']],
                     on=['reporter.ISO', 'partner.ISO'])
    flow_data = flow_data.query('pImp_IFF_hi_GDP >= @partner_threshold')
    return flow_data

Directed graph for GDP

In [99]:
# Graph data for conduits relative to GDP
flow_data_GDP = create_graph_data('2016', threshold_var='GDP')
flow_data_GDP = threshold_partner_IFF(flow_data_GDP)
In [100]:
# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_GDP,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF_hi',
                                create_using = nx.DiGraph())
In [104]:
def set_graph_attributes(graph):
    """
    Sets node attributes and create auxiliary variables to be used in graph visualization.
    
    Returns:
    col: list of values to color nodes according to (node GDP per capita)
    edge_col: array of values to color edges according to (logged flow between nodes)
    sizes: list of values to size nodes (proportional to degree)
    labels: dict of node labels (where a node is labelled if outdegree is at least 1)
    """
    # Create dictionary of GDP per capita for each country
    GDP_attr = covariates.loc[:, ['gdp-pc']]
    GDP_attr = GDP_attr.to_dict('index')
    # Set GDP per capita as a node attribute
    nx.set_node_attributes(graph, GDP_attr)

    # Create list of colors for nodes in the graph
    col = [nx.get_node_attributes(graph, 'gdp-pc')[n] for n in graph.nodes]

    # Create list of colors for edges in the graph
    edge_col = [nx.get_edge_attributes(graph, 'Imp_IFF_hi')[e] for e in graph.edges]
    edge_col = np.log(edge_col)

    # Extract outdegree (the number of edges coming out of nodes) and degree
    outdeg = graph.out_degree
    deg = graph.degree
    # Size of nodes will be proportional to their degree
    sizes = [10 * deg[c] for c in graph.nodes]
    # Label the countries if their outdegree is at least 1, i.e. if they are the reporting African countries
    labels = {c: c if outdeg[c] >= 1 else ''
              for c in graph.nodes}
    return col, edge_col, outdeg, deg, sizes, labels
In [105]:
# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)
In [106]:
# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');
In [107]:
# Create dictionary of longitude and latitude for countries
geo_pos = {country: (v['Longitude'], v['Latitude'])
           for country, v in
           crosswalk.drop_duplicates('ISO3166.3').set_index('ISO3166.3').to_dict('index').items()
}
list(geo_pos.items())[0:10]
Out[107]:
[('ABW', (-69.98267711, 12.52088038)),
 ('AFG', (66.00473366, 33.83523073)),
 ('AGO', (17.53736768, -12.29336054)),
 ('AIA', (-63.06498927, 18.2239595)),
 ('ALA', (19.95328768, 60.21488688)),
 ('ALB', (20.04983396, 41.14244989)),
 ('AND', (1.56054378, 42.54229102)),
 ('ANT', (nan, nan)),
 ('ARE', (54.300167099999996, 23.90528188)),
 ('ARG', (-65.17980692, -35.3813488))]
In [108]:
# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

Directed graph for trade

In [109]:
# Graph data for conduits relative to trade
flow_data_trade = create_graph_data('2016', threshold_var='trade')
flow_data_trade = threshold_partner_IFF(flow_data_trade)
In [110]:
# Create directed graph
graph = nx.from_pandas_edgelist(flow_data_trade,
                                'reporter.ISO',
                                'partner.ISO', 
                                'Imp_IFF_hi',
                                create_using = nx.DiGraph())
In [112]:
# Generate graph attributes and auxiliary variables
col, edge_col, outdeg, deg, sizes, labels = set_graph_attributes(graph)
In [113]:
# Draw directed graph and color nodes by GDP per capita
plt.figure(figsize = (10,8))

pos = nx.spring_layout(graph)
nodes = nx.draw_networkx_nodes(graph, pos,
                               node_color=col, cmap=plt.cm.spring_r)
edges = nx.draw_networkx_edges(graph, pos,
                               edge_color=edge_col, 
                               edge_cmap=plt.cm.get_cmap('RdPu'))
thous_fmt = FuncFormatter(lambda x, p: format(int(x), ','))
nx.draw_networkx_labels(graph, pos, font_size=10)
clb = plt.colorbar(nodes, format=thous_fmt)
clb.ax.set_title('GDP per capita');
In [114]:
# Map projection
crs = ccrs.PlateCarree()
fig, ax = plt.subplots(1, 1, figsize=(20, 8), subplot_kw=dict(projection=crs))
ax.coastlines(color='lightgray')
ax.add_feature(cfeature.LAND, facecolor='lightgray')
ax.add_feature(cfeature.BORDERS, edgecolor='white')
# Uncomment for the map to span the whole world
# ax.set_extent([-160, 180, -60, 90], crs=ccrs.PlateCarree())

# Overlay network graph
nx.draw_networkx(graph, ax=ax,
                 pos=geo_pos,
                 font_size=14,
                 alpha=0.7,
                 node_color=col, cmap=plt.cm.spring_r,
                 edge_color=edge_col, edge_cmap=plt.cm.get_cmap('RdPu'),
                 node_size=sizes,
                 labels=labels)

Evolution of the network during 2000-2016

Conclusion

Policy recommendations

  • Target 16.4 of the SDGs: "By 2030, significantly reduce illicit financial and arms flows, strengthen the recovery and return of stolen assets and combat all forms of organized crime"
  • Estimates of the magnitude and severity of the problem have shed a spotlight on the need to combat illicit financial flows
  • Deeper understanding of the origins and destinations is now crucial to guiding targeted interventions to combat illicit flows
  • Strong subregional effects are apparent in illicit trade within the continent. Initiatives to combat illicit flows could be anchored within the Regional Economic Communities of the African Union, such as ECOWAS.

Next steps

I plan to take this project further. Notably, I plan to conduct analysis on the most disaggregated view of the data, that is, for a reporter-partner-year-commodity tuple. By filtering by commodity sector, I will be able to identify the relevant sinks and sources.

Moreover, I am currently exploring spectral clustering. However, spectral clustering algorithms are currently implemented for undirected rather than directed graphs. There are several possible approaches to dealing with clustering on directed graphs. One of them includes a naive approach where direction is ignored, and where the graph is treated as an undirected network.

Finally, I am considering non-negative matrix factorization as an alternative form of dimension reduction.