Unsupervised learning on illicit trade data

Alice Lépissier

📧 alepissier@bren.ucsb.edu

Bren School of Environmental Science and Management

Policy and media attention on illicit financial flows (IFFs) has increased with the recognition that Africa is a net creditor to the rest of the world.

New York Times (2013) · Guardian (2015) · Guardian (2017) · Economist (2019)

What is trade mis-invoicing?

  • The deliberate mis-statement of price or quantity of internationally traded goods in invoices presented to customs
  • Can occur at import or export
  • Can result in an inflow or outflow of money

Motivations for trade mis-invoicing include:

  • Evading tariffs
  • Exploiting subsidy regimes
  • Subverting forex and capital controls
  • Hiding transfers of wealth

Mechanisms of mis-invoicing

Conceptual model of the mechanisms of mis-invoicing

Why does trade mis-invoicing matter?

  • Outflows undermine the fiscal base and public spending
  • Illicit outflows widen the financing gap for meeting the Sustainable Development Goals (SDGs)
  • Combating trade mis-invoicing is crucial for the mobilization of domestic resources on the continent, and can catalyze sustainable development
Governance loop (credit: William Davis)

Data source

  • Trade mis-invoicing dataset of the United Nations Economic Commission for Africa
  • Panel with $n=6,248,254$ observations of mis-invoiced trade between 179 reporting jurisdictions and 179 partner countries for the years 2000-2016, disaggregated by commodity
  • Citation: Lépissier, Alice, Davis, William, & Ibrahim, Gamal. (2019). Trade Mis-Invoicing Dataset (Version 1). DOI
  • The entire panel panel_results.Rdata with ~6 million observations is available online.
  • The unit of observation is a reporter-partner-commodity-year tuple, where the data represent the illicit flow embedded in a transaction between a reporter $i$ and a partner $j$ for a commodity $c$ in year $t$.
  • The full panel is ~2GB. This notebook works with smaller summary datasets generated from the full panel. The code to generate these summary datasets is available in the repository https://github.com/walice/Trade-IFF by running the script Compute IFF Estimates.R.

Methodology for calculating mis-invoiced trade

  • We locate mis-invoicing in the discrepancies between reported trade flows and their mirrored statistics
  • But not all observed discrepancies are due to illicit motives!
  • Imports tend to be recorded on a Cost, Insurance and Freight (CIF) basis, while exports are recorded Free On Board (FOB)
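The CIF/FOB asymmetry means a raw mirror discrepancy overstates the scope for mis-invoicing. A minimal numerical sketch, using hypothetical values and an assumed (not estimated) 10% freight-and-insurance markup:

```python
# Hypothetical figures: the partner reports exports FOB, while the reporter
# records imports CIF. The 10% CIF/FOB markup is an illustrative assumption.
exports_fob = 100.0      # mirror statistic reported by the partner (FOB)
imports_cif = 125.0      # imports recorded by the reporter (CIF)
cif_fob_markup = 0.10    # assumed freight + insurance share

expected_imports_cif = exports_fob * (1 + cif_fob_markup)
raw_discrepancy = imports_cif - exports_fob               # ignores CIF/FOB
adjusted_discrepancy = imports_cif - expected_imports_cif
```

Only the adjusted discrepancy (15 rather than 25 in this example) is a candidate for illicit mis-invoicing; the rest is a legitimate valuation difference.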

Our approach

  1. Estimate the discrepancies between imports and mirror net exports as a function of both licit and illicit predictors
  2. Perform a harmonization procedure to generate a reconciled value that represents our best estimate of the legitimate value of the trade on a FOB basis
  3. Calculate the IFF embedded in each transaction as the difference between the observed value and the reconciled value
  • The panel can be aggregated over several dimensions, e.g. partner country, commodity, year. There are two methods to aggregate up the illicit flow:
    • Net basis: simply add up all negative and positive values, so that inflows and outflows cancel each other out
    • Gross Excluding Reversals (GER) basis: ignore all inflows, and sum up the positive values only across trading partners
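The two aggregation bases can be illustrated on a toy panel (hypothetical values; positive flows are outflows, negative flows are inflows):

```python
import pandas as pd

# Toy illicit-flow panel with hypothetical values
flows = pd.DataFrame({
    'reporter': ['AGO', 'AGO', 'AGO', 'ZWE'],
    'partner':  ['USA', 'CHN', 'FRA', 'ZAF'],
    'IFF':      [50.0, -20.0, 30.0, -10.0],
})

# Net basis: inflows and outflows cancel each other out
net = flows.groupby('reporter')['IFF'].sum()

# GER basis: ignore inflows (negative values), sum outflows only
ger = (flows.loc[flows['IFF'] > 0]
            .groupby('reporter')['IFF'].sum()
            .reindex(net.index, fill_value=0.0))
```

For AGO the net flow is 60 while the GER flow is 80; for ZWE the net flow is -10 while the GER flow is 0, since it has no outflows to sum.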

Zoom in on Africa

  • During 2000-2016, the continent lost on average \$83 billion a year in gross illicit outflows
  • Net cumulative flows during that period were \$362 billion

Map of Africa. Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

Goals

This project will apply the following techniques to the data:

  1. Dimension reduction using Principal Components Analysis (PCA)
  2. Clustering
  3. Graph analysis

Data wrangling

  1. Mis-invoiced trade for countries (aggregated data)
    • View 1: unit of observation = reporter-year; features = sectors
    • View 2: unit of observation = sector-year; features = reporters
  2. Bilateral matrix of mis-invoiced trade (dyadic data)
    • Unit of observation = reporter-partner-year
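Both views are long-to-wide pivots of the same data; the `create_features` helper defined later wraps exactly this kind of `pivot_table` call. A small sketch with hypothetical values:

```python
import pandas as pd

# Toy long-format data mimicking IFF_Sector_Imp (hypothetical values)
long_df = pd.DataFrame({
    'reporter.ISO': ['DZA', 'DZA', 'ZWE', 'ZWE'],
    'section':      ['Textiles', 'Mineral Products', 'Textiles', 'Mineral Products'],
    'year':         [2001, 2001, 2001, 2001],
    'Imp_IFF_hi':   [1.0, 2.0, 3.0, 4.0],
})

# View 1: observations = reporter-year, features = sectors
view1 = long_df.pivot_table(values='Imp_IFF_hi', index=['reporter.ISO', 'year'],
                            columns='section', fill_value=0)

# View 2: observations = sector-year, features = reporters
view2 = long_df.pivot_table(values='Imp_IFF_hi', index=['section', 'year'],
                            columns='reporter.ISO', fill_value=0)
```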

Mis-invoiced trade for countries by sectors

In [7]:
# Extract mis-invoicing in imports
IFF_Sector_Imp = IFF_Sector[['section', 'Imp_IFF_hi']]
IFF_Sector_Imp
Out[7]:
section Imp_IFF_hi
reporter.ISO year
DZA 2001 Animal and Animal Products 1.914633e+08
2001 Pulp of Wood or of Other Fibrous Material 7.436143e+07
2001 Textiles 3.560644e+07
2001 Footwear and Headgear 1.146507e+06
2001 Stone, Glass, and Ceramics 2.935880e+07
... ... ... ...
ZWE 2007 Arms and Ammunition 0.000000e+00
2009 Works of Art 0.000000e+00
2013 Miscellaneous Manufactured Articles 0.000000e+00
2014 Miscellaneous Manufactured Articles 0.000000e+00
2014 Mineral Products 0.000000e+00

11133 rows Ă— 2 columns

Mis-invoiced trade for dyads

In [12]:
# Extract mis-invoicing in imports
IFF_Dest_Imp = IFF_Dest[['partner.ISO', 'Imp_IFF_hi']]
IFF_Dest_Imp
Out[12]:
partner.ISO Imp_IFF_hi
reporter.ISO year
DZA 2001 AND 1.609561e+04
2001 ARG 4.717459e+07
2001 AUS 2.027641e+07
2001 AUT 7.641706e+07
2001 BEL 2.285729e+07
... ... ... ...
ZWE 2014 MDG 0.000000e+00
2014 SGP 0.000000e+00
2014 CHE 0.000000e+00
2014 ARE 0.000000e+00
2015 UGA 0.000000e+00

34757 rows Ă— 2 columns

Metadata for countries

In [14]:
crosswalk = pd.read_excel("Data/crosswalk.xlsx").rename(columns={'Country': 'country'})
crosswalk.head()
Out[14]:
ISO3166.3 ISO3166.2 country UN_Region UN_Region_Code UN_Sub-region UN_Sub-region_Code UN_Intermediate_Region UN_Intermediate_Region_Code UN_M49_Code ... WB_Income_Group_Code WB_Region WB_Lending_Category WB_Other OECD EU28 Arab League Commonwealth Longitude Latitude
0 ABW AW Aruba Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 533.0 ... HIC Latin America and Caribbean .. NaN 0 0 0 0 -69.982677 12.520880
1 AFG AF Afghanistan Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
2 AFG AF Afghanistan, Islamic Republic of Asia 142.0 Southern Asia 34.0 NaN NaN 4.0 ... LIC South Asia IDA HIPC 0 0 0 0 66.004734 33.835231
3 AGO AO Angola Africa 2.0 Sub-Saharan Africa 202.0 Middle Africa 17.0 24.0 ... LMC Sub-Saharan Africa IBRD NaN 0 0 0 0 17.537368 -12.293361
4 AIA AI Anguila Americas 19.0 Latin America and the Caribbean 419.0 Caribbean 29.0 660.0 ... NaN NaN NaN NaN 0 0 0 0 -63.064989 18.223959

5 rows Ă— 25 columns

Auxiliary functions

In [15]:
def create_features(data, values, features, obs):
    """
    Convert data-set in long format to wide and preserve information on year.
    
    data: {IFF_Sector_Imp, IFF_Dest_Imp, IFF_Dest_AFR, ...}, as Pandas dataframe, 
        name of data-set from which to create feature space, must be in long format
    values: {'Imp_IFF_hi', 'Exp_IFF_hi'}, as string, values that data-set will represent
    features: {'reporter.ISO', 'section', 'partner.ISO'}, as string, 
        what to use as the feature space
    obs: {'section', 'reporter.ISO'}, as string, what to use as the observation level
    """
    features_data = data.pivot_table(values=values, 
                                     columns=features, 
                                     index=[obs, 'year'], 
                                     fill_value=0)
    return features_data
In [17]:
def biplot_PCA(features_data, nPC=2, firstPC=1, secondPC=2, obs='reporter.ISO', show_loadings=False):
    """
    Project the data in the 2-dimensional space spanned by 2 principal components
    chosen by the user, along with a bi-plot of the top 3 loadings per PC, and color observations
    by class label.

    Args:
        features_data: as Pandas dataframe, data-set of features
        nPC: number of principal components
        firstPC: integer denoting first principal component to plot in bi-plot
        secondPC: integer denoting second principal component to plot in bi-plot
        obs: string denoting index of class labels (in features_data)
        show_loadings: Boolean indicating whether PCA loadings should be displayed
    Returns:
        plot (interactive)
        pca_loadings (if show_loadings=True)
    """
        
    # Run PCA (standardize data beforehand)
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=nPC, random_state=234)
    princ_comp = pca.fit_transform(features_data_std)

    # Extract PCA loadings
    cols = ['PC' + str(c+1) for c in np.arange(nPC)]
    pca_loadings = pd.DataFrame(pca.components_.T, 
                                columns=cols,
                                index=list(features_data.columns))
   
    # Extract PCA scores
    pca_scores = pd.DataFrame(princ_comp, 
                              columns=cols)
    pca_scores[obs] = features_data.reset_index()[obs].values.tolist()
    pca_scores['year'] = features_data.reset_index()['year'].values.tolist()
    
    score_PC1 = princ_comp[:,firstPC-1]
    score_PC2 = princ_comp[:,secondPC-1]
    
    # Generate plot data
    if obs == 'reporter.ISO':
        plot_data = pd.merge(pca_scores, obs_info, on=[obs, 'year'])
        color_obs = 'reporter'
        tooltip_obs = ['reporter', 'year', 'Income group (World Bank)', 'Country status (UN)']
    else:
        plot_data = pca_scores
        color_obs = 'section'
        tooltip_obs = ['section', 'year']

    # Return chosen PCs to plot
    PC1 = 'PC'+str(firstPC)
    PC2 = 'PC'+str(secondPC)

    # Extract top loadings (in absolute value)
    # TO DO: use dict to iterate over
    toploadings_PC1 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC1).tail(3)[[PC1, PC2]]
    toploadings_PC2 = pca_loadings.apply(lambda x: abs(x)).sort_values(by=PC2).tail(3)[[PC1, PC2]]

    originsPC1 = pd.DataFrame({'index':toploadings_PC1.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    originsPC2 = pd.DataFrame({'index':toploadings_PC2.index.tolist(), 
                               PC1: np.zeros(3), 
                               PC2: np.zeros(3)})
    
    toploadings_PC1 = pd.concat([toploadings_PC1.reset_index(), originsPC1], axis=0)
    toploadings_PC2 = pd.concat([toploadings_PC2.reset_index(), originsPC2], axis=0)

    toploadings_PC1[PC1] = toploadings_PC1[PC1]*max(score_PC1)*1.5
    toploadings_PC1[PC2] = toploadings_PC1[PC2]*max(score_PC2)*1.5
    toploadings_PC2[PC1] = toploadings_PC2[PC1]*max(score_PC1)*1.5
    toploadings_PC2[PC2] = toploadings_PC2[PC2]*max(score_PC2)*1.5
    
    # Project top 3 loadings over the space spanned by 2 principal components
    # (collect the layers in a list; Chart objects do not support item assignment)
    lines = []
    for color, dataset in zip(['#440154FF', '#21908CFF'], [toploadings_PC1, toploadings_PC2]):
        lines.append(alt.Chart(dataset).mark_line(color=color).encode(
            x=PC1 + ':Q',
            y=PC2 + ':Q',
            detail='index'
        ).properties(
            width=400,
            height=400
        ))
    
    # Add labels to the loadings
    text = []
    for color, dataset in zip(['#440154FF', '#21908CFF'], [toploadings_PC1[0:3], toploadings_PC2[0:3]]):
        text.append(alt.Chart(dataset).mark_text(
            align='left',
            baseline='bottom',
            color=color
        ).encode(
            x=PC1 + ':Q',
            y=PC2 + ':Q',
            text='index'
        ))
    
    # Scatter plot colored by observation class label
    points = alt.Chart(plot_data).mark_circle(size=60).encode(
        x=alt.X(PC1, axis=alt.Axis(title='Principal Component ' + str(firstPC))),
        y=alt.Y(PC2, axis=alt.Axis(title='Principal Component ' + str(secondPC))),
        color=alt.Color(color_obs, scale=alt.Scale(scheme='category20b'),
                       legend=alt.Legend(orient='right')),
        tooltip=tooltip_obs
    ).interactive()
    
    # Bind it all together
    chart = (points + lines[0] + lines[1] + text[0] + text[1])    
    chart.display()

    if show_loadings:
        return pca_loadings
In [18]:
def scree_plot(features_data, show_explained_var=False):
    """
    Create a cumulative scree plot and optionally return the variance explained by each component.
    
    features_data: as Pandas dataframe, the data-set on which to run PCA
    show_explained_var: as Boolean, flag for whether to return explained variance
    """
    
    features_data_std = StandardScaler().fit_transform(features_data)
    pca = PCA(n_components=features_data_std.shape[1], random_state=234)
    princ_comp = pca.fit_transform(features_data_std)
    
    explained_var = pd.DataFrame({'PC': np.arange(1,features_data_std.shape[1]+1),
                                  'var': pca.explained_variance_ratio_,
                                  'cumvar': np.cumsum(pca.explained_variance_ratio_)})

    # Adapted from https://altair-viz.github.io/gallery/multiline_tooltip.html
    # Create a selection that chooses the nearest point & selects based on x-value
    nearest = alt.selection(type='single', nearest=True, on='mouseover',
                            fields=['PC'], empty='none')

    # The basic line
    line = alt.Chart(explained_var).mark_line(interpolate='basis', color='#FDE725FF').encode(
        alt.X('PC:Q',
            scale=alt.Scale(domain=(1, len(explained_var))),
            axis=alt.Axis(title='Principal Component')
        ),
        alt.Y('cumvar:Q',
            scale=alt.Scale(domain=(min(explained_var['cumvar']), 1)),
            axis=alt.Axis(title='Cumulative Variance Explained')
        ),
    )

    # Transparent selectors across the chart. This is what tells us
    # the x-value of the cursor
    selectors = alt.Chart(explained_var).mark_point().encode(
        x='PC:Q',
        opacity=alt.value(0),
    ).add_selection(
        nearest
    )

    # Draw points on the line, and highlight based on selection
    points = line.mark_point().encode(
        opacity=alt.condition(nearest, alt.value(1), alt.value(0))
    )

    # Draw text labels near the points, and highlight based on selection
    text = line.mark_text(align='left', dx=5, dy=-5).encode(
        text=alt.condition(nearest, 'cumvar:Q', alt.value(' '))
    )

    # Draw a rule at the location of the selection
    rules = alt.Chart(explained_var).mark_rule(color='gray').encode(
        x='PC:Q',
    ).transform_filter(
        nearest
    )

    # Put the five layers into a chart and bind the data
    out = alt.layer(
        line, selectors, points, rules, text
    ).properties(
        title='Cumulative scree plot',
        width=500, height=300
    )
    
    out.display()
    
    if show_explained_var:
        return explained_var[['PC', 'var']]
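As an aside, scikit-learn can also choose the number of components directly: passing a float in (0, 1) as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that share. A sketch on random stand-in data (not the sector features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(234)
X = rng.normal(size=(100, 21))          # stand-in for a features matrix
X_std = StandardScaler().fit_transform(X)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that share (here, 90%)
pca = PCA(n_components=0.90, random_state=234)
pca.fit(X_std)
```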

PCA on feature space (for individual reporting countries)

Sector features

In [19]:
sector_features = create_features(IFF_Sector_Imp, 'Imp_IFF_hi', 
                                  features='section', obs='reporter.ISO')
sector_features
Out[19]:
section Animal and Animal Products Animal or Vegetable Fats and Oils Arms and Ammunition Base Metals Chemicals and Allied Industries Footwear and Headgear Machinery and Electrical Mineral Products Miscellaneous Manufactured Articles Pearls, Precious Stones and Metals ... Precision Instruments Prepared Foodstuffs Pulp of Wood or of Other Fibrous Material Raw Hides, Skins, Leather, and Furs Stone, Glass, and Ceramics Textiles Transportation Vegetable Products Wood and Wood Products Works of Art
reporter.ISO year
AGO 2009 2.694946e+08 7.652455e+07 67738.261242 1.025920e+09 5.115176e+08 2.920410e+07 1.525512e+09 1.727293e+09 2.472936e+08 5.575633e+05 ... 1.242156e+08 4.781184e+08 1.149992e+08 1.097769e+07 1.349018e+08 1.827787e+08 2.198838e+09 4.249722e+08 4.337043e+07 177640.929155
2010 2.064906e+08 7.311396e+07 0.000000 9.989842e+08 2.575443e+08 7.853065e+06 1.308337e+09 2.953387e+09 8.404302e+07 3.209177e+05 ... 1.115183e+08 2.342889e+08 7.332134e+07 4.013044e+06 7.601278e+07 8.285704e+07 9.611656e+08 1.959128e+08 2.310136e+07 404828.231824
2011 2.995594e+08 8.058180e+07 89365.880480 6.689511e+08 2.931591e+08 1.172881e+07 1.389743e+09 2.286539e+09 4.413341e+07 9.645181e+05 ... 1.418956e+08 3.159980e+08 9.911835e+07 9.995669e+06 6.607126e+07 7.912925e+07 7.070474e+08 3.734589e+08 1.363601e+07 586586.948365
2012 7.103770e+08 2.746924e+08 261265.895948 1.753970e+09 8.540407e+08 3.823724e+07 2.790368e+09 9.203666e+08 1.763904e+08 1.271289e+06 ... 2.073111e+08 1.025668e+09 2.641831e+08 1.394356e+07 1.630307e+08 1.683128e+08 1.975553e+09 6.541154e+08 3.980884e+07 739122.050820
2013 4.031560e+08 1.541679e+08 188343.533148 9.230300e+08 4.793908e+08 1.212363e+07 1.428283e+09 2.009915e+09 4.871315e+07 6.250082e+05 ... 1.063123e+08 4.931687e+08 1.798533e+08 8.930246e+06 7.882264e+07 7.858108e+07 4.962937e+09 3.995114e+08 1.836680e+07 418841.640492
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZWE 2011 3.098093e+07 6.060789e+07 9124.882478 1.693777e+08 2.260227e+09 2.817909e+06 3.231869e+08 2.747704e+08 9.872441e+06 3.618592e+06 ... 2.136919e+07 1.144497e+08 5.855387e+07 6.259512e+05 2.652474e+07 4.726781e+07 8.963644e+08 1.874874e+08 6.016952e+06 35481.251510
2012 2.784481e+07 5.549232e+07 856.631347 1.085453e+08 4.929549e+08 7.015919e+06 3.451931e+08 1.461844e+08 1.510668e+07 2.876014e+05 ... 3.162707e+07 2.025934e+08 6.024629e+07 1.412797e+06 2.482629e+07 5.147542e+07 9.468142e+08 1.900125e+08 7.901902e+06 59677.045599
2013 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2014 0.000000e+00 0.000000e+00 0.000000 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
2015 1.458046e+07 7.135043e+07 19192.507302 8.703951e+07 2.299085e+08 4.468801e+06 3.437244e+08 1.401916e+08 2.432263e+07 3.672531e+07 ... 3.233606e+07 3.778318e+07 3.157795e+07 1.061420e+06 1.724449e+07 2.841926e+07 2.137692e+08 1.585549e+08 4.569325e+06 6305.070602

624 rows Ă— 21 columns

Biplots

In [20]:
biplot_PCA(sector_features, 10, 1, 2, obs='reporter.ISO', show_loadings=True)
Out[20]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Animal and Animal Products 0.223097 -0.285686 -0.052218 0.200337 0.280789 -0.075192 -0.159824 0.157986 -0.133621 0.067581
Animal or Vegetable Fats and Oils 0.145805 -0.041750 0.262100 -0.281551 -0.378544 -0.470151 0.409970 0.471670 0.177591 -0.031639
Arms and Ammunition 0.030559 -0.044358 0.524677 -0.654182 0.434354 0.055627 -0.201368 -0.113006 -0.079142 -0.115541
Base Metals 0.235966 -0.206534 -0.261180 -0.147777 -0.019680 -0.153272 0.017922 -0.247084 0.174715 -0.069873
Chemicals and Allied Industries 0.271238 0.107591 -0.117222 -0.089472 -0.096976 -0.130645 -0.071165 -0.219172 -0.198166 0.036517
Footwear and Headgear 0.206357 0.378112 0.015170 -0.019961 -0.001383 -0.029500 -0.042735 0.010544 0.026510 -0.252435
Machinery and Electrical 0.272101 0.160466 0.076229 0.091921 -0.067637 -0.159494 0.022549 -0.135941 -0.157103 -0.115312
Mineral Products 0.238785 0.071925 -0.098357 -0.201011 -0.166463 0.073908 -0.133348 0.121464 -0.241190 0.675042
Miscellaneous Manufactured Articles 0.246123 0.283544 0.122775 0.033592 -0.049743 -0.014818 -0.017630 0.069480 -0.247438 -0.043076
Pearls, Precious Stones and Metals 0.060193 0.251657 -0.373387 -0.153599 0.567863 0.055855 0.647708 0.088517 -0.049224 0.089860
Plastics and Rubbers 0.264093 -0.144764 0.124109 0.236111 0.112342 0.023331 0.062706 0.076398 0.119468 -0.076649
Precision Instruments 0.223754 0.362288 -0.013533 0.023957 -0.084425 -0.090796 -0.080419 -0.186415 -0.262231 -0.078130
Prepared Foodstuffs 0.252722 -0.227483 -0.012139 0.160146 0.177798 -0.024717 -0.088546 0.289203 -0.187744 -0.058613
Pulp of Wood or of Other Fibrous Material 0.272304 -0.112890 -0.010570 0.074023 0.024044 -0.016865 0.003323 -0.123810 0.135842 0.200798
Raw Hides, Skins, Leather, and Furs 0.206454 0.084348 0.178427 0.121865 -0.101317 0.619649 0.102592 0.287981 -0.057302 -0.251296
Stone, Glass, and Ceramics 0.225603 0.111290 0.264504 0.269760 0.061210 -0.018303 0.176815 -0.368090 0.400752 -0.018317
Textiles 0.209936 -0.103101 -0.054321 -0.264334 -0.273480 0.542184 0.126078 -0.076346 0.216014 0.139611
Transportation 0.238443 -0.119628 0.256225 0.081261 0.184539 -0.059237 0.015553 -0.095856 0.219874 0.334486
Vegetable Products 0.245486 -0.284943 -0.122225 -0.029458 0.071482 -0.036325 -0.074223 0.202113 -0.106308 -0.240274
Wood and Wood Products 0.201178 -0.222154 -0.377723 -0.300290 -0.132165 0.007197 -0.048891 -0.185469 0.074259 -0.351532
Works of Art 0.089055 0.388758 -0.239308 -0.078967 0.157495 -0.041382 -0.486227 0.369405 0.561157 0.018339

Treemap of sectors. Source: generated by Data Visualization.R in https://github.com/walice/Trade-IFF

In [22]:
biplot_PCA(sector_features, 10, 5, 6, obs='reporter.ISO')

Explained variance

In [23]:
scree_plot(sector_features, show_explained_var=True)
Out[23]:
PC var
0 1 0.524988
1 2 0.121415
2 3 0.054159
3 4 0.049898
4 5 0.042107
5 6 0.041007
6 7 0.036014
7 8 0.028847
8 9 0.024146
9 10 0.018279
10 11 0.012558
11 12 0.009987
12 13 0.007011
13 14 0.006890
14 15 0.005213
15 16 0.004848
16 17 0.004001
17 18 0.002819
18 19 0.002489
19 20 0.001800
20 21 0.001524
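Using the ratios reported above, the first principal component alone explains ~52% of the variance, and nine components are needed to pass 90%:

```python
import numpy as np

# Explained-variance ratios for the first 10 PCs, from the table above
var = np.array([0.524988, 0.121415, 0.054159, 0.049898, 0.042107,
                0.041007, 0.036014, 0.028847, 0.024146, 0.018279])
cumvar = np.cumsum(var)

# Smallest number of components whose cumulative variance reaches 90%
k90 = int(np.argmax(cumvar >= 0.90) + 1)
```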

Variance-stabilizing and normalizing transformations

In [24]:
# Plot distribution of illicit flow in each feature (i.e. sector)
fig, axes = joypy.joyplot(sector_features, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors');
In [28]:
# Apply a Yeo-Johnson power transformation to stabilize variance and reduce skew
# (PowerTransformer standardizes the output by default)
from sklearn.preprocessing import PowerTransformer
sector_features_yeo = PowerTransformer(method='yeo-johnson').fit_transform(sector_features)
sector_features_yeo = pd.DataFrame(sector_features_yeo,
                                   index=sector_features.index,
                                   columns=sector_features.columns)
fig, axes = joypy.joyplot(sector_features_yeo, colormap=plt.cm.viridis, figsize=(8,8),
                          title='Distribution of mis-invoicing across sectors (Yeo–Johnson transformation)');