Newcomb–Benford law and the detection of frauds in international trade

Significance The detection of frauds is one of the most prominent applications of the Newcomb–Benford law for significant digits. However, no general theory can exactly anticipate whether this law provides a valid model for genuine, that is, nonfraudulent, empirical observations, whose generating process cannot be known with certainty. Our first aim is then to establish conditions for the validity of the Newcomb–Benford law in the field of international trade data, where frauds typically involve huge amounts of money and constitute a major threat for national budgets. We also provide approximations to the distribution of test statistics when the Newcomb–Benford law does not hold, thus opening the door to the development of statistical procedures with good inferential properties and wide applicability.

Combating fraud in international trade is a crucial task of modern economic regulation. We develop statistical tools for the detection of fraud in customs declarations that rely on the Newcomb-Benford law for significant digits. Our first contribution is to identify, in the context of a European Union market, the features of the traders for which the law should hold in the absence of fraudulent data manipulation. Our results shed light on a relevant and debated question, since no known general theory can exactly predict the validity of the law for genuine empirical data. We also provide approximations to the distribution of test statistics when the Newcomb-Benford law does not hold. These approximations open the door to the development of modified goodness-of-fit procedures with wide applicability and good inferential properties.
statistical antifraud analysis | Newcomb-Benford law | customs fraud | customs valuation | anomaly detection

Combating fraud in international trade, and the corresponding protection of national budgets, is a crucial task of modern economic regulation. To give an idea of the volumes involved, in 2016 the customs duties flowing into the European Union (EU) budget amounted to more than 20 billion euros and provided about 15% of the total own resources of the EU. Huge losses thus occur when the value of imported goods is underreported (e.g., ref. 1). Most statistical antifraud techniques for international transactions fall in the class of unsupervised methods, with outlier detection and (robust) cluster analysis playing a prominent role (2-5). The rationale is that the bulk of international trade data consists of legitimate transactions, so that major frauds may stand out as highly suspicious anomalies. Considerable emphasis is also put on procedures that provide stringent control of the number of false positives (6), since substantial investigations like the one reported in ref. 1 are demanding and time consuming. A related crucial requirement is the ability to deal with massive datasets of traders and to provide, as automatically as possible, a ranking of their degree of anomaly. This information is essential for the design of efficient and effective audit plans, a major task for customs offices.
In this work we consider fraud detection through the Newcomb-Benford law (NBL). This law defines a probability distribution for patterns of significant digits in real positive numbers. It relies on the intriguing fact that in many natural and human phenomena the leading (that is, the first significant) digits are not uniformly scattered, as one could naively expect, but follow a logarithmic-type distribution. We refer to refs. 7-10 for an historical summary of the NBL, an extensive review of its challenging mathematical properties, and a survey of its more relevant applications.
Despite its long history, the mathematical and statistical challenges of the NBL have been recognized only recently. From a mathematical perspective, appropriate versions of the law appear in integer sequences, such as the celebrated Fibonacci sequence (8) or the factorial sequence (11). The law also emerges in the context of floating-point arithmetic (12), while a deep probabilistic study was carried out by Hill (13). A seminal note by Varian (14) suggested the idea that agreement with the NBL could validate the "reasonableness" of data. Since then, it has become rather well known, mainly due to the work of Nigrini (see ref. 7 for a review of such studies), that the NBL can be used as a forensic accounting and auditing tool for financial data. The law has been shown to be a valuable starting point for forensic accountants and to be applicable in a number of auditing contexts, such as external, internal, and governmental auditing. It has also proved successful in detecting misconduct in other domains, including the identification of irregularities in electoral data (15,16), campaign finance (17), and economic data (18).
Although the cited advances may suggest applicability of the NBL to international trade, there remain major unanswered questions that we address in our work. The first one concerns the trustworthiness of the NBL for genuine (that is, nonfraudulent) transactions. As shown in ref. 19, no general known theory can exactly predict whether the NBL should hold in any specific application, whose data-generating process cannot be known with certainty, even in the absence of fraud or other data manipulations; see also refs. 20-22 for related concerns. Our first goal is then to provide insight on the suitability of the NBL for modeling the distribution of digits of genuine transaction values arising in international trade. We use the Italian import market as a specimen for our study, but our approach is general and can be replicated for any country for which detailed customs data are available. Knowledge of the conditions under which the NBL should be expected to hold in the absence of data manipulation is an essential ingredient for the implementation of large-scale monitoring processes in which tens (or even hundreds) of thousands of traders are screened in an automatic and fast way with the aim of identifying the most suspicious cases. In SI Appendix, section 7 we describe a web application that has been developed to assist customs officers and auditors in this screening task, which can be executed in full autonomy on their own datasets. It may instead be very difficult to ascertain whether anomaly should be attributed to fraud or to model failure if the NBL does not provide a suitable model for genuine transactions; see also ref. 23, p. 193, for a similar concern.
Our second goal is to deepen our knowledge of the empirical behavior of NBL-conformance tests by investigating their power under different contamination schemes. The adoption of such tests for antifraud screening is based on the assumption that fabrication of data closely following the law is difficult and that fraudsters might be biased toward simpler digit distributions, such as the discrete uniform or the Dirac. We also quantify the corresponding false positive rates, to make explicit the different and possibly conflicting facets that empirical researchers have to balance in practice.
The third aim of our work is to provide corrections to test statistics when the NBL does not hold. This is typically the case for traders who operate on a limited number of products, so that there is not enough variability in their transactions. Even if the NBL is not a suitable model for genuine transaction digits, the conformance tests based on our modified statistics have the appropriate empirical size in the absence of data manipulation, while the usual tests turn out to be potentially very liberal. We argue that, having the required size under general trade conditions and being competitive in terms of power, the conformance tests based on our modified statistics are recommended. Therefore, they extend the applicability of large-scale monitoring processes of international trade data to a wider range of practical situations.

The NBL
Statistical Background. Let D1(x), D2(x), ..., be the first, the second, ..., significant digit of the positive real number x. Let X be a positive real random variable defined on the probability space (Ω, F, P). The NBL implies (and vice versa) that the following joint probability function holds for each k ∈ Z+:

$$\rho_k(d_1,\ldots,d_k) = P\bigl(D_1(X)=d_1,\ldots,D_k(X)=d_k\bigr) = \log_{10}\Bigl(1+\frac{1}{\sum_{l=1}^{k}10^{k-l}d_l}\Bigr), \qquad [1]$$

where d1 ∈ {1, ..., 9} and dl ∈ {0, ..., 9} for l = 2, ..., k. A practically important special case is that of the first two significant digits (k = 2), for which Eq. 1 reduces to

$$\rho_2(d_1,d_2) = \log_{10}\Bigl(1+\frac{1}{10\,d_1+d_2}\Bigr). \qquad [2]$$

Similarly, the marginal probability function of D1(X) is

$$\rho_1(d_1) = \log_{10}\Bigl(1+\frac{1}{d_1}\Bigr).$$

We refer to ref. 24 for a summary of the mechanisms that give rise to NBL-distributed data in accounting and finance. Among these, there are several statistical motivations for adopting the NBL as a model for the digits appearing in genuine international transactions. A major methodological basis relies on a limit theorem derived by Hill (13), to which we refer for the technical details. A key mathematical concept is that of a random probability measure, which is a function ℙ : Ω → 𝓜 (where 𝓜 is the space of probability measures on R) defined on the underlying probability space (Ω, F, P). For each Borel set B the function ω → ℙ(ω)(B) is a random variable; that is, ℙ(ω) is a probability measure on R for each ω ∈ Ω. Another important related concept is that of a sequence of ℙ-random M-samples, where M ∈ Z+. It is a sequence (Xn)n≥1 of random variables defined on (Ω, F, P) such that, for each ω ∈ Ω, the first M random variables are drawn independently from the same random probability distribution P1(ω), selected according to the random probability measure ℙ, the M subsequent random variables are drawn independently from the same random probability distribution P2(ω), in turn selected according to the random probability measure ℙ, and so on.
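Hill's limit theorem then states that, if the random probability measure ℙ satisfies some invariance conditions related to either the scale or the base of measurement, for each M ∈ Z+ the ℙ-random M-samples sequence (Xn)n≥1 converges to the NBL with probability one. That is, for each k ∈ Z+, as n → ∞,

$$\frac{1}{n}\sum_{i=1}^{n} I\bigl(D_1(X_i)=d_1,\ldots,D_k(X_i)=d_k\bigr) \;\to\; \log_{10}\Bigl(1+\frac{1}{\sum_{l=1}^{k}10^{k-l}d_l}\Bigr) \quad \text{with probability one,} \qquad [3]$$

where I(·) denotes the indicator function.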
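As a minimal illustration of Eqs. 1 and 2, the following Python sketch evaluates the NBL probability of a string of leading digits and builds the two-digit and first-digit tables. The function name `nbl_prob` and the dictionary names are illustrative choices, not part of the authors' code.

```python
import math


def nbl_prob(digits):
    """NBL probability that the leading significant digits equal `digits`."""
    d1, rest = digits[0], digits[1:]
    if not 1 <= d1 <= 9 or any(not 0 <= d <= 9 for d in rest):
        raise ValueError("need d1 in 1..9 and later digits in 0..9")
    value = 0
    for d in digits:  # builds sum_l 10^(k-l) * d_l
        value = 10 * value + d
    return math.log10(1 + 1 / value)


# Two-digit probabilities for all 90 pairs (d1, d2).
two_digit = {(d1, d2): nbl_prob((d1, d2))
             for d1 in range(1, 10) for d2 in range(10)}

# First-digit marginal obtained by summing over the second digit.
first_digit = {d1: sum(two_digit[d1, d2] for d2 in range(10))
               for d1 in range(1, 10)}
```

Summing the two-digit probabilities over the second digit telescopes to log10(1 + 1/d1), so the marginal computed in `first_digit` agrees with the one-digit form of the law.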
A second reason for adopting the NBL is that multiplicative processes, which are at the heart of many financial data, generate NBL-distributed data. More precisely, if (Xn)n≥1 is a sequence of independent and identically distributed random variables such that P(X1 = 0) = 0, as n → ∞ the sequence $(\prod_{i=1}^{n} X_i)_{n\ge 1}$ converges to the NBL with probability one (theorem 8.16 in ref. 8). It can be shown that convergence is extremely fast, since it is exponential in n (25). It is also remarkable that, given two independent random variables X and Y only one of which follows the NBL, the product XY is distributed according to the NBL provided that P(XY = 0) = 0 (theorem 8.12 in ref. 8). Finally, NBL-distributed data may also originate from random variables raised to integer powers. If X is an absolutely continuous random variable, as n → ∞ the sequence $(X^n)_{n\ge 1}$ converges to the NBL with probability one (theorem 8.8 in ref. 8).
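The multiplicative mechanism can be explored numerically. The sketch below, given only as an illustration, compares the first-digit frequencies of a single uniform variable with those of a product of 20 independent uniform variables; the choice of the uniform distribution, the number of factors, and all function names are assumptions made for this example.

```python
import math
import random


def first_digit(x):
    """First significant digit of a positive float, read from its
    scientific-notation representation."""
    return int(f"{x:.15e}"[0])


def product_first_digit_freqs(n_factors, n_samples, seed=0):
    """Relative frequencies (index 1..9) of the first digit of a product
    of `n_factors` independent uniform variables."""
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(n_samples):
        prod = 1.0
        for _ in range(n_factors):
            prod *= 1.0 - rng.random()  # uniform on (0, 1], never zero
        counts[first_digit(prod)] += 1
    return [c / n_samples for c in counts]


freqs_1 = product_first_digit_freqs(1, 20000)    # a single factor
freqs_20 = product_first_digit_freqs(20, 20000)  # product of 20 factors
nbl = [0.0] + [math.log10(1 + 1 / d) for d in range(1, 10)]
max_gap_20 = max(abs(freqs_20[d] - nbl[d]) for d in range(1, 10))
```

With a single factor the first digits are roughly equally frequent, whereas the frequencies for the product should lie close to the NBL values stored in `nbl`.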
Relevance for International Trade. Our applied focus is on transactions involving EU traders; we refer to SI Appendix, sections 3 and 7 for the institutional regulations supporting their analysis. By international trade data we mean the data collected by EU member states for imports and exports that are declared by national traders and shipping agents using the form called the Single Administrative Document (SAD). The value that we analyze for antifraud purposes is the "statistical value" reported in each SAD, which also includes the costs of insurance and freight (CIF) and is given in euros by taking into account the exchange rate (26). Our interest is then on random variables X1, ..., Xn defined on the product space and such that

$$X_i = U_i\,Q_i, \qquad i = 1,\ldots,n, \qquad [4]$$

where Ui and Qi are nonnegative random variables representing the (CIF-type) unit price in euros and the traded quantity in transaction i. If we rephrase [3] in the context of trade, n corresponds to the number of transactions made by the trader of interest, so that X1, ..., Xn is the available sample of transaction values, and the ratio m = n/M is the corresponding number of traded goods (provided that m is an integer).
There are different economic reasons suggesting that the distribution of the significant digits contained in X1, . . . , Xn may, under some conditions, be well approximated by the NBL. First, markets are hit by specific shocks and show peculiar reactions to common shocks (27). This, coupled with differences in the trader size and product quality, generates different economic processes for prices and quantities determination, which imply in turn that the observed data of prices and quantities may be described by different trader-specific probability distributions, not exactly predictable in advance. In view of [3], it is then sensible to anticipate good conformance to the NBL when a trader operates by importing or exporting a sufficiently large number of different goods, even if none of the product-specific marginal distributions of digits follows the law. The economic literature also shows that traders have different degrees of market power. Trading operations are affected by market and country features, such as different trade costs and different access to credit (e.g., ref. 28). Therefore, transactions made with different counterparties may be characterized by different economic processes, yielding distributions for transaction values that can be conceived to vary randomly from one product to another for each trader. The significant-digit distribution in international transactions can thus be expected to adhere to the NBL when the trader makes a sufficiently large number of operations, with a sufficiently large number of counterparties, possibly located in different countries.

A Contamination Model for Fraud
The Model. We phrase our antifraud approach within the framework of a trader-specific contamination model where each fraud corresponds to an outlier. For this purpose, we need a slight change in notation and we write nt for the number of transactions made by trader t, which operates on mt distinct products and for which the positive random variable X(t) now represents a transaction value. We then define

$$\pi_k^{(t)}(d_1,\ldots,d_k) = P\bigl(D_1(X^{(t)})=d_1,\ldots,D_k(X^{(t)})=d_k\bigr)$$

and let T denote the total number of traders in the market.
For t = 1, ..., T and each k ∈ Z+, the general form of our contamination model is

$$\pi_k^{(t)}(d_1,\ldots,d_k) = (1-\tau_t)\,\Psi_k^{(t)}(d_1,\ldots,d_k) + \tau_t\,\Upsilon_k^{(t)}(d_1,\ldots,d_k), \qquad [5]$$

where $\Psi_k^{(t)}(d_1,\ldots,d_k)$ is the probability of the event {D1(X(t)) = d1, ..., Dk(X(t)) = dk} for a genuine transaction, $\Upsilon_k^{(t)}(d_1,\ldots,d_k)$ is the probability of the same event for a manipulated transaction, and 0 ≤ τt ≤ 1 is the probability of fraud for trader t. Although it is convenient to work in the digit space through $\pi_k^{(t)}(d_1,\ldots,d_k)$, model 5 has a counterpart in the transaction space defined by X(t). The latter is given in SI Appendix, section 1.
Model 5 provides a principled framework for antifraud analysis of international trade data. Indeed, trader t may be considered a potential fraudster if the null hypothesis

$$H_0^{(t)}: \tau_t = 0 \qquad [6]$$

is rejected, in favor of the alternative $H_1^{(t)}: \tau_t > 0$. A useful tractable version of contamination model 5 assumes that the probability $\Psi_k^{(t)}(d_1,\ldots,d_k)$ of observing a given k-ple of digits in a genuine transaction of trader t depends on the trader features only through the values of mt and nt; that is,

$$\Psi_k^{(t)}(d_1,\ldots,d_k) = \Psi_k^{(m_t,n_t)}(d_1,\ldots,d_k).$$

Therefore, for each k ∈ Z+, the model becomes

$$\pi_k^{(t)}(d_1,\ldots,d_k) = (1-\tau_t)\,\Psi_k^{(m_t,n_t)}(d_1,\ldots,d_k) + \tau_t\,\Upsilon_k^{(t)}(d_1,\ldots,d_k), \qquad [7]$$

with [6] again stating the absence of fraud. Model 7 implies that the random vector (D1(X(t)), ..., Dk(X(t))) is independent of any other trader-specific random variable, given the values of mt and nt. Although this structure is clearly an approximation, it is coherent with the discussion about the economic elements that make the NBL a plausible model for the digit distribution in genuine international transactions. A further bonus of models 5 and 7 is that they make clear the antifraud advantages of our methodology over the often uninformative analysis of aggregated data, as given, for example, in ref. 18. In the latter instance, for each k ∈ Z+, the underlying contamination model would be

$$\pi_k(d_1,\ldots,d_k) = (1-\tau)\,\Psi_k(d_1,\ldots,d_k) + \tau\,\Upsilon_k(d_1,\ldots,d_k),$$

where the quantities involved are now constant for the whole (product-specific) market. Testing the hypothesis that τ = 0 in this restricted model requires a sample X1, ..., XT obtained from T traders, for which just one replicate is available. However, the inferential conclusion that τ > 0 is much less informative than rejection of [6] for some t ∈ {1, ..., T}. In fact, τ > 0 yields no information on the specific traders that are responsible for rejection, and identification of the fraudsters must be left to further nonstatistical investigations. Another notable advantage is that models 5 and 7 acknowledge the existence of a trader-specific propensity to fraud.
Testing the Absence of Fraud. The usual hypothesis of interest in the antifraud literature (7,10) is

$$H_0^{(t)}: \pi_k^{(t)}(d_1,\ldots,d_k) = \rho_k(d_1,\ldots,d_k), \qquad [8]$$

where $\rho_k(d_1,\ldots,d_k)$ is the NBL probability 1. Several statistics exist for testing [8] for a given value of k, the simplest one being the χ² statistic

$$V^{(t)}_{\{1,\ldots,k\}} = \sum_{d_1,\ldots,d_k} \frac{\bigl\{N^{(t)}_{d_1\cdots d_k} - n_t\,\rho_k(d_1,\ldots,d_k)\bigr\}^2}{n_t\,\rho_k(d_1,\ldots,d_k)}, \qquad [9]$$

where $N^{(t)}_{d_1\cdots d_k}$ is the number of transactions of trader t whose first k significant digits are d1, ..., dk. In practice only NBL marginals of low order are analyzed. The two-digit version of [9] is denoted by $V^{(t)}_{\{1,2\}}$, while the first-digit and second-digit versions are denoted by $V^{(t)}_{\{1\}}$ and $V^{(t)}_{\{2\}}$, respectively. In our empirical study we also consider the multiple-stage approach proposed by Barabesi et al. (6) with the aim of introducing a more stringent control on the proportion of false discoveries. This approach tests a decreasing sequence of lower-dimensional marginals of the NBL through their exact conditional distributions. Specifically, in the simple two-step version that we consider here, the method of Barabesi et al. (6) first tests the two-digit marginal [2] of the NBL by comparing $V^{(t)}_{\{1,2\}}$ to the quantiles of its exact distribution under the null, which are approximated through an efficient Monte Carlo scheme. Then, if the 2D NBL is rejected, the fit to the 1D marginals is tested by comparing $V^{(t)}_{\{1\}}$ and $V^{(t)}_{\{2\}}$ to the quantiles of their exact conditional distributions, given rejection of the 2D hypothesis, instead of their marginal ones. Type-I error rates are thus controlled at the prescribed level (e.g., 1%) at each step of the procedure, both in the two-digit and in the one-digit tests. Furthermore, the outcome of the one-digit tests reveals which digit is responsible for nonconformance to [2].
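To make the first-digit χ² test concrete, the sketch below computes a Pearson-type statistic from a list of transaction values and compares it with the asymptotic 0.99 quantile of the χ² distribution with 8 degrees of freedom (approximately 20.09). It is a minimal sketch under the assumption that the statistic has the usual Pearson form; the function names and the two toy samples are illustrative and do not come from the authors' code or data.

```python
import math
import random

CHI2_8_Q99 = 20.09  # approximate 0.99 quantile of the chi-square law with 8 df


def first_digit(x):
    """First significant digit of a positive float."""
    return int(f"{x:.15e}"[0])


def chi2_first_digit(values):
    """Pearson chi-square distance between the observed first-digit counts
    and the counts expected under the NBL."""
    n = len(values)
    counts = [0] * 10
    for x in values:
        counts[first_digit(x)] += 1
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts[d] - expected) ** 2 / expected
    return stat


rng = random.Random(1)
# Toy sample whose first digits follow the NBL by construction.
benford_like = [10 ** rng.random() for _ in range(5000)]
# Toy sample in which every value starts with the digit 7.
all_sevens = [7.0 + 0.5 * rng.random() for _ in range(100)]

stat_ok = chi2_first_digit(benford_like)
stat_bad = chi2_first_digit(all_sevens)
reject_bad = stat_bad > CHI2_8_Q99
```

The second toy sample mimics an extreme concentration of values on one leading digit, for which the statistic is far above the critical value.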
Since χ² tests may also have some shortcomings (ref. 10, chap. 37), additional procedures not based on [9] and less formal methods are considered in SI Appendix, sections 5 and 6. Qualitative findings are similar in all cases. Nevertheless, for our purposes it is instructive to look at the results for χ² tests, because their distribution (either exact or asymptotic) is known under the NBL. We can thus look at the agreement between the empirical and the nominal distribution of the test statistics to assess whether genuine transactions actually follow the law, that is, whether the digit distribution of genuine transactions is the NBL.

Adequacy of the NBL for Trade Data
Although the theoretical results sketched in the statistical background and the subsequent economic arguments broadly motivate the adoption of the NBL as a sensible model for genuine transactions in the context of international trade, it is unclear how they may fit to empirical transactions whose generating mechanism cannot be exactly known and obviously involves only a finite number of terms. One goal of our study is then to provide evidence on the quality of the NBL assumption 1 to the digit distribution of transaction values for noncheating traders that operate in real international markets. For this purpose, we assume that our contamination model holds with τt = 0 for each trader. We also take [7] as a sensible and practically workable approximation to this model in the absence of a priori information on the trader.
We simulate nonmanipulated statistical values, according to definition 4, for T† "idealized" traders in each relevant configuration of trade represented by a pair (mt, nt). For this aim, we sample transactions with replacement from the Cartesian product spaces

$$\mathcal{U}_j \times \mathcal{Q}_j, \qquad j = 1,\ldots,G, \qquad [10]$$

where $\mathcal{U}_j = \{u_1,\ldots,u_{n_j}\}$ and $\mathcal{Q}_j = \{q_1,\ldots,q_{n_j}\}$ denote the sets of unit prices (CIF-type) and traded quantities, respectively, originated in all of the market transactions involving good j, nj is the number of such transactions, and G is the total number of goods in the market. The details of the simulation algorithm are reported in SI Appendix, section 2. In our experimental setting the values of mt and nt are fixed by design, while in empirical analysis we instead condition on the observed values of mt and nt for the trader under scrutiny. We replicate genuine international trading behavior in one specific EU market by picking unit price and traded quantity at random from the database of one calendar year of Italian customs declarations, after appropriate trader and product anonymization making it impossible to infer the features of individual operators. Two databases of simulated transactions (pseudo-datasets) similar to those analyzed in this work can be accessed through SI Appendix, section 3, where their structure is explained. A description of our code is also given in SI Appendix, section 3.
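The resampling idea can be sketched as follows. The actual algorithm is the one in SI Appendix, section 2, which is not reproduced here; the code below is only a simplified stand-in that assumes goods are chosen uniformly at random, that each chosen good appears at least once, and that price and quantity are drawn independently from the pooled market values of that good. The toy market and all names are hypothetical.

```python
import random


def simulate_trader(market, m, n, rng):
    """Simulate n genuine transaction values on m distinct goods.

    `market` maps each good j to a pair (unit_prices, quantities) pooling
    all market transactions on that good.
    """
    if not 1 <= m <= n or m > len(market):
        raise ValueError("need 1 <= m <= n and m <= number of goods")
    goods = rng.sample(sorted(market), m)
    # Each chosen good appears at least once; the rest are assigned at random.
    assigned = goods + [rng.choice(goods) for _ in range(n - m)]
    values = []
    for j in assigned:
        prices, quantities = market[j]
        # Sample with replacement from the Cartesian product U_j x Q_j.
        values.append(rng.choice(prices) * rng.choice(quantities))
    return values


# Hypothetical toy market with three goods (illustrative numbers only).
toy_market = {
    "good_a": ([1.5, 2.0, 2.4], [10, 40, 100]),
    "good_b": ([120.0, 135.5], [1, 2, 3]),
    "good_c": ([0.07, 0.09, 0.11], [5000, 12000]),
}
values = simulate_trader(toy_market, m=2, n=8, rng=random.Random(42))
```

Repeating the call for many idealized traders with the same pair (m, n) gives the kind of replicated sample on which empirical test sizes can be estimated.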
For each idealized trader t and a chosen value of k , we compare the observed distribution of digits to the theoretical NBL values 1 through the test statistic V (t) {1,...,k } . This statistic will be asymptotically distributed as is indeed the NBL. Furthermore, its exact distribution under the k -digit NBL hypothesis can be approximated to an arbitrary degree of accuracy through the Monte Carlo approach of Barabesi et al. (6). We thus take the discrepancy between the estimated distribution of V for α in the usual range of significance levels. Although a value ofα close to α does not imply that the empirical distribution of V over all its support, it tells us that the approximation is satisfactory for the purpose for which V (t) {1,...,k } is computed in antifraud analysis. The insight that we gain from our study is twofold. First, we shed light on the trading configurations-represented in terms of pairs (mt , nt )that ensure close agreement between Ψ {1} , which is likely to be method of choice by many antifraud practitioners in automated large-scale auditing processes. As a reference, we also provide the estimated test sizes for the two-stage (TS) version of the procedure of Barabesi et al. (6) and for the two-digit statistic V (t) {1,2} . The former is intended to be a reasonable compromise between simplicity of use and strong reduction in the rate of false detections, while the latter is often recommended in applications with not-too-small sample sizes (ref. 7, p. 79). We estimate test sizes using [11] for a wide range of pairs (mt , nt ), with mt ≤ nt . The chosen grid represents the features of some of the most relevant traders in the empirical analysis of customs declarations. In fact, the importers for which nt < 50 cover less than 14% of the recorded transactions in our customs database and an even smaller quota in terms of traded values. 
Very big traders are not common: To give an idea, nt > 2,000 for less than 0.1% of the importers in the database, and almost 40% of the recorded transactions refer to traders with 50 ≤ nt ≤ 2,000. We present only the findings for the case α = 0.01, similar conclusions being valid for other significance levels. Table 1 displays the estimated sizes of the test of the firstdigit marginal hypothesis for both V (t) {1} (using the quantiles of its asymptotic distribution) and TS. These estimates are computed on T † = 85, 500 idealized noncheating traders, pooled across different scenarios with the same pair (mt , nt ). One striking feature of the reported values ofα in Table 1 is that they vary considerably according to the specific trading configuration. This result clearly supports the conjecture that in a realistic market scenario both mt and nt are crucial factors in determining the adequacy of the NBL as a valid model for the empirical digit distribution in the absence of data manipulation. If only one digit is considered, a sample size of nt = 50 transactions can be considered sufficiently large to justify the asymptotic χ 2 8 approximation to the distribution of V {1} to the nominal χ 2 8 distribution. Our results point to the conclusion that the NBL is not a satisfactory model when mt is much smaller than nt . This statement is verified consistently over all market configurations and does not depend on the specific testing methodology. Indeed, also the potentially very conservative TS procedure can become considerably liberal if mt nt . The same is true for other adjustments to V (t) {1} that control for multiplicity of tests among traders, not reported here. We argue that lack of variability in the transactions made by trader t is the main reason for the discrepancy between the NBL and Ψ (m t ,n t ) k (d1, . . . , d k ) when mt is small. 
Whatever the interpretation, our simulation results confirm that the asymptotic framework set by [3] does not hold if mt = o(nt), requiring instead mt = O(nt). Our results also quantify how deleterious the effect of keeping mt fixed can be on the distribution of test statistics. Indeed, they show that in this setting an increase of the sample size nt worsens the situation, since it points to a "wrong" asymptotic direction. The clear message is then that standard conformance tests, such as $V^{(t)}_{\{1,\ldots,k\}}$, should not be used for antifraud purposes when mt ≪ nt, because the hypotheses 6 and 8 can no longer be taken to be equivalent.
We conclude this section with a glimpse of the performance of the two-digit statistic $V^{(t)}_{\{1,2\}}$ (Table 2). As expected, convergence to the $\chi^2_{89}$ distribution is slower than convergence to $\chi^2_{8}$ in the one-digit case. The adoption of exact quantiles should thus be preferred with $V^{(t)}_{\{1,2\}}$, except in the instance of large values of both nt and mt. Our results confirm the relationship between accuracy of the NBL approximation and the ratio mt/nt, suggesting mt ≥ 0.2nt as a sensible rule of thumb when the exact quantiles are used. They also provide a clue to the strategy to be adopted with more complex large-k procedures.

Enemy Brothers: Power and False Positive Rate
When model 7 holds with τt > 0 for one or more traders, we write TNF = {t : τt = 0} and TF = {t : τt > 0} for the sets corresponding to noncheating traders and fraudsters, respectively. Power (P) is defined as the proportion of traders in TF that are correctly identified as potential fraudsters. The false positive rate (FPR) is the proportion of rejections of the null hypothesis 6 that turn out to be wrong, since they refer to traders that belong to TNF. Both performance measures play a crucial role when antifraud analysis is put into practice. In our simulations we compare the results under different contaminant distributions $\Upsilon_2^{(t)}(d_1,d_2)$. Our first contamination instance assumes that the first two digits of τt nt transactions from trader t ∈ TF are generated according to the discrete uniform distribution on {10, ..., 99}. Therefore,

$$\Upsilon_2^{(t)}(d_1,d_2) = \frac{1}{90} \qquad [12]$$

for d1 ∈ {1, ..., 9} and d2 ∈ {0, ..., 9}. The uniform distribution provides an unfavorable scenario for fraud detection, since $\Upsilon_2^{(t)}(d_1,d_2)$ is then close to the NBL marginal probability [2] for most digit pairs (d1, d2). Our second contamination scheme instead concentrates frauds on a specific digit pair, say $(\tilde d_1,\tilde d_2)$, randomly selected from the discrete uniform distribution on {10, ..., 99}. The contaminated model thus becomes

$$\pi_2^{(t)}(d_1,d_2) = (1-\tau_t)\,\Psi_2^{(m_t,n_t)}(d_1,d_2) + \tau_t\,I\bigl(d_1=\tilde d_1,\,d_2=\tilde d_2\bigr). \qquad [13]$$
Although this Dirac-type contamination may at first sight appear extreme, our experience with manipulated declarations is that similar patterns may arise rather frequently among the transactions found to be fraudulent, especially when contamination is due to the attempt to circumvent threshold-depending duties, either "ad valorem"-that is, computed as a percentage of the declared value-or fixed. In fact, the attempt to declare quantities below the threshold (or above it, according to the specific regulation) typically produces a bias in the corresponding values similar to that represented by a Dirac-type model. Other instances of contamination are considered in SI Appendix, section 4.
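The two contamination schemes described above can be mimicked directly in the digit space. The sketch below encodes a digit pair (d1, d2) as the integer 10·d1 + d2 and overwrites a fraction τ of a trader's pairs with either uniform or Dirac-type contaminants. It is an illustrative assumption-laden sketch: the genuine pairs are drawn here from the two-digit NBL for simplicity, and the function and variable names are hypothetical.

```python
import random
from collections import Counter


def contaminate(digit_pairs, tau, scheme, rng):
    """Replace round(tau * n) two-digit values by draws from a contaminant.

    Each pair (d1, d2) is encoded as the integer 10 * d1 + d2 in 10..99.
    """
    n = len(digit_pairs)
    out = list(digit_pairs)
    idx = rng.sample(range(n), round(tau * n))
    if scheme == "uniform":
        for i in idx:
            out[i] = rng.randint(10, 99)  # discrete uniform on {10, ..., 99}
    elif scheme == "dirac":
        target = rng.randint(10, 99)      # one randomly selected digit pair
        for i in idx:
            out[i] = target
    else:
        raise ValueError("scheme must be 'uniform' or 'dirac'")
    return out


rng = random.Random(7)
# Genuine pairs drawn from the two-digit NBL (an assumption for this toy case).
genuine = [min(99, int(10 ** (1 + rng.random()))) for _ in range(200)]
uniform_case = contaminate(genuine, 0.5, "uniform", rng)
dirac_case = contaminate(genuine, 0.5, "dirac", rng)
most_common_pair, most_common_count = Counter(dirac_case).most_common(1)[0]
```

Under the Dirac-type scheme at least half of the 200 pairs coincide, which is the kind of concentration that a two-digit conformance test is expected to detect easily.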
We consider the simplified case where τt is the same for each t ∈ TF. We take τt = 0.2, 0.5, 0.8, to represent three increasing levels of individual propensity to fraud. We also define the proportion of fraudsters in the whole market as

$$\varsigma = \frac{|T_{\mathrm{F}}|}{|T|},$$

where T = TNF ∪ TF is the set of all traders. We fix ς = 0.05, 0.1, to investigate the effect of different degrees of fraud diffusion in the market. Our estimates of P and FPR are based on T† = 10,000 idealized traders, independently generated in each configuration. Nonmanipulated transactions are again simulated with the algorithm described in SI Appendix, section 2. We restrict our analysis to the market configurations for which the NBL approximation to $\Psi_2^{(m_t,n_t)}(d_1,d_2)$ is good, and the empirical test sizes closely match the nominal one, to avoid confounding between power and lack of fit. We give results only for the configurations with mt = nt. Pairs where mt is of the same order of magnitude as nt yield qualitatively similar findings and are not reported. Table 3 shows the estimated values of P and FPR under the uniform contamination model 12 for $V^{(t)}_{\{1\}}$, using the asymptotic quantile $\chi^2_{8,0.99}$, and for the TS version of the procedure of Barabesi et al. (6). Not surprisingly, the detection rates are low in the case of sporadic contamination (τt = 0.2). It is apparent that no statistical method can be expected to have high power against "well-masked" frauds, unless the number of contaminated transactions becomes relatively large. Indeed, it is clearly seen that P rapidly grows with both τt and nt, leading to almost sure detection of fraudsters even through the potentially very conservative TS procedure (e.g., when τt = 0.8 and nt ≥ 200). Both methods thus prove to be able to identify the traders belonging to TF if there is enough information on the contaminant distribution in the available data, even in the unfavorable framework provided by [12].
The value of FPR is much higher with V {1} and TS should then depend on the user's attitude toward FPR and toward the power reduction implied by TS in situations of intermediate contamination. The value of ς does not have a major impact on P, thus suggesting that our procedures can be equally effective in detecting isolated fraudsters and more diffuse illegal trading behavior. However, a considerable increase in FPR is to be expected in the former situation, especially for V (t) {1} . Table 4 repeats the analysis under the Dirac-type scheme 13. The contaminant distribution is now well separated from Ψ (m t ,n t ) 2 (d1, d2) and both methods generally have excellent detection properties, with some minor differences only in the problematic case τt = 0.2. However, FPR is much higher for V (t) {1} . In such contamination frameworks the TS procedure thus comes closer to performing like an "ideal" test, leading to the  {1} .
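The two performance measures defined in this section can be computed from a vector of test decisions and a vector of true fraud labels, as in the following minimal sketch. Note that, following the definition given in the text, FPR is the share of rejections that concern noncheating traders; the function name and the toy vectors are illustrative.

```python
def power_and_fpr(rejected, is_fraudster):
    """Return (P, FPR): the share of fraudsters that are flagged, and the
    share of flagged traders that are in fact noncheating."""
    flagged_fraud = sum(r and f for r, f in zip(rejected, is_fraudster))
    flagged_total = sum(rejected)
    n_fraud = sum(is_fraudster)
    power = flagged_fraud / n_fraud if n_fraud else float("nan")
    fpr = ((flagged_total - flagged_fraud) / flagged_total
           if flagged_total else 0.0)
    return power, fpr


# Toy market of 10 traders: 3 fraudsters, 4 rejections in total.
rejected = [True, True, False, True, False, False, False, False, True, False]
is_fraudster = [True, True, True, False, False, False, False, False, False, False]
power, fpr = power_and_fpr(rejected, is_fraudster)
```

In this toy market two of the three fraudsters are flagged, and two of the four rejections refer to noncheating traders.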

Corrections to Goodness-of-Fit Statistics
We now focus on the trading configurations for which the NBL does not provide a satisfactory representation of the genuine digit distribution Ψ If t is the trader of interest, let t * be an idealized noncheating trader such that t * = t, while mt * = mt and nt * = nt . The set of transactions for trader t * is randomly generated according to the algorithm described in SI Appendix, section 2, and the resulting statistical values are collected in vector x (t * ) , say. Correspondingly, let V (t * ) {1,...,k } be the test statistic 9 computed for trader t * . Under model 7, the significant-digit random variables associated to the elements of x (t * ) can be considered as independent copies of those associated to the elements of X (t) , in the absence of data manipulation. We thus estimate the unknown null distribution function F V (t) {1,...,k } as a Monte Carlo average over T * replicates of t * . This yields for v ∈ R + , andζ for the corresponding estimate of the γ quantile. Therefore, we reject hypothesis 6 at nominal test size α, and we consider trader t a potential fraudster, if where v Motivated by large-scale applications, Efron (30) describes a related methodology for empirically estimating a null distribution when the standard theoretical model (such as the NBL in the case of digit counts) does not hold. This approach uses the available data to estimate an appropriate version of the distribution of the test statistic under the null hypothesis. However, it is apparent that empirical null estimation is not directly feasible when recast in the framework of models 5 and 7. One reason is that the method generally requires a known parametric form for the null distribution, whose parameters are then estimated from the available realizations of the test statistic. 
Even more fundamentally, in our applied context there is no guarantee that the proportion of genuine transactions is large for each trader, that is, that $\tau_t$ is small for each $t$ in models 5 and 7, thus violating a key assumption for empirical null estimation (ref. 30, p. 98).
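As a rough illustration of the Monte Carlo scheme outlined at the start of this section, the Python sketch below estimates a null quantile from simulated noncheating traders and applies the resulting rejection rule. It rests on several assumptions that are not stated in this section: the statistic is taken to be the first-digit Pearson chi-square distance from the NBL (a stand-in for test statistic 9), the Monte Carlo average is read as the empirical distribution function of the replicated statistics, rejection occurs when the observed value exceeds the estimated (1 - alpha) quantile, and `simulate_trader` is a hypothetical placeholder for the generator of SI Appendix, section 2.

```python
import math
import random

# Newcomb-Benford first-digit probabilities: P(D1 = d) = log10(1 + 1/d).
NBL_P1 = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """First significant digit of a positive number (read from scientific notation)."""
    return int(f"{float(x):.17e}"[0])


def chi2_first_digit(values):
    """Pearson chi-square distance between observed first-digit counts and the NBL.

    Assumption: this stands in for the paper's test statistic restricted to
    the first digit.
    """
    n = len(values)
    counts = [0] * 9
    for x in values:
        counts[first_digit(x) - 1] += 1
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, NBL_P1))


def empirical_quantile(replicates, gamma):
    """Smallest replicate whose empirical distribution function is at least gamma."""
    ordered = sorted(replicates)
    k = math.ceil(gamma * len(ordered)) - 1
    return ordered[max(k, 0)]


def monte_carlo_test(observed_values, simulate_trader, n_rep, alpha, rng):
    """Compare the observed statistic with a simulated null quantile.

    simulate_trader(rng) is a hypothetical callable returning the transaction
    values of one idealized noncheating trader.
    Returns (observed statistic, estimated quantile, reject flag).
    """
    v_obs = chi2_first_digit(observed_values)
    null_stats = [chi2_first_digit(simulate_trader(rng)) for _ in range(n_rep)]
    zeta_hat = empirical_quantile(null_stats, 1 - alpha)
    return v_obs, zeta_hat, v_obs > zeta_hat
```

In this sketch the null replicates are generated once per call; in a screening setting with many traders sharing the same configuration, the estimated quantile could instead be computed once and reused.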
On the other hand, the proportion of transactions that involve manipulated data, and their impact on $F_{V^{(t)}_{\{1,\ldots,k\}}}$, are arguably small when considering the Cartesian products defined in Eq. 10. First, neither $U_j$ nor $Q_j$ is trader specific, since they contain all of the transactions in the market for the corresponding good, and the resulting idealized transactions are further aggregated to obtain the required basket of $n_t$ transactions on $m_t$ products. Second, as already reviewed in the statistical background, an intrinsic robustness property of the NBL specification of our contamination model arises from decomposition 4, since the product of independent random variables follows the NBL if only one of the factors does, regardless of the other factors (ref. 8, p. 188). We may thus expect a reduction in the contamination effect produced by a manipulated element of $U_j$ (respectively, $Q_j$), after multiplication by a genuine element of $Q_j$ (respectively, $U_j$). Third, if the NBL does not hold, the contaminant distribution $\Upsilon^{(t)}_k(d_1, \ldots, d_k)$ for a trader $t$ may not be too far from the genuine digit distribution of some other trader $t' \neq t$, which further reduces the degree of anomaly of the corresponding realizations in the whole market. We thus see our estimate of $F_{V^{(t)}_{\{1,\ldots,k\}}}$ as an extended empirical null, in which the null distribution is estimated by exploiting all of the potential samples that could have been observed given the realized transactions in the market. Since the cardinality of this sample space is very large, we finally resort to Monte Carlo simulation for approximating the extended empirical null.

Table 5 reports the estimated sizes $\hat{\alpha}$ for different values of $n_t$ and for $m_t = 1$, when test 15 is performed at $\alpha = 0.01$ on the same sets of $t = 1, \ldots, 85{,}500$ idealized traders already considered in Table 1, and the Monte Carlo average in [14] is computed on $T^* = 10{,}000$ independent replicates for each value of $n_t$. The analysis for the case $m_t = 5$ is given in SI Appendix, section 5. In all instances, comparison with the estimated sizes of the liberal $\chi^2_8$ test (copied from Table 1) shows that the improvement provided by our procedure is substantial. The appropriate size is also reached when $n_t$ grows, while $m_t$ is kept fixed. Therefore, our approach provides a valid test of [6] even when the asymptotic framework does not comply with the requirements of Hill's limit theorem.

Table 5. Estimates of test size, P, and FPR using modified procedures 15 and 16, with $T^* = 10{,}000$, for different values of $n_t$ and for $m_t = 1$, under uniform contamination (Eq. 12) and Dirac-type contamination (Eq. 13). The estimated test sizes for $V^{(t)}_{\{1\}}$ are also given as a reference. The nominal test size is $\alpha = 0.01$. The number of independent idealized traders in each market configuration is $T^\dagger = 85{,}500$ for procedure 15 and $T^\dagger = 10{,}000$ for procedure 16, P, and FPR; $\varsigma = 0.05$ when computing P and FPR.
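The robustness property cited above (a product of independent factors follows the NBL as soon as one factor does) can be checked numerically. The following Python sketch is only an illustration: the factor distributions, the sample size, and the function names are choices made here, not taken from the study. One factor is drawn so that its significand is exactly NBL distributed, while the other is uniform on [1, 2] and therefore almost always has first digit 1.

```python
import math
import random

# Newcomb-Benford first-digit probabilities: P(D1 = d) = log10(1 + 1/d).
NBL_P1 = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """First significant digit of a positive number."""
    return int(f"{float(x):.17e}"[0])


def max_deviation_from_nbl(values):
    """Largest absolute gap between first-digit frequencies and NBL probabilities."""
    counts = [0] * 9
    for x in values:
        counts[first_digit(x) - 1] += 1
    n = len(values)
    return max(abs(c / n - p) for c, p in zip(counts, NBL_P1))


def product_check(n, seed):
    """Return the deviations for a non-NBL factor and for its product with an NBL factor."""
    rng = random.Random(seed)
    # 10**U with U uniform on [0, 1) has an exactly NBL-distributed significand.
    benford = [10 ** rng.random() for _ in range(n)]
    # Illustrative non-NBL factor: uniform on [1, 2].
    non_benford = [rng.uniform(1.0, 2.0) for _ in range(n)]
    product = [a * b for a, b in zip(benford, non_benford)]
    return max_deviation_from_nbl(non_benford), max_deviation_from_nbl(product)


if __name__ == "__main__":
    dev_factor, dev_product = product_check(50_000, seed=0)
    print(f"non-NBL factor: {dev_factor:.3f}  product: {dev_product:.3f}")
```

The deviation of the product from the NBL should be of the order of sampling noise, whereas the non-NBL factor alone departs from it markedly.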
We then compute P and FPR for test 15, under the uniform contamination model 12 and the Dirac-type contamination scheme 13, using the same sets of $t = 1, \ldots, 10{,}000$ idealized traders already considered in Tables 3 and 4. For simplicity, we restrict our analysis to $\varsigma = 0.05$ and $\tau_t = 0.5, 0.8$, similar qualitative conclusions being reached in the other cases. The results are again reported in Table 5 and in SI Appendix, section 5, for $m_t = 1$ and $m_t = 5$, respectively. We see that test 15 can have severe difficulties in discriminating between TF and TNF, unless $\Psi^{(m_t,n_t)}_k(d_1, \ldots, d_k)$ and $\Upsilon^{(t)}_k(d_1, \ldots, d_k)$ are well separated or $\tau_t$ is close to one. One reason for the observed loss of power is the large number of goods that are potentially involved in the Monte Carlo estimation process. Indeed, $m_{t^*} = m_t$ for each idealized trader $t^*$ contributing to [14], but the specific goods for which the digit distribution is obtained usually vary from trader to trader. This variability inflates the quantile estimate $\hat{\zeta}_\gamma$, especially when the ratio $n_t / m_t$ increases.
We can obtain an improved estimate of the required quantile $\zeta_\gamma$ by adopting a refined version of model 7. In this specification the genuine digit distribution depends not only on $m_t$, but also on the specific set of goods, say $G_t$, dealt with by trader $t$. Consequently, we now generate the behavior of $T^*$ idealized noncheating traders $t^*$ with the constraint that $G_{t^*} = G_t$; the resulting test procedure is labeled [16]. The number of ways in which a basket of $m_t$ products can be selected out of $G$ possible goods will be huge in any real-world scenario. Computation of the refined quantile estimate thus becomes trader specific and cannot be automated before knowing the exact composition of $G_t$, unlike the estimate $\hat{\zeta}_\gamma$ of rule 15, which depends only on the pair $(m_t, n_t)$. Nevertheless, estimation time is still acceptable for routine application of the methodology. For instance, in our experiment computation of the refined estimate using $T^* = 10{,}000$ replicates takes on average less than 0.5 s for a trader $t$ with $n_t = 200$ and $m_t = 5$.
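The difference between drawing a fresh basket of goods for every replicate and keeping the trader's own basket fixed can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the generator of SI Appendix, section 2: the `market` dictionary, the pairing of one element of $U_j$ with one element of $Q_j$ by multiplication, the uniform choice of a good for each transaction, and the first-digit chi-square statistic are all assumptions made here.

```python
import math
import random

NBL_P1 = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """First significant digit of a positive number."""
    return int(f"{float(x):.17e}"[0])


def chi2_first_digit(values):
    """First-digit Pearson chi-square distance from the NBL (assumed statistic)."""
    n = len(values)
    counts = [0] * 9
    for x in values:
        counts[first_digit(x) - 1] += 1
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, NBL_P1))


def simulate_values(market, goods, n_t, rng):
    """Draw n_t idealized transaction values restricted to the given goods.

    market maps a good j to a pair (U_j, Q_j) of market-wide lists; each value
    is the product of one element from each list (hypothetical data layout).
    Assumption: every transaction picks its good uniformly at random.
    """
    values = []
    for _ in range(n_t):
        u_pool, q_pool = market[rng.choice(goods)]
        values.append(rng.choice(u_pool) * rng.choice(q_pool))
    return values


def null_quantile(market, m_t, n_t, n_rep, gamma, rng, fixed_goods=None):
    """Estimate the gamma quantile of the null statistic by simulation.

    With fixed_goods=None each replicate draws its own basket of m_t goods;
    otherwise every replicate uses the supplied basket (the constraint that
    the idealized trader deals with the same goods as the trader of interest).
    """
    all_goods = sorted(market)
    stats = []
    for _ in range(n_rep):
        goods = fixed_goods if fixed_goods is not None else rng.sample(all_goods, m_t)
        stats.append(chi2_first_digit(simulate_values(market, goods, n_t, rng)))
    stats.sort()
    return stats[max(math.ceil(gamma * n_rep) - 1, 0)]
```

Because the fixed-basket quantile depends on the goods actually traded, it has to be recomputed for each trader, which is the computational trade-off noted in the text.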
The performance of the refined test procedure 16 is displayed in Table 5 (for $m_t = 1$) and in SI Appendix, section 5 (for $m_t = 5$). All of the estimated sizes are very close to the nominal target $\alpha = 0.01$ and similar to those obtained through [15]. Power values are comparable for the three reported tests when the genuine and the contaminant digit distributions are well separated. However, our proposals are still preferred since their FPR is considerably lower than for $V^{(t)}_{\{1\}}$. It is in the case of intermediate contamination, as under the uniform model, that the refined quantile estimator of rule 16 shows much higher efficiency than the estimator $\hat{\zeta}_\gamma$ of rule 15. In this instance rule 16 ensures that the reduction in power with respect to the $\chi^2_8$ test is minor, while keeping considerably lower values of FPR. We thus conclude that, having the appropriate size and power properties comparable to those of the liberal standard procedure, our modified tests 15 and 16 are recommended whenever the attained levels of FPR can be tolerated in practice.

Case Studies
To illustrate the use of the proposed procedure and its ability to detect relevant value manipulations, we first discuss the case of a trader extracted from an archive of fraudulent declarations provided by the Italian customs after appropriate data anonymization. The same archive was also used in ref. 6. The trader under scrutiny has $n_t = 648$ import transactions on $m_t = 38$ products from January 2014 to June 2015. The quantities and values appearing in the declarations of the three most traded products (not labeled for confidentiality reasons) are represented as (red) solid circles in the scatter plots of Fig. 1. The information displayed in such scatter plots is the input for some commonly adopted (robust) regression techniques aiming at the automatic detection of value frauds in customs data; see, e.g., ref. 31 and SI Appendix, section 7 for further details. However, the plots for this trader do not provide clear evidence of substantial undervaluation or of other major anomalies, although two of the declarations displayed in Fig. 1, Center were found to be fraudulent after substantial investigation. Our testing procedure instead produces a strong signal of contamination of the digit distribution. In fact, restricting for simplicity to the first digit, we obtain v  simulated traders with the same values of $m_t$ and $n_t$. By applying rule 15, we can thus conclude that hypothesis 6 can be safely rejected when the focus is shifted from individual transactions, as in Fig. 1, to the whole trader activity, as in our test.
The strength of evidence against the null may suggest the existence in the administrative records of this trader of a larger number of manipulated declarations than the two already detected. It also suggests that our method could be helpful in providing authorities with evidence of potential fraud among traders not previously classified as fraudsters or even not considered as suspicious. In view of contamination models 5 and 7, and of our simulation results, we expect this information gain to be higher in the case of serial misconduct. Additional investigations for this trader are given in SI Appendix, section 6. Although all methods point to the same conclusion, we remark that simple graphical tools for conformance checking (such as histograms) require substantial human interpretation and thus cannot be routinely applied to thousands of traders.
We now move to (anonymized) data provided by the customs office of another EU member state, which we label MS2 and do not name because of its specific confidentiality policy. The data were collected in the context of a specific operation on undervaluation, focusing on a limited set of products traded by fraudulent operators that have systematically falsified the import values. The traders classified as nonfraudulent were audited by the customs officers of MS2 and no indications of possible manipulation of import values were found. Although the absence of fraud can never be established with certainty, we can thus place good confidence in these statements of genuine behavior. In SI Appendix, section 6 and Table S7 we provide empirical investigations of the first-digit distribution of the 15 traders in this small benchmark study for which $n_t \geq 50$, as in our simulation experiments. We apply test 16 instead of test 15, since the available database is limited to a basket of fraud-sensitive products, and we keep $\alpha = 0.01$ and $T^* = 10{,}000$ for each observed pair $(m_t, n_t)$. We give the estimated P value of each test, computed as {1} ), and, as a reference, the asymptotic P value from the $\chi^2_8$ distribution that assumes validity of the NBL. It can be seen that our approach gives very good results, both when applied to fraudsters (it clearly rejects the hypothesis of no contamination for five traders) and in the case of genuine behavior (none of the supposedly honest traders is flagged by our test at $\alpha = 0.01$). Therefore, this study supports the claim that our methodology can be an effective aid to the preparation of the audit plans of customs services, given its ability to point to potential serial fraudsters, in agreement with current guidelines for the customs modernization process (32).
We finally note the beneficial effect of our correction for one supposedly honest trader shown in SI Appendix, Table S7, whose small basket of traded products may imply spurious deviation from the NBL when the classic $\chi^2_8$ approximation is used. An extreme example of this effect is also shown in SI Appendix, section 6.
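The contrast between a simulation-based P value and the asymptotic one can be sketched in Python as below. Two points are assumptions made for illustration: the Monte Carlo P value is taken to be the proportion of simulated null statistics at least as large as the observed one, and the function names are hypothetical. The asymptotic part uses the closed-form survival function of a chi-square distribution with an even number of degrees of freedom, here 8.

```python
import math


def mc_p_value(v_obs, null_stats):
    """Proportion of simulated null statistics that are >= the observed value.

    Assumption: this is one common way to turn Monte Carlo replicates into a
    P value; the exact expression used in the study may differ.
    """
    return sum(1 for s in null_stats if s >= v_obs) / len(null_stats)


def chi2_sf_8(x):
    """Survival function of the chi-square distribution with 8 degrees of freedom.

    For 2k degrees of freedom, P(X > x) = exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))


if __name__ == "__main__":
    # Purely illustrative numbers, not taken from the case studies.
    observed = 18.0
    replicates = [5.0, 9.5, 12.0, 17.0, 21.0, 26.0, 30.0, 35.0]
    print("Monte Carlo P value:", mc_p_value(observed, replicates))
    print("Asymptotic P value: ", round(chi2_sf_8(observed), 4))
```

When the simulated null distribution is shifted to the right of the chi-square reference, as can happen for traders with a small basket of products, the asymptotic P value will be the smaller of the two and may flag a trader that the simulation-based test does not.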

Discussion
We have developed a principled framework for goodness-of-fit testing of the NBL for antifraud purposes, with a focus on customs data collected in international trade. Our approach relies on a trader-specific contamination model, under which fraud detection has close connections with outlier testing. We have given simulation evidence, in the context of a real EU market, showing the features of the traders for which we can expect the genuine digit distribution to be well approximated by the NBL. Our simulation experiment is an empirical study addressing this issue in detail in the context of international trade, where countering fraud has become a crucial task and substantial investigations are often demanding and time consuming. We have also provided simulation-based approximations to the distribution of test statistics when the conditions ensuring the validity of the NBL do not hold. These approximations open the door to the development of goodness-of-fit procedures with good inferential properties and wide applicability.
Our methodology is general and potentially applicable to any country, or year, for which detailed customs data are available. Being mostly automatic, it is suited to implementation in large-scale monitoring processes in which thousands of traders are screened to find the most suspicious cases. It can also be a valuable aid to the design of efficient and effective audit plans. Although we expect our general guidelines to remain valid in other empirical studies, the specific quantitative findings may clearly vary from one country (year) to another.
A bonus of our contamination approach is that it makes clear the setting in which statistical antifraud analysis takes place. Our conformance testing procedures mainly aim at the detection of serial fraudsters, for which information accumulates in the corresponding transaction records. The generation of low-price clusters of anomalous transactions is a typical consequence of this cheating behavior, and robust clustering techniques can also be used for its detection (e.g., ref. 4). However, rejection of our goodness-of-fit null hypotheses often provides more compelling evidence of fraud, also because it may not be easy to identify the low-price clusters that actually correspond to illegal declarations. Testing conformance to the NBL, or to another suitable distribution for genuine digits, thus shifts the detection focus from individual transactions to the full set of data from each trader.
A word of caution concerns the fact that not all possible frauds can be detected by our method, even when we restrict to manipulation of transaction values. For instance, we cannot expect any statistical procedure (including our own proposal) to have high power against data fabrication methods that preserve the validity of the NBL, at least approximately, and against occasional frauds for which statistical tests are not powerful enough. Therefore, we do not see our methodology as the ultimate antifraud tool, but as a powerful procedure to be possibly coupled with additional information. We support integration of the signals provided by our method with those obtained through alternative statistical techniques and with less technical model-free analyses (such as those developed in refs. 7 and 10) that can be applied to a restricted number of traders. Indeed, we see our approach as a suitable automatic tool for selecting the most interesting cases for additional qualitative and quantitative investigations, while ensuring control of the statistical properties of the adopted tests.

ACKNOWLEDGMENTS.
We are grateful to Emmanuele Sordini for his contribution to the development of Web Ariadne, to Alessio Farcomeni for discussion on a previous draft, and to the reviewers for their helpful comments. The Joint Research Centre of the European Commission supported this work through the "Technology Transfer Office" project of the 2014-2020 Work Programme, in the framework of collaboration with EU member states customs and with the EU Anti-Fraud Office. This research line would not be feasible without factual collaboration of the customs services, enabled by the Hercule III Anti-fraud Programme of the European Union.