[![Downloads](https://pepy.tech/badge/benford-py)](https://pepy.tech/project/benford-py)

# Benford for Python

--------------------------------------------------------------------------------

**Citing**

If you find *Benford_py* useful in your research, please consider adding the following citation:

```bibtex
@misc{benford_py,
  author = {Marcel, Milcent},
  title = {{Benford_py: a Python Implementation of Benford's Law Tests}},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/milcent/benford_py}},
}
```

--------------------------------------------------------------------------------

`current version = 0.5.0`

### See [release notes](https://github.com/milcent/benford_py/releases/) for features in this and in older versions

### Python versions >= 3.6

### Installation

Benford_py is a package on PyPI, so you can install it with pip:

`pip install benford_py`

or

`pip install benford-py`

Alternatively, you can `cd` into the site-packages subfolder of your Python distribution (or environment) and `git clone` from there:

`git clone https://github.com/milcent/benford_py`

For a quick start, please go to the [Demo notebook](https://github.com/milcent/benford_py/blob/master/Demo.ipynb), in which I show examples of how to run the tests with the SPY (S&P 500 ETF) daily returns.

For more fine-grained details of the functions and classes, see the [docs](https://benford-py.readthedocs.io/en/latest/index.html).

### Background

The first digit of a number is its leftmost digit.
<p align="center">
  <img alt="First Digits" src="https://github.com/milcent/benford_py/blob/master/img/First_Digits.png">
</p>

Since the first digit of any number can range from "1" to "9"
(not considering "0"), one might intuitively expect each digit
to appear as the first digit in about 1/9 of the numbers in a
set of numerical records, i.e., approximately 0.1111,
or 11.11%.

[Benford's Law](https://en.wikipedia.org/wiki/Benford%27s_law),
also known as the Law of First Digits or the Phenomenon of
Significant Digits, is the finding that the first digits of the
numbers found in series of records from the most varied sources do
not display a uniform distribution, but rather are arranged in such
a way that the digit "1" is the most frequent, followed by "2",
"3", and so on in decreasing order down to "9",
which has the lowest frequency as the first digit.

The expected distributions of the First Digits in a
Benford-compliant data set are the ones shown below:
<p align="center">
  <img alt="Expected Distributions of First Digits" src="https://github.com/milcent/benford_py/blob/master/img/First.png">
</p>

The first record on the subject dates from 1881, in the work of
Simon Newcomb, an American-Canadian astronomer and mathematician,
who noted that, in logarithmic tables, the first pages, which
contained logarithms beginning with the numerals "1" and "2",
were more worn out, that is, more consulted.

<p align="center">
  <img alt="Simon Newcomb" src="https://github.com/milcent/benford_py/blob/master/img/Simon_Newcomb_APS.jpg">
</p>
<p align="center">
  Simon Newcomb, 1835-1909.
</p>

In that same article, Newcomb proposed the formula for the
probability of a certain digit "d" being the first digit of a
number, given by the following equation.
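
As a minimal sketch, and assuming the equation referred to here is the standard first-digit form P(D = d) = log10(1 + 1/d), the expected proportions can be computed with the Python standard library alone. The function name `expected_first_digit_proportions` is hypothetical, chosen only for illustration; it is not part of the Benford_py API.

```python
import math


def expected_first_digit_proportions():
    """Return {d: log10(1 + 1/d)} for d = 1..9.

    Assumes the standard first-digit formula; this helper is
    illustrative and not taken from Benford_py itself.
    """
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


proportions = expected_first_digit_proportions()

# Print each digit with its expected proportion, rounded to 4 places.
for digit, p in proportions.items():
    print(f"{digit}: {p:.4f}")
```

Under that assumption the nine proportions sum to 1, with "1" leading about 30.1% of the numbers and "9" about 4.6%.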

<p align="center">
  <img alt="First digit equation" src="https://github.com/milcent/benford_py/blob/master/img/formula.png">
</p>
<p align="center"> where: P(D = d) is the probability that
  the first digit is equal to d, and d is an integer ranging
  from 1 to 9.
</p>

In 1938, the American physicist Frank Benford revisited the
phenomenon, which he called the "Law of Anomalous Numbers," in
a survey with more than 20,000 observations of empirical data
compiled from various sources, ranging from areas of rivers to
molecular weights of chemical compounds, including cost data,
address numbers, population sizes and physical constants. All
of them, to a greater or lesser extent, followed this
distribution.

<p align="center">
  <img alt="Frank Benford" src="https://github.com/milcent/benford_py/blob/master/img/2429_Benford-Frank.jpg">
</p>
<p align="center">
  Frank Albert Benford, Jr., 1883-1948.
</p>

The extent of Benford's work seems to have been one good reason
for the phenomenon to be popularized under his name, even though
it had been described by Newcomb 57 years earlier.

Extensions of the original formula give the expected
proportions of digits in other positions in the number,
such as the second digit (BENFORD, 1938), as well as of
combinations of digits, such as the first two digits of a
number (NIGRINI, 2012, p.5).

Only in 1995, however, was the phenomenon proven by Hill.
His proof was based on the fact that numbers in data series
following Benford's Law are, in effect, "second generation"
distributions, i.e., combinations of other distributions.
The union of randomly drawn samples from various distributions
forms a distribution that respects Benford's Law (HILL, 1995).

When sorted in ascending order, data that obey Benford's Law
must approximate a geometric sequence (NIGRINI, 2012, p.21).
From this it follows that the logarithms of this ordered series
must form a straight line. In addition, the mantissas (decimal
parts) of the logarithms of these numbers must be uniformly
distributed in the interval [0, 1) (NIGRINI, 2012, p.10).

In general, a series of numerical records follows Benford's Law
when (NIGRINI, 2012, p.21):
* it represents magnitudes of facts or events, such as populations
of cities, flows of water in rivers or sizes of celestial bodies;
* it does not have pre-established minimum or maximum limits;
* it is not made up of numbers used as identifiers, such as
identity or social security numbers, bank accounts, telephone numbers; and
* its mean is greater than the median, and the data is not
concentrated around the mean.

It follows from this expected distribution that, if a set of
numbers from a series of records that usually respects the Law
shows a deviation in the proportions found, there may be
distortions, whether intentional or not.

Benford's Law has been used in [several fields](http://www.benfordonline.net/).
After establishing that the usual data type is Benford-compliant,
one can study samples from the same data type in search of
inconsistencies, errors or even [fraud](https://www.amazon.com.br/Benfords-Law-Applications-Accounting-Detection/dp/1118152859).

This open source module is an attempt to facilitate the
performance of Benford's Law-related tests by people using
Python, whether interactively or in an automated, scripted way.

It uses the versatility of numpy and pandas, along with
matplotlib for visualization, to deliver results like the one
below and much more.

![Sample Image](https://github.com/milcent/benford_py/blob/master/img/SPY-f2d-conf_level-95.png)

It has been a long time since I last tested it in Python 2. The death clock has stopped ticking, so officially it is for Python 3 now.
It should work on Linux, Windows and Mac, but please file a bug report if you run into any trouble.

Also, if you have a nice data set that we can run these tests on, let's try it.

Thanks!

Milcent