#------------------------------------------------------------------------------
# $File: statistics,v 1.3 2022/03/24 15:48:58 christos Exp $
# statistics:  file(1) magic for statistics related software
#

# From Remy Rampin

# Stata is a statistical software tool that was created in 1985. While I
# don't personally use it, data files in its native (proprietary) format
# are common (.dta files).
#
# Because they are so common, especially in statistical and social
# sciences, Stata files and SPSS files can be opened by a lot of modern
# software; for example, Python's pandas package provides built-in
# support for them (read_stata() and read_spss()).
#
# I noticed that the magic database includes an entry for SPSS files but
# not Stata files. Stata files for Stata 13 and newer (formats 117, 118,
# and 119) always begin with the string "<stata_dta><header>" as per
# https://www.stata.com/help.cgi?dta#definition
#
# The format version number always follows, for example:
#    <stata_dta><header><release>117</release>
#    <stata_dta><header><release>118</release>
#
# Therefore the following line would do the trick:
#    0       string  <stata_dta><header>     Stata Data File
#
# (I'm sure the version number could be captured as well but I did not
# manage this without a regex)
#
# Unfortunately the previous formats (created by Stata before 13, which
# was released in 2013) are harder to recognize. Format 115 starts with the
# four bytes 0x73010100 or 0x73020100, format 114 with 0x72010100 or
# 0x72020100, format 113 with 0x71010101 or 0x71020101.
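#
# A minimal, untested sketch of how those legacy headers could be matched,
# assuming the four leading bytes listed above are read as one big-endian
# 32-bit value. The entries are left commented out because such short
# signatures may be prone to false positives; the description strings are
# illustrative only and not part of the original submission.
#0	belong	0x73010100	Stata Data File (format 115)
#0	belong	0x73020100	Stata Data File (format 115)
#0	belong	0x72010100	Stata Data File (format 114)
#0	belong	0x72020100	Stata Data File (format 114)
#0	belong	0x71010101	Stata Data File (format 113)
#0	belong	0x71020101	Stata Data File (format 113)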
#
# For additional reference, the Library of Congress website has an entry
# for the Stata Data File Format 118:
# https://www.loc.gov/preservation/digital/formats/fdd/fdd000471.shtml
#
# Examples of those files can be found on Zenodo:
# https://zenodo.org/search?page=1&size=20&q=&file_type=dta
0	string	\<stata_dta\>\<header\>\<release\>	Stata Data File
>&0	regex	[0-9]+					(Release %s)