The gpb is a compiler for Google protocol buffer definition files
for Erlang.

See https://developers.google.com/protocol-buffers/ for further information
on the Google protocol buffers.

Build Status
------------

[![Build Status](https://travis-ci.org/tomas-abrahamsson/gpb.svg?branch=master)](https://travis-ci.org/tomas-abrahamsson/gpb)

New in version 4.0.0
--------------------

The default value for the `maps_unset_optional` option has changed
to `omitted`, from `present_undefined`. This concerns only code generated
with the maps (-maps) option. Projects that already set this option
explicitly are not impacted. Projects that relied on the default being
`present_undefined` will need to set the option explicitly in order to
upgrade to 4.0.0.

For type specs, the default has changed to generate them when possible. The
option `{type_specs,false}` (-no_type) can be used to avoid generating type
specs.
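
As a rough illustration of setting these options explicitly, the calls below pass them to `gpb_compile:file/2` from an Erlang shell. This is only a sketch: it assumes gpb is on the code path, and the file name `x.proto` is a placeholder.

```erlang
%% Keep the pre-4.0.0 behaviour for maps: unset optional fields are
%% present in the map with the value `undefined`.
gpb_compile:file("x.proto", [maps, {maps_unset_optional, present_undefined}]).

%% Opt out of generating type specs.
gpb_compile:file("x.proto", [{type_specs, false}]).
```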


Basic example of using gpb
--------------------------

Let's say we have a protobuf file, `x.proto`
```protobuf
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}
```
We can generate code for this definition in a number of different
ways. Here we use the command line tool. For info on integration with
rebar, see further down.
```
# .../gpb/bin/protoc-erl -I. x.proto
```
Now we've got `x.erl` and `x.hrl`. First we compile it and then we can
try it out in the Erlang shell:
```erlang
# erlc -I.../gpb/include x.erl
# erl
Erlang/OTP 19 [erts-8.0.3] [source] [64-bit] [smp:12:12] [async-threads:10] [kernel-poll:false]

Eshell V8.0.3  (abort with ^G)
1> rr("x.hrl").
['Person']
2> x:encode_msg(#'Person'{name="abc def", id=345, email="a@example.com"}).
<<10,7,97,98,99,32,100,101,102,16,217,2,26,13,97,64,101,
  120,97,109,112,108,101,46,99,111,109>>
3> Bin = v(-1).
<<10,7,97,98,99,32,100,101,102,16,217,2,26,13,97,64,101,
  120,97,109,112,108,101,46,99,111,109>>
4> x:decode_msg(Bin, 'Person').
#'Person'{name = "abc def",id = 345,email = "a@example.com"}
```

In the Erlang shell, `rr("x.hrl")` reads record definitions, and
`v(-1)` references a value one step earlier in the history.
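
To make the encoded bytes above a little less opaque, here is a small self-contained sketch that picks apart a field key and a varint by hand. It is not part of gpb and not how the generated code is written; the module name `wire_peek` and its functions are made up for illustration.

```erlang
-module(wire_peek).
-export([split_key/1, decode_varint/1]).

%% A field key is (FieldNumber bsl 3) bor WireType. For small field
%% numbers, as in the Person example, the key fits in a single byte.
split_key(Key) when is_integer(Key), Key >= 0 ->
    {Key bsr 3, Key band 7}.

%% Decode one varint from the front of a binary: 7 payload bits per
%% byte, least significant group first. A set high bit means that
%% more bytes follow.
decode_varint(Bin) ->
    decode_varint(Bin, 0, 0).

decode_varint(<<1:1, Bits:7, Rest/binary>>, Shift, Acc) ->
    decode_varint(Rest, Shift + 7, Acc bor (Bits bsl Shift));
decode_varint(<<0:1, Bits:7, Rest/binary>>, Shift, Acc) ->
    {Acc bor (Bits bsl Shift), Rest}.
```

With this, `wire_peek:split_key(16)` gives `{2,0}` (field 2, wire type 0, that is the `id` field as a varint), and `wire_peek:decode_varint(<<217,2>>)` gives `{345,<<>>}`, matching `id=345` in the example.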

Mapping of protocol buffer datatypes to erlang
----------------------------------------------

<table>
<thead><tr><th>Protobuf type</th><th>Erlang type</th></tr></thead>
<tbody>
<!-- = = = = = = = = = = = = = = = = = = = = = = = = = = = -->
<tr><td>double, float</td>
    <td>float() | infinity | '-infinity' | nan<br/>
        When encoding, integers, too, are accepted</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>  int32,   int64<br/>
         uint32,  uint64<br/>
         sint32,  sint64<br/>
        fixed32, fixed64<br/>
       sfixed32, sfixed64</td>
    <td>integer()</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>bool</td>
    <td>true | false<br/>
        When encoding, the integers 1 and 0, too, are accepted</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>enum</td>
    <td>atom()<br/>
        unknown enums decode to integer()</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>message</td>
    <td>record (thus tuple())<br/>
        or map() if the maps (-maps) option is specified</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>string</td>
    <td>unicode string, thus list of integers<br/>
        or binary() if the strings_as_binaries (-strbin) option is
        specified<br/>
        When encoding, iolists, too, are accepted</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>bytes</td>
    <td>binary()<br/>
        When encoding, iolists, too, are accepted</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>oneof</td>
    <td><tt>{ChosenFieldName, Value}</tt><br/>
        or <tt>ChosenFieldName => Value</tt> if the {maps_oneof,flat}
        (-maps_oneof flat) option is specified (requires maps and
        maps_unset_optional = omitted)</td></tr>
<!-- - - - - - - - - - - - - - - - - - - - - - - - - - - - -->
<tr><td>map<_,_></td>
    <td>An unordered list of 2-tuples, <tt>[{Key,Value}]</tt><br/>
        or a map(), if the maps (-maps) option is specified</td></tr>
</tbody></table>


Repeated fields are represented as lists.

Optional fields are represented as either the value or `undefined` if
not set. However, for maps, if the option `maps_unset_optional` is set
to `omitted`, then unset optional values are omitted from the map,
instead of being set to `undefined` when encoding messages. When
decoding messages, even with `maps_unset_optional` set to `omitted`,
the default value will be set in the decoded map.
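
Application code that should work with either setting of `maps_unset_optional` can treat a missing key and an `undefined` value the same way. The following is a minimal sketch of that idea; the module and function names are made up and nothing here is generated by gpb.

```erlang
-module(opt_field).
-export([get_optional/2]).

%% Returns the value of Field, or `undefined` when the field is unset.
%% Works both when unset fields are absent from the map (omitted) and
%% when they are present with the value `undefined`
%% (present_undefined).
get_optional(Field, Msg) when is_map(Msg) ->
    maps:get(Field, Msg, undefined).
```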

Examples of Erlang format for protocol buffer messages
------------------------------------------------------

#### Repeated and required fields

```protobuf
    message m1 {
        repeated uint32 i   = 1;
        required bool   b   = 2;
        required eee    e   = 3;
        required submsg sub = 4;
    }
    message submsg {
        required string s = 1;
        required bytes  b = 2;
    }
    enum eee {
        INACTIVE = 0;
        ACTIVE   = 1;
    }
```
##### Corresponding Erlang
```erlang
    #m1{i   = [17, 4711],
        b   = true,
        e   = 'ACTIVE',
        sub = #submsg{s = "abc",
                      b = <<0,1,2,3,255>>}}

    %% If compiled with the maps option:
    #{i   => [17, 4711],
      b   => true,
      e   => 'ACTIVE',
      sub => #{s => "abc",
               b => <<0,1,2,3,255>>}}
```

#### Optional fields
```protobuf
    message m2 {
        optional uint32 i1 = 1;
        optional uint32 i2 = 2;
    }
```
##### Corresponding Erlang
```erlang
    #m2{i1 = 17}    % i2 is implicitly set to undefined

    %% With the maps option and maps_unset_optional set to present_undefined:
    #{i1 => 17,
      i2 => undefined}

    %% With the maps option and maps_unset_optional set to omitted
    %% (the default since 4.0.0):
    #{i1 => 17}
```

#### Oneof fields
This construct first appeared in Google protobuf version 2.6.0.
```protobuf
    message m3 {
        oneof u {
            int32  a = 1;
            string b = 2;
        }
    }
```
##### Corresponding Erlang
A oneof field is always optional.
```erlang
    #m3{u = {a, 17}}
    #m3{u = {b, "hello"}}
    #m3{}                 % u is implicitly set to undefined

    %% With the maps option
    #{u => {a, 17}}
    #{u => {b, "hello"}}
    #{}                   % If maps_unset_optional = omitted (default)
    #{u => undefined}     % With maps_unset_optional set to present_undefined

    %% With the {maps_oneof,flat} option (requires maps_unset_optional = omitted)
    #{a => 17}
    #{b => "hello"}
    #{}
```
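
On the receiving side, the `{ChosenFieldName, Value}` tuple lends itself to pattern matching. The sketch below assumes the record representation; the `#m3{}` record is declared by hand here as a stand-in for the one gpb would generate, and the module name is made up.

```erlang
-module(oneof_demo).
-export([describe/1]).

%% Hand-written stand-in for the record gpb would generate for m3.
-record(m3, {u}).

%% Dispatch on which alternative of the oneof is set, if any.
describe(#m3{u = {a, Int}}) when is_integer(Int) -> {int32, Int};
describe(#m3{u = {b, Str}})                      -> {string, Str};
describe(#m3{u = undefined})                     -> unset.
```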

#### Map fields
Not to be confused with Erlang maps.
This construct first appeared in Google protobuf version 3.0.0 (for
both the `proto2` and the `proto3` syntax).
```protobuf
    message m4 {
        map<uint32,string> f = 1;
    }
```
##### Corresponding Erlang
For records, the order of items is undefined when decoding.
```erlang
    #m4{f = []}
    #m4{f = [{1, "a"}, {2, "b"}, {13, "hello"}]}

    %% With the maps option
    #{f => #{}}
    #{f => #{1 => "a", 2 => "b", 13 => "hello"}}
```
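
Since the order of the 2-tuples is undefined for records, code that inspects or compares a decoded `map<_,_>` field should not depend on it. A small sketch of two ways to handle this, using only standard library functions (the module name is made up):

```erlang
-module(mapfield_demo).
-export([to_map/1, same_items/2]).

%% Convert the list-of-2-tuples form into an Erlang map for lookups.
to_map(Pairs) when is_list(Pairs) ->
    maps:from_list(Pairs).

%% Compare two decoded map fields without depending on item order.
same_items(PairsA, PairsB) ->
    lists:sort(PairsA) =:= lists:sort(PairsB).
```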


Unset optionals and the `default` option
----------------------------------------

#### For proto2 syntax

This describes how decoding works for optional fields that are
not present in the binary-to-decode.

The documentation for Google protobuf says these decode to the default
value if specified, or else to the field's type-specific default. The
code generated by Google's protobuf compiler also contains
`has_<field>()` methods so one can examine whether a field was
actually present or not.

However, in Erlang, the natural way to set and read fields is to just
use the syntax for records (or maps), and this leaves no good way to
both convey whether a field was present and to read the defaults at
the same time.

So the approach in `gpb` is that you have to choose one or the other.
Normally, it is possible to see whether an optional field is
present or not, e.g. by checking if the value is `undefined`. But there
are options to the compiler to instead decode to defaults, in which
case you lose the ability to see whether a field is present or not.
The options are `defaults_for_omitted_optionals` and
`type_defaults_for_omitted_optionals`, for decoding to `default=<x>`
values, or to type-specific defaults respectively.

It works this way:

```protobuf
message o1 {
    optional uint32 a = 1 [default=33];
    optional uint32 b = 2; // the type-specific default is 0
}
```

Given binary data `<<>>`, that is, neither field `a` nor `b` is present,
then the call `decode_msg(Input, o1)` results in:

```erlang
#o1{a=undefined, b=undefined} % None of the options

#o1{a=33, b=undefined}        % with option defaults_for_omitted_optionals

#o1{a=33, b=0}                % with both defaults_for_omitted_optionals
                              % and type_defaults_for_omitted_optionals

#o1{a=0, b=0}                 % with only type_defaults_for_omitted_optionals
```
The last of the alternatives is perhaps not very useful, but still
possible, and implemented for completeness.
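
If presence detection is needed and a default is still wanted when reading, one option is to decode with none of the options and apply the default in application code. This is only a sketch of that idea: the `#o1{}` record is declared by hand as a stand-in for the generated one, the value 33 is copied from the `.proto` above, and the module and function names are made up.

```erlang
-module(o1_defaults).
-export([has_a/1, a_or_default/1]).

%% Hand-written stand-in for the record gpb would generate for o1.
-record(o1, {a, b}).

%% Presence check: relies on decoding with none of the
%% *_for_omitted_optionals options, so unset fields are `undefined`.
has_a(#o1{a = undefined}) -> false;
has_a(#o1{})              -> true.

%% Read field a, falling back to the default=33 from the .proto.
a_or_default(#o1{a = undefined}) -> 33;
a_or_default(#o1{a = A})         -> A.
```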

[Google's Reference](https://developers.google.com/protocol-buffers/docs/proto#optional)

#### For proto3 syntax

For proto3, there is neither `required` nor `optional` nor
`default=<x>` for fields. Instead all fields are implicitly optional,
and if missing in the binary to decode, they always decode to the
type-specific default value. Also, it is not possible to determine
whether a value was present (with a type-specific value) or not; no
`has_<field>()` methods are generated (at least for scalars). If you
need detection of "missing" data, you must define `has_<field>`
boolean fields and set them appropriately.

This maps directly and naturally to Erlang.
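
A sketch of how such an explicit presence flag might be read on the Erlang side. It assumes a hypothetical proto3 message with an `int32` field `x` and a `bool` field `has_x`, decoded with the maps option; neither the message nor the module below comes from gpb.

```erlang
-module(proto3_presence).
-export([get_x/1]).

%% The sender is expected to set has_x = true whenever it sets x.
%% A has_x of false means that x only carries its type-specific
%% default and should be treated as missing.
get_x(#{has_x := true, x := X}) -> {ok, X};
get_x(#{has_x := false})        -> missing.
```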

Features of gpb
---------------

* Parses protocol buffer definition files and can generate:
  - record definitions, one record for each message
  - erlang code for encoding/decoding the messages to/from binaries

* Features of the protocol buffer definition files:
  gpb supports:
  - message definitions (also messages in messages)
  - scalar types
  - importing other proto files
  - nested types
  - message extensions
  - the `packed` and `default` options for fields
  - the `allow_alias` enum option (treated as if it is always set true)
  - generating metadata information
  - package namespacing (optional)
  - `oneof` (introduced in protobuf 2.6.0)
  - `map<_,_>` (introduced in protobuf 3.0.0)
  - proto3 support:
    - syntax and general semantics
    - import of well-known types
    - Callback functions can be specified for automatically translating
      google.protobuf.Any messages
  - groups

  gpb reads but ignores or throws away:
  - options other than `packed` or `default`
  - custom options

  gpb does not support:
  - aggregate custom options introduced in protobuf 2.4.0
  - rpc
  - proto3 JSON mapping

* Characteristics of gpb:
  - Skipping over unknown message fields or groups, when decoding,
    is supported
  - Merging of messages, also recursive merging, is supported
  - Gpb can optionally generate code for verification of values during
    encoding; this makes it easy to catch e.g. integers out of range,
    or values of the wrong type.
  - Gpb can optionally or conditionally copy the contents of `bytes`
    fields, in order to let the runtime system free the larger message
    binary.
  - Gpb can optionally make use of the `package` attribute by prepending
    the name of the package to every contained message type (if defined),
    which is useful to avoid name clashes of message types across packages.
  - The generated encoder/decoder has no run-time dependency on gpb,
    but there is normally a compile-time dependency for the generated
    code: on the `#field{}` record in gpb.hrl for the `get_msg_defs`
    function. It is possible to avoid this dependency by also using
    the `defs_as_proplists` or `-pldefs` option.
  - Gpb can generate code both to files and to binaries.
  - Proto input files are expected to be UTF-8, but the file reader
    will fall back to decoding the files as latin1 on UTF-8 decode errors,
    for backwards compatibility and behaviour that most closely
    emulates what Google protobuf does.

* Introspection

  gpb generates some functions for examining messages, enums and services:
  - `get_msg_defs()`, `get_msg_names()`, `get_enum_names()`
  - `find_msg_def(MsgName)` and `fetch_msg_def(MsgName)`
  - `find_enum_def(EnumName)` and `fetch_enum_def(EnumName)`
  - `enum_symbol_by_value(EnumName, Value)`,
    `enum_symbol_by_value_<EnumName>(Value)`,
    `enum_value_by_symbol(EnumName, Enum)` and
    `enum_value_by_symbol_<EnumName>(Enum)`
  - `get_service_names()`, `get_service_def(ServiceName)`, `get_rpc_names(ServiceName)`
  - `find_rpc_def(ServiceName, RpcName)`, `fetch_rpc_def(ServiceName, RpcName)`

  There are also some version information functions:

  - `gpb:version_as_string()` and `gpb:version_as_list()`
  - `GeneratedCode:version_as_string()` and `GeneratedCode:version_as_list()`
  - `?gpb_version` (in gpb_version.hrl)
  - `?'GeneratedCode_gpb_version'` (in GeneratedCode.hrl)

  The gpb can also generate a self-description of the proto file.
  The self-description is a description of the proto file, encoded to
  a binary using the descriptor.proto that comes with the Google
  protocol buffers library. Note that such encoded self-descriptions
  won't be byte-by-byte identical to what the Google protocol buffers
  compiler will generate for the same proto, but should be roughly
  equivalent.

* Erroneously encoded protobuf messages and fields will generally
  cause the decoder to crash. Examples of such erroneous encodings are:
  - varints with too many bits
  - strings, bytes, sub messages or packed repeated fields,
    where the encoded length is longer than the remaining binary

* Maps

  Gpb can generate encoders/decoders for maps.

  The option `maps_unset_optional` can be used to specify behavior
  for non-present optional fields: whether they are omitted from
  maps, or whether they are present, but have the value `undefined`
  like for records.

* Reporting of errors in .proto files

  Gpb is not very good at error reporting, especially referencing
  errors, such as references to messages that are not defined.
  You might want to first verify with `protoc` that the .proto files
  are valid before feeding them to gpb.

* Caveats

  The gpb does accept reserved words as names for fields (just like
  protoc does), but not as names for messages. To correct this, one
  would have to either rewrite the grammar, or stop using yecc.
  (maybe rewrite it all as a protoc plugin?)
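
Because erroneously encoded input generally makes the decoder crash, code that decodes untrusted data may want to wrap the call. The following is a minimal sketch, not something gpb provides: the module name is made up, and the decode call is passed in as a fun so that the sketch does not depend on any particular generated module.

```erlang
-module(safe_decode).
-export([decode/2]).

%% Run DecodeFun on Bin and turn a crash into an error tuple.
%% Which exception class and reason a generated decoder raises is not
%% specified here, so both are caught and returned as they are.
decode(DecodeFun, Bin) when is_function(DecodeFun, 1), is_binary(Bin) ->
    try
        {ok, DecodeFun(Bin)}
    catch
        Class:Reason -> {error, {Class, Reason}}
    end.
```

With the `x` module from the basic example, a call could look like `safe_decode:decode(fun(B) -> x:decode_msg(B, 'Person') end, Bin)`.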

Interaction with rebar
----------------------

For info on how to use gpb with rebar3, see
https://www.rebar3.org/docs/using-available-plugins#section-protocol-buffers

In rebar there is support for gpb since version 2.6.0. See the
proto compiler section of the rebar.config.sample file at
https://github.com/rebar/rebar/blob/master/rebar.config.sample

For older versions of rebar (prior to 2.6.0), the text below outlines
how to proceed:

Place the .proto files for instance in a `proto/` subdirectory.
Any subdirectory, other than src/, is fine, since rebar will try to
use another protobuf compiler for any .proto it finds in the src/
subdirectory. Here are some lines for the `rebar.config` file:

    %% -*- erlang -*-
    {pre_hooks,
     [{compile, "mkdir -p include"}, %% ensure the include dir exists
      {compile,
       "/path/to/gpb/bin/protoc-erl -I`pwd`/proto"
       " -o-erl src -o-hrl include `pwd`/proto/*.proto"
      }]}.

    {post_hooks,
     [{clean,
       "bash -c 'for f in proto/*.proto; "
       "do "
       "  rm -f src/$(basename $f .proto).erl; "
       "  rm -f include/$(basename $f .proto).hrl; "
       "done'"}
     ]}.

    {erl_opts, [{i, "/path/to/gpb/include"}]}.


Performance
-----------

Here is a comparison between gpb (interpreted by the Erlang VM) and
the C++, Python and Java serializers/deserializers of protobuf-2.6.1rc1.

    [MB/s]        | gpb   |pb/c++ |pb/c++ | pb/c++ | pb/py |pb/java| pb/java|
                  |       |(speed)|(size) | (lite) |       |(size) | (speed)|
    --------------+-------+-------+-------+--------+-------+-------+--------+
    small msgs    |       |       |       |        |       |       |        |
      serialize   |   52  | 1240  |   85  |   750  |  6.5  |   68  |  1290  |
      deserialize |   63  |  880  |   85  |   950  |  5.5  |   90  |   450  |
    --------------+-------+-------+-------+--------+-------+-------+--------+
    large msgs    |       |       |       |        |       |       |        |
      serialize   |   36  |  950  |   72  |   670  |  4.5  |   55  |   670  |
      deserialize |   54  |  620  |   71  |   480  |  4.0  |   60  |   360  |
    --------------+-------+-------+-------+--------+-------+-------+--------+

Performance is measured as the number of MB/s processed, counted in
serialized form. Higher values mean better performance.

The benchmarks are run with small and large messages (228 and 84584
bytes, respectively, in serialized form).

The Java benchmark is run with optimization both for code size and for
speed. The Python implementation cannot optimize for speed.

    SW: Python 2.7.11, Java 1.8.0_77 (Oracle JDK), Erlang/OTP 18.3, g++ 5.3.1
        Linux kernel 4.4, Debian (in 64 bit mode), protobuf-2.6.1rc1
    HW: Intel Core i7 5820k, 3.3GHz, 6x256 kB L2 cache, 15MB L3 cache
        (CPU frequency pinned to 3.3 GHz)

The benchmarks are all done with the exact same messages files and
proto files. The source of the benchmarks was found in the Google
protobuf's svn repository. The gpb originally did not support groups,
and the benchmarks in the protobuf used groups, so I converted the
google_message*.dat to use sub message structures instead.
For protobuf, that change was only barely noticeable.

For performance, the generated Erlang code avoids creating sub
binaries as far as possible. It has to for sub messages, strings and
bytes, but for the rest of the types, it avoids creating sub binaries,
both during encoding and decoding (for info, compile with the
`bin_opt_info` option).

The Erlang code ran in the smp emulator, though only one CPU core
was utilized.

The generated C++ code was compiled with -O3.


Version numbering
-----------------

The gpb version number is fetched from the latest git tag
matching N.M where N and M are integers. This version is
inserted into the gpb.app file as well as into the
include/gpb_version.hrl. The version is the result of the command

    git describe --always --tags --match '[0-9]*.[0-9]*'

Thus, to create a new version of gpb, the single source from where
this version is fetched is the git tag. (If you are importing
gpb into another version control system than git, or using another
build tool than rebar, you might have to adapt rebar.config and
src/gpb.app.src accordingly. See also the section below about
[building outside of a git work tree](README.md#building-outside-of-a-git-work-tree) for info on
exporting gpb from git.)

The version number on the master branch of the gpb on github is
intended to always be only integers with dots, in order to be
compatible with reltool. In other words, each push to github is
considered a release, and the version number is bumped. To ensure
this, there is a `pre-push` git hook and two scripts,
`install-git-hooks` and `tag-next-minor-vsn`, in the helpers
subdirectory. The ChangeLog file will not necessarily reflect all
minor version bumps, only important updates.

Places to update when making a new version:
* Write about the changes in the ChangeLog file,
  if it is a non-minor version bump.
* tag it in git


Building outside of a git work tree
-----------------------------------

The gpb build process requires a git work tree, with tags, to get the
version numbering right, as described in the
[Version numbering section](README.md#version-numbering). To export gpb
for building outside of a git work tree, run the
`helpers/export-from-git` script from a git work tree. The export script
will create a tar file with the version number already substituted.

In particular, the requirement on a git work tree to get the
version number right unfortunately also means that it does not work
to build github's automatically created release tarballs.


Related projects
----------------
* [rebar3_gpb_plugin](https://github.com/lrascao/rebar3_gpb_plugin) for
  using gpb with rebar3
* [exprotobuf](https://github.com/bitwalker/exprotobuf) for using gpb from
  [Elixir](http://elixir-lang.org)
* [enif_protobuf](https://github.com/jg513/enif_protobuf) for a NIF
  encoder/decoder


Contributing
------------

Contributions are welcome, preferably as pull requests or git patches
or git fetch requests. Here are some guidelines:

* Use only spaces for indentation, no tabs. Indentation is 4 spaces.
* The code must fit 80 columns
* Verify that the code and documentation compile and that tests are ok:
  `rebar clean compile eunit doc xref`
* If you add a feature, test cases are most welcome,
  so that the feature won't get lost in any future refactoring
* Use a git branch for your feature. This way, the git history will
  look better in case there is need to refetch.
581