BERT
----

:download:`Download scripts </model_zoo/bert.zip>`


Reference: Devlin, Jacob, et al. "`Bert: Pre-training of deep bidirectional transformers for language understanding. <https://arxiv.org/abs/1810.04805>`_" arXiv preprint arXiv:1810.04805 (2018).

BERT Model Zoo
~~~~~~~~~~~~~~

The following pre-trained BERT models are available from the **gluonnlp.model.get_model** API:

+-----------------------------------------+----------------+-----------------+
|                                         | bert_12_768_12 | bert_24_1024_16 |
+=========================================+================+=================+
| book_corpus_wiki_en_uncased             | ✓              | ✓               |
+-----------------------------------------+----------------+-----------------+
| book_corpus_wiki_en_cased               | ✓              | ✓               |
+-----------------------------------------+----------------+-----------------+
| openwebtext_book_corpus_wiki_en_uncased | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| wiki_multilingual_uncased               | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| wiki_multilingual_cased                 | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| wiki_cn_cased                           | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| scibert_scivocab_uncased                | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| scibert_scivocab_cased                  | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| scibert_basevocab_uncased               | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| scibert_basevocab_cased                 | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| biobert_v1.0_pmc_cased                  | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| biobert_v1.0_pubmed_cased               | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| biobert_v1.0_pubmed_pmc_cased           | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| biobert_v1.1_pubmed_cased               | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| clinicalbert_uncased                    | ✓              | x               |
+-----------------------------------------+----------------+-----------------+
| kobert_news_wiki_ko_cased               | ✓              | x               |
+-----------------------------------------+----------------+-----------------+

where **bert_12_768_12** refers to the BERT BASE model, and **bert_24_1024_16** refers to the BERT LARGE model.

.. code-block:: python

    import mxnet as mx
    import gluonnlp as nlp

    # Load the pre-trained BERT BASE model and its matching vocabulary
    model, vocab = nlp.model.get_model('bert_12_768_12', dataset_name='book_corpus_wiki_en_uncased',
                                       use_classifier=False, use_decoder=False)
    tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
    transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=512, pair=False, pad=False)
    sample = transform(['Hello world!'])
    words, valid_len, segments = mx.nd.array([sample[0]]), mx.nd.array([sample[1]]), mx.nd.array([sample[2]])
    # seq_encoding: token-level outputs; cls_encoding: pooled [CLS] representation
    seq_encoding, cls_encoding = model(words, segments, valid_len)


The pretrained parameters for dataset_name
'openwebtext_book_corpus_wiki_en_uncased' were obtained by running the GluonNLP
BERT pre-training script on OpenWebText.

The pretrained parameters for dataset_name 'scibert_scivocab_uncased',
'scibert_scivocab_cased', 'scibert_basevocab_uncased', 'scibert_basevocab_cased'
were obtained by converting the parameters published by "Beltagy, I., Cohan, A.,
& Lo, K. (2019). Scibert: Pretrained contextualized embeddings for scientific
text. arXiv preprint `arXiv:1903.10676 <https://arxiv.org/abs/1903.10676>`_."

The pretrained parameters for dataset_name 'biobert_v1.0_pmc_cased',
'biobert_v1.0_pubmed_cased', 'biobert_v1.0_pubmed_pmc_cased', 'biobert_v1.1_pubmed_cased'
were obtained by converting the parameters published by "Lee, J., Yoon, W., Kim, S.,
Kim, D., Kim, S., So, C. H., & Kang, J. (2019). Biobert: pre-trained biomedical
language representation model for biomedical text mining. arXiv preprint
`arXiv:1901.08746 <https://arxiv.org/abs/1901.08746>`_."

The pretrained parameters for dataset_name 'clinicalbert_uncased' were obtained by
converting the parameters published by "Huang, K., Altosaar, J., & Ranganath, R.
(2019). ClinicalBERT: Modeling Clinical Notes and Predicting Hospital
Readmission. arXiv preprint `arXiv:1904.05342
<https://arxiv.org/abs/1904.05342>`_."

Additionally, GluonNLP supports the "`RoBERTa <https://arxiv.org/abs/1907.11692>`_" model:

+-----------------------------------------+-------------------+--------------------+
|                                         | roberta_12_768_12 | roberta_24_1024_16 |
+=========================================+===================+====================+
| openwebtext_ccnews_stories_books_cased  | ✓                 | ✓                  |
+-----------------------------------------+-------------------+--------------------+

.. code-block:: python

    import mxnet as mx
    import gluonnlp as nlp

    # Load the pre-trained RoBERTa BASE model; RoBERTa uses the GPT-2 BPE tokenizer
    model, vocab = nlp.model.get_model('roberta_12_768_12', dataset_name='openwebtext_ccnews_stories_books_cased',
                                       use_decoder=False)
    tokenizer = nlp.data.GPT2BPETokenizer()
    text = [vocab.bos_token] + tokenizer('Hello world!') + [vocab.eos_token]
    seq_encoding = model(mx.nd.array([vocab[text]]))

GluonNLP also supports the "`DistilBERT <https://arxiv.org/abs/1910.01108>`_" model:

+-----------------------------------------+----------------------+
|                                         | distilbert_6_768_12  |
+=========================================+======================+
| distil_book_corpus_wiki_en_uncased      | ✓                    |
+-----------------------------------------+----------------------+

.. code-block:: python

    import mxnet as mx
    import gluonnlp as nlp

    # Load the pre-trained DistilBERT model and its vocabulary
    model, vocab = nlp.model.get_model('distilbert_6_768_12', dataset_name='distil_book_corpus_wiki_en_uncased')
    tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)
    transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=512, pair=False, pad=False)
    sample = transform(['Hello world!'])
    words, valid_len = mx.nd.array([sample[0]]), mx.nd.array([sample[1]])
    # DistilBERT does not take segment (token type) inputs
    seq_encoding, cls_encoding = model(words, valid_len)

Finally, GluonNLP also supports the Korean BERT pre-trained model, "`KoBERT <https://github.com/SKTBrain/KoBERT>`_", trained on a Korean wiki dataset (`kobert_news_wiki_ko_cased`).

.. code-block:: python

    import mxnet as mx
    import gluonnlp as nlp

    model, vocab = nlp.model.get_model('bert_12_768_12', dataset_name='kobert_news_wiki_ko_cased',
                                       use_decoder=False, use_classifier=False)
    tok = nlp.data.get_tokenizer('bert_12_768_12', 'kobert_news_wiki_ko_cased')
    tok('안녕하세요.')

.. hint::

   The pre-training, fine-tuning and export scripts are available `here </_downloads/bert.zip>`__.


Sentence Classification
~~~~~~~~~~~~~~~~~~~~~~~

GluonNLP provides the following example script to fine-tune sentence classification with a pre-trained
BERT model.

To enable mixed precision training with float16, set the `--dtype` argument to `float16`.

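For example, an MRPC fine-tuning run with mixed precision could look roughly like the sketch below. The script name `finetune_classifier.py` and the exact argument spellings are assumptions here; the commands linked in the table below are the reference invocations.

.. code-block:: console

    $ python finetune_classifier.py --task_name MRPC --batch_size 32 --lr 2e-5 --epochs 3 \
             --gpu 0 --dtype float16
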
Results using `bert_12_768_12`:

.. editing URL for the following table: https://tinyurl.com/y4n8q84w

+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Task Name        |Metrics              |Results on Dev Set     |log                                                                                                                                         |command                                                                                                                                                          |
+=================+=====================+=======================+============================================================================================================================================+=================================================================================================================================================================+
| CoLA            |Matthew Corr.        |60.32                  |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_CoLA_base_mx1.6.0rc1.log>`__                                 |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_CoLA_base_mx1.6.0rc1.sh>`__                                                   |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SST-2           |Accuracy             |93.46                  |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_SST_base_mx1.6.0rc1.log>`__                                  |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_SST_base_mx1.6.0rc1.sh>`__                                                    |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MRPC            |Accuracy/F1          |88.73/91.96            |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_MRPC_base_mx1.6.0rc1.log>`__                                 |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_MRPC_base_mx1.6.0rc1.sh>`__                                                   |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| STS-B           |Pearson Corr.        |90.34                  |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_STS-B_base_mx1.6.0rc1.log>`__                                |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_STS-B_base_mx1.6.0rc1.sh>`__                                                  |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| QQP             |Accuracy             |91                     |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_QQP_base_mx1.6.0rc1.log>`__                                  |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_QQP_base_mx1.6.0rc1.sh>`__                                                    |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MNLI            |Accuracy(m/mm)       |84.29/85.07            |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_MNLI_base_mx1.6.0rc1.log>`__                                 |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_MNLI_base_mx1.6.0rc1.sh>`__                                                   |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| XNLI (Chinese)  |Accuracy             |78.43                  |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_XNLI_base_mx1.6.0rc1.log>`__                                 |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_XNLI-B_base_mx1.6.0rc1.sh>`__                                                 |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| RTE             |Accuracy             |74                     |`log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_RTE_base_mx1.6.0rc1.log>`__                                  |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_RTE_base_mx1.6.0rc1.sh>`__                                                    |
+-----------------+---------------------+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+


Results using `roberta_12_768_12`:

.. editing URL for the following table: https://www.shorturl.at/cjAO7

+---------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| Dataset             | SST-2                                                                                                | MNLI-M/MM                                                                                                        |
+=====================+======================================================================================================+==================================================================================================================+
| Validation Accuracy | 95.3%                                                                                                | 87.69%, 87.23%                                                                                                   |
+---------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| Log                 | `log  <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/roberta/finetuned_sst.log>`__      | `log <https://raw.githubusercontent.com/dmlc/web-data/master/gluonnlp/logs/roberta/mnli_1e-5-32.log>`__          |
+---------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+
| Command             | `command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/roberta/finetuned_sst.sh>`__    | `command  <https://raw.githubusercontent.com/dmlc/web-data/master/gluonnlp/logs/roberta/finetuned_mnli.sh>`__    |
+---------------------+------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------+

.. editing URL for the following table: https://tinyurl.com/y5rrowj3

Question Answering on SQuAD
~~~~~~~~~~~~~~~~~~~~~~~~~~~

+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Dataset   | SQuAD 1.1                                                                                                                               | SQuAD 1.1                                                                                                                                | SQuAD 2.0                                                                                                                                |
+===========+=========================================================================================================================================+==========================================================================================================================================+==========================================================================================================================================+
| Model     | bert_12_768_12                                                                                                                          | bert_24_1024_16                                                                                                                          | bert_24_1024_16                                                                                                                          |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| F1 / EM   | 88.58 / 81.26                                                                                                                           | 90.97 / 84.22                                                                                                                            | 81.27 / 78.14                                                                                                                            |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Log       | `log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_base_mx1.6.0rc1.log>`__                         | `log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_large_mx1.6.0rc1.log>`__                         | `log <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad2.0_large_mx1.6.0rc1.log>`__                         |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Command   | `command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_base_mx1.6.0rc1.sh>`__                      | `command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_large_mx1.6.0rc1.sh>`__                      | `command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad2.0_large_mx1.6.0rc1.sh>`__                      |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Prediction| `predictions.json <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_base_mx1.6.0rc1.json>`__           | `predictions.json <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad1.1_large_mx1.6.0rc1.json>`__           | `predictions.json <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/finetune_squad2.0_large_mx1.6.0rc1.json>`__           |
+-----------+-----------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------+

For all model settings above, we set learning rate = 3e-5 and optimizer = adam.

Note that the BERT model is memory-consuming. If you have limited GPU memory, you can use the following command to accumulate gradients and achieve the same effective batch size by setting the *accumulate* and *batch_size* arguments accordingly; for example, *accumulate=2* with *batch_size=6* corresponds to an effective batch size of 12 per GPU.

.. code-block:: console

    $ python finetune_squad.py --optimizer adam --accumulate 2 --batch_size 6 --lr 3e-5 --epochs 2 --gpu

We support multi-GPU training via horovod:

.. code-block:: console

    $ HOROVOD_WITH_MXNET=1 HOROVOD_GPU_ALLREDUCE=NCCL pip install horovod --user --no-cache-dir
    $ horovodrun -np 8 python finetune_squad.py --bert_model bert_24_1024_16 --batch_size 4 --lr 3e-5 --epochs 2 --gpu --dtype float16 --comm_backend horovod

SQuAD 2.0
+++++++++

For SQuAD 2.0, you need to specify the *version_2* parameter and set the *null_score_diff_threshold* parameter, whose typical values lie between -1.0 and -5.0. Use the following command to fine-tune the BERT large model on SQuAD 2.0 and generate predictions.json.

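The invocation below is an illustrative sketch; the exact flag spellings for *version_2* and *null_score_diff_threshold* should be verified against the script's help output.

.. code-block:: console

    $ python finetune_squad.py --bert_model bert_24_1024_16 --optimizer adam --batch_size 4 \
             --lr 3e-5 --epochs 2 --gpu --version_2 --null_score_diff_threshold -2.0
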
To get the score on the dev data, you need to download the dev dataset (`dev-v2.0.json <https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json>`_) and the evaluation script (`evaluate-v2.0.py <https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/>`_). Then use the following command to get the score on the dev dataset.

.. code-block:: console

    $ python evaluate-v2.0.py dev-v2.0.json predictions.json

BERT INT8 Quantization
~~~~~~~~~~~~~~~~~~~~~~

GluonNLP provides the following example scripts to quantize fine-tuned
BERT models into the int8 data type. Note that INT8 quantization requires a nightly
version of `mxnet-mkl <https://apache-mxnet.s3-us-west-2.amazonaws.com/dist/index.html>`_.

Sentence Classification
+++++++++++++++++++++++

+-----------+-------------------+---------------+---------------+---------+---------+------------------------------------------------------------------------------------------------------------------------+
|  Dataset  | Model             | FP32 Accuracy | INT8 Accuracy | FP32 F1 | INT8 F1 | Command                                                                                                                |
+===========+===================+===============+===============+=========+=========+========================================================================================================================+
| MRPC      | bert_12_768_12    | 87.01         | 87.01         | 90.97   | 90.88   |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/calibration_MRPC_base_mx1.6.0b20200125.sh>`__ |
+-----------+-------------------+---------------+---------------+---------+---------+------------------------------------------------------------------------------------------------------------------------+
| SST-2     | bert_12_768_12    | 93.23         | 93.00         |         |         |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/calibration_SST_base_mx1.6.0b20200125.sh>`__  |
+-----------+-------------------+---------------+---------------+---------+---------+------------------------------------------------------------------------------------------------------------------------+

Question Answering
++++++++++++++++++

+-----------+-------------------+---------+---------+---------+---------+----------------------------------------------------------------------------------------------------------------------------+
|  Dataset  | Model             | FP32 EM | INT8 EM | FP32 F1 | INT8 F1 | Command                                                                                                                    |
+===========+===================+=========+=========+=========+=========+============================================================================================================================+
| SQuAD 1.1 | bert_12_768_12    | 81.18   | 80.32   | 88.58   | 88.10   |`command <https://github.com/dmlc/web-data/blob/master/gluonnlp/logs/bert/calibration_squad1.1_base_mx1.6.0b20200125.sh>`__ |
+-----------+-------------------+---------+---------+---------+---------+----------------------------------------------------------------------------------------------------------------------------+

For all model settings above, we use a subset of the evaluation dataset for calibration.

Pre-training from Scratch
~~~~~~~~~~~~~~~~~~~~~~~~~

We also provide scripts for pre-training BERT with masked language modeling and next sentence prediction.

The pre-training data format expects: (1) one sentence per line, which should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, for the "next sentence prediction" task; and (2) blank lines between documents. You can find a sample pre-training text with 3 documents `here <https://github.com/dmlc/gluon-nlp/blob/master/scripts/bert/sample_text.txt>`__. You can perform sentence segmentation with an off-the-shelf NLP toolkit such as NLTK; a toy example of the expected layout is shown below.
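
For instance, an input file with two documents could look like this (the sentences are arbitrary; only the one-sentence-per-line layout and the blank line separating documents matter):

.. code-block:: text

    The cat sat on the mat.
    It was looking out of the window.

    BERT is pre-trained on large text corpora.
    Its pre-training objectives are masked language modeling and next sentence prediction.
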


.. hint::

   You can download a pre-processed English Wikipedia dataset `here <https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/enwiki-197b5d8d.zip>`__.


Prerequisites
+++++++++++++

We recommend horovod for scalable multi-GPU, multi-machine training.

To install horovod, you need:

- `NCCL <https://developer.nvidia.com/nccl>`__, and
- `OpenMPI <https://www.open-mpi.org/software/ompi/v4.0/>`__

Then you can install horovod via the following command:

.. code-block:: console

    $ HOROVOD_WITH_MXNET=1 HOROVOD_GPU_ALLREDUCE=NCCL pip install horovod==0.16.2 --user --no-cache-dir

Run Pre-training
++++++++++++++++

You can use the following command to run pre-training with 2 hosts, 8 GPUs each:

.. code-block:: console

    $ mpirun -np 16 -H host0_ip:8,host1_ip:8 -mca pml ob1 -mca btl ^openib \
             -mca btl_tcp_if_exclude docker0,lo --map-by ppr:4:socket \
             --mca plm_rsh_agent 'ssh -q -o StrictHostKeyChecking=no' \
             -x NCCL_MIN_NRINGS=8 -x NCCL_DEBUG=INFO -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
             -x MXNET_SAFE_ACCUMULATION=1 --tag-output \
             python run_pretraining.py --data='folder1/*.txt,folder2/*.txt,' \
             --data_eval='dev_folder/*.txt,' --num_steps 1000000 \
             --lr 1e-4 --total_batch_size 256 --accumulate 1 --raw --comm_backend horovod

If you see an out-of-memory error, try increasing --accumulate for gradient accumulation.

When multiple hosts are present, please make sure you can ssh to these nodes without a password.

Alternatively, if horovod is not available, you can run pre-training with the MXNet native parameter server by setting --comm_backend and --gpus.

.. code-block:: console

    $ MXNET_SAFE_ACCUMULATION=1 python run_pretraining.py --comm_backend device --gpus 0,1,2,3,4,5,6,7 ...

The BERT base model produced by the GluonNLP pre-training script (`log <https://raw.githubusercontent.com/dmlc/web-data/master/gluonnlp/logs/bert/bert_base_pretrain.log>`__) on the books corpus and English Wikipedia dataset achieves 83.6% on MNLI-mm, 93% on SST-2, 87.99% on MRPC and 80.99/88.60 on the SQuAD 1.1 validation set.

Custom Vocabulary
+++++++++++++++++

The pre-training script supports subword tokenization with a custom vocabulary using `sentencepiece <https://github.com/google/sentencepiece>`__.

To install sentencepiece, run:

.. code-block:: console

    $ pip install sentencepiece==0.1.82 --user

You can `train <https://github.com/google/sentencepiece/tree/v0.1.82/python#model-training>`__ a custom sentencepiece vocabulary by specifying the vocabulary size:

.. code-block:: python

    import sentencepiece as spm
    spm.SentencePieceTrainer.Train('--input=a.txt,b.txt --unk_id=0 --pad_id=3 --model_prefix=my_vocab --vocab_size=30000 --model_type=BPE')

To use the sentencepiece vocabulary for pre-training, please set --sentencepiece=my_vocab.model when invoking run_pretraining.py, as in the example below.
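
For example, an abbreviated single-machine sketch (the remaining arguments mirror the pre-training commands above):

.. code-block:: console

    $ MXNET_SAFE_ACCUMULATION=1 python run_pretraining.py --comm_backend device --gpus 0,1,2,3 \
             --data='folder1/*.txt,folder2/*.txt,' --raw --sentencepiece=my_vocab.model ...
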



Export BERT for Deployment
~~~~~~~~~~~~~~~~~~~~~~~~~~

The export.py script currently supports exporting BERT models. Supported values for the --task argument include classification, regression, and question answering.

.. code-block:: console

    $ python export.py --task classification --model_parameters /path/to/saved/ckpt.params --output_dir /path/to/output/dir/ --seq_length 128

This will export the BERT model for classification to a symbol.json file, saved to the directory specified by --output_dir.
The --model_parameters argument is optional. If it is not set, the .params file saved in the output directory will contain randomly initialized parameters.
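
As a rough sketch, the exported files can then be loaded for inference with `mx.gluon.SymbolBlock`. The file names and input names below are assumptions for illustration; check the actual names in your output directory and in the exported .json file.

.. code-block:: python

    import mxnet as mx

    # Hypothetical file names produced by export.py in --output_dir
    prefix = '/path/to/output/dir/classification'
    net = mx.gluon.SymbolBlock.imports(prefix + '-symbol.json',
                                       ['data0', 'data1', 'data2'],
                                       prefix + '-0000.params')
    # Inputs follow the (token ids, segment ids, valid length) convention used above
    words = mx.nd.ones((1, 128))
    segments = mx.nd.zeros((1, 128))
    valid_len = mx.nd.array([128])
    out = net(words, segments, valid_len)
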

BERT for Sentence or Token Embeddings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The goal of this BERT embedding script is to obtain the token embeddings from BERT's pre-trained model. In this way, instead of building and fine-tuning an end-to-end NLP model, you can build your model by simply using the token embeddings. You can use the command line interface below:

.. code-block:: shell

    python embedding.py --sentences "GluonNLP is a toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your Natural Language Processing (NLP) research."
    Text: g ##lu ##on ##nl ##p is a tool ##kit that enables easy text prep ##ro ##ces ##sing , data ##set ##s loading and neural models building to help you speed up your natural language processing ( nl ##p ) research .
    Tokens embedding: [array([-0.11881411, -0.59530115,  0.627092  , ...,  0.00648153,
       -0.03886228,  0.03406909], dtype=float32), array([-0.7995638 , -0.6540758 , -0.00521846, ..., -0.42272145,
       -0.5787281 ,  0.7021201 ], dtype=float32), array([-0.7406778 , -0.80276626,  0.3931962 , ..., -0.49068323,
       -0.58128357,  0.6811132 ], dtype=float32), array([-0.43287313, -1.0018158 ,  0.79617643, ..., -0.26877284,
       -0.621779  , -0.2731115 ], dtype=float32), array([-0.8515188 , -0.74098676,  0.4427735 , ..., -0.41267148,
       -0.64225197,  0.3949393 ], dtype=float32), array([-0.86652845, -0.27746758,  0.8806506 , ..., -0.87452525,
       -0.9551989 , -0.0786318 ], dtype=float32), array([-1.0987284 , -0.36603633,  0.2826037 , ..., -0.33794224,
       -0.55210876, -0.09221527], dtype=float32), array([-0.3483025 ,  0.401534  ,  0.9361341 , ..., -0.29747447,
       -0.49559578, -0.08878893], dtype=float32), array([-0.65626   , -0.14857645,  0.29733548, ..., -0.15890433,
       -0.45487815, -0.28494897], dtype=float32), array([-0.1983894 ,  0.67196256,  0.7867421 , ..., -0.7990434 ,
        0.05860569, -0.26884627], dtype=float32), array([-0.3775159 , -0.00590206,  0.5240432 , ..., -0.26754653,
       -0.37806216,  0.23336883], dtype=float32), array([ 0.1876977 ,  0.30165672,  0.47167772, ..., -0.43823618,
       -0.42823148, -0.48873612], dtype=float32), array([-0.6576557 , -0.09822252,  0.1121515 , ..., -0.21743725,
       -0.1820574 , -0.16115054], dtype=float32)]