<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
--->

# LipNet: End-to-End Sentence-level Lipreading

---

This is a Gluon implementation of [LipNet: End-to-End Sentence-level Lipreading](https://arxiv.org/abs/1611.01599).

![net_structure](asset/network_structure.png)

![sample output](https://user-images.githubusercontent.com/11376047/52533982-d7227680-2d7e-11e9-9f18-c15b952faf0e.png)

## Requirements
- Python 3.6.4
- MXNet 1.3.0
- Required disk space: 35 GB
```
pip install -r requirements.txt
```

---

## The Data
- The GRID audiovisual sentence corpus (http://spandh.dcs.shef.ac.uk/gridcorpus/)
  - GRID is a large multi-talker audiovisual sentence corpus created to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now". The corpus, together with transcriptions, is freely available for research use.
- Video: (normal) (480 MB each)
  - Each movie contains one sentence consisting of 6 words.
- Align: word alignments (190 KB each)
  - Each align file lists the 6 words of its sentence, with a start time and an end time for each word. Because this tutorial trains with CTC loss, only the sentence itself is needed, as illustrated in the sketch after this list.

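Below is a minimal, hedged sketch of how the target sentence could be read out of a GRID align file; it is illustrative only and is not the loader in data_loader.py. It assumes the standard GRID align format of one "<start> <end> <word>" entry per line, and the example path is the one used later in this README.

```
# Sketch: build the CTC target sentence from a GRID align file.
def align_to_sentence(align_file):
    words = []
    with open(align_file) as f:
        for line in f:
            start, end, word = line.split()
            if word != 'sil':  # drop the silence markers
                words.append(word)
    return ' '.join(words)

# Example path used later in this README.
print(align_to_sentence('./data/align/s2/bbbf7p.align'))
# -> 'bin blue by f seven please'
```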
---

## Pretrained model
You can train the model yourself by following the sections below, test inference with a pretrained model, or resume training from the model checkpoint. To work with the provided pretrained model, first download it, then run one of the provided Python scripts for inference (infer.py) or training (main.py).

* Download the [pretrained model](https://github.com/soeque1/temp_files/files/2848870/epoches_81_loss_15.7157.zip)
* Try inference with the following:

```
python infer.py model_path='checkpoint/epoches_81_loss_15.7157'
```

* Resume training with the following:

```
python main.py model_path='checkpoint/epoches_81_loss_15.7157'
```

## Prepare the Data

You can prepare the data yourself, or you can download preprocessed data.

### Option 1 - Download the preprocessed data

There are two download routes provided for the preprocessed data.

#### Download and untar the data
To download the gzipped tar archives directly, download the following files and extract them into a folder called `data` in the root of this example folder. You should end up with the following structure:
```
/lipnet/data/align
/lipnet/data/datasets
```

* [align files](https://mxnet-public.s3.amazonaws.com/lipnet/data-archives/align.tgz)
* [datasets files](https://mxnet-public.s3.amazonaws.com/lipnet/data-archives/datasets.tgz)

#### Use AWS CLI to sync the data
To fetch the already-extracted folders and files with the AWS CLI, use the following command. It creates the folder structure for you. Run it from `/lipnet/`:

```
aws s3 sync s3://mxnet-public/lipnet/data .
```

### Option 2 (part 1) - Download the raw dataset
- Outputs
  - Total movies (mp4): 16 GB
  - Total aligns (text): 134 MB
- Arguments
  - src_path : Path for videos (default='./data/mp4s/')
  - align_path : Path for aligns (default='./data/')
  - n_process : Number of processes (default=1)

```
cd ./utils && python download_data.py --n_process=$(nproc)
```

### Option 2 (part 2) - Preprocess the raw dataset: extract the mouth images from each video and save them

* Uses face landmark detection (http://dlib.net/).

#### Preprocess (preprocess_data.py)
* If the landmark model file is not present, it is downloaded automatically.
* Using face landmark detection, the script extracts the mouth region from each video.

- Example:
  - video: ./data/mp4s/s2/bbbf7p.mpg
  - align (target): ./data/align/s2/bbbf7p.align
    : 'sil bin blue by f seven please sil'


- Video to the images (75 frames)

Frame 0            |  Frame 1 | ... | Frame 74 |
:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:
![](asset/s2_bbbf7p_000.png)  |  ![](asset/s2_bbbf7p_001.png) |  ...  |  ![](asset/s2_bbbf7p_074.png)

  - Extract the mouth from the images

Frame 0            |  Frame 1 | ... | Frame 74 |
:-------------------------:|:-------------------------:|:-------------------------:|:-------------------------:
![](asset/mouth_000.png)  |  ![](asset/mouth_001.png) |  ...  |  ![](asset/mouth_074.png)

* Save the resulting images into tgt_path. A minimal sketch of this mouth-cropping step follows below.

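The following is a hedged sketch of the mouth-cropping idea using dlib and OpenCV; it is not the exact code in preprocess_data.py, and the landmark model filename and the crop margin are assumptions made for illustration.

```
import cv2
import dlib

# Assumption: the standard dlib 68-point landmark model file;
# preprocess_data.py may name or fetch this differently.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

def crop_mouth(frame, margin=10):
    """Return the mouth region of one video frame, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the mouth in the 68-point model.
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    return frame[min(ys) - margin:max(ys) + margin,
                 min(xs) - margin:max(xs) + margin]
```

Each of the 75 frames of a video would be passed through a function like this and saved as mouth_000.png, mouth_001.png, and so on, matching the layout shown in the output section below.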
----

#### How to run the preprocess script

- Arguments
  - src_path : Path for videos (default='./data/mp4s/')
  - tgt_path : Path for preprocessed images (default='./data/datasets/')
  - n_process : Number of processes (default=1)

- Outputs
  - Total images (png): 19 GB
- Elapsed time
  - About 54 hours using 1 process
  - Using multiple processes reduces the time roughly in proportion to the number of processes.
    - e.g.) 9 hours using 6 processes

You can run the preprocessing with just one process, but this will take a long time (>48 hours). To use all of the available processors, use the following command:

```
cd ./utils && python preprocess_data.py --n_process=$(nproc)
```

#### Output: Data structure of the preprocessed data

```
The training data folder should look like:
<train_data_root>
                |--datasets
                        |--s1
                           |--bbir7s
                               |--mouth_000.png
                               |--mouth_001.png
                                   ...
                           |--bgaa8p
                               |--mouth_000.png
                               |--mouth_001.png
                                  ...
                        |--s2
                            ...
                 |--align
                         |--bw1d8a.align
                         |--bggzzs.align
                             ...

```

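To sanity-check the preprocessed data before training, a small script along these lines can be used; this is a hedged sketch that only assumes the layout above and the 75-frame count mentioned earlier, and the data root path is illustrative.

```
import glob
import os

# Illustrative data root; point this at your preprocessed data.
data_root = './data'

for utterance in sorted(glob.glob(os.path.join(data_root, 'datasets', 's*', '*'))):
    frames = glob.glob(os.path.join(utterance, 'mouth_*.png'))
    # Each utterance should contain 75 mouth frames.
    if len(frames) != 75:
        print('incomplete utterance:', utterance, len(frames))
```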
---

## Training
After you have acquired the preprocessed data, you are ready to train the LipNet model.

- According to [LipNet: End-to-End Sentence-level Lipreading](https://arxiv.org/abs/1611.01599), four of the 34 subjects (S1, S2, S20, S22) are used for evaluation.
  The other subjects are used for training (see the sketch after the command below).

- When training on multiple GPUs, it is recommended to scale the batch size by the number of GPUs.

  - e.g.) 1 GPU with batch size 128 → 2 GPUs with batch size 256


- Arguments
  - batch_size : Define batch size (default=64)
  - epochs : Define total epochs (default=100)
  - image_path : Path for lip image files (default='./data/datasets/')
  - align_path : Path for align files (default='./data/align/')
  - dr_rate : Dropout rate (default=0.5)
  - num_gpus : Number of GPUs (if num_gpus is 0, the CPU is used) (default=1)
  - num_workers : Number of workers for data loading (default=0)
  - model_path : Path of pretrained model (default=None)

```
python main.py
```

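As a hedged illustration of the speaker split described above (main.py may select speakers differently internally), the evaluation and training subject lists could be built like this:

```
# Subjects held out for evaluation, per the LipNet paper.
eval_subjects = {'s1', 's2', 's20', 's22'}

# The remaining GRID talkers (s1..s34) are used for training.
all_subjects = {'s%d' % i for i in range(1, 35)}
train_subjects = sorted(all_subjects - eval_subjects)
print(train_subjects)
```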
---

## Test Environment
- 72 CPU cores
- 1 GPU (NVIDIA Tesla V100 SXM2 32 GB)
- Batch size 128

  - It takes over 24 hours (60 epochs) to get good results.

---

## Inference

- Arguments
  - batch_size : Define batch size (default=64)
  - image_path : Path for lip image files (default='./data/datasets/')
  - align_path : Path for align files (default='./data/align/')
  - num_gpus : Number of GPUs (if num_gpus is 0, the CPU is used) (default=1)
  - num_workers : Number of workers for data loading (default=0)
  - data_type : 'train' or 'valid' (default='valid')
  - model_path : Path of pretrained model (default=None)

```
python infer.py --model_path=$(model_path)
```

```
[Target]
['lay green with a zero again',
 'bin blue with r nine please',
 'set blue with e five again',
 'bin green by t seven soon',
 'lay red at d five now',
 'bin green in x eight now',
 'bin blue with e one now',
 'lay red at j nine now']
```

```
[Pred]
['lay green with s zero again',
 'bin blue with r nine please',
 'set blue with e five again',
 'bin green by t seven soon',
 'lay red at c five now',
 'bin green in x eight now',
 'bin blue with m one now',
 'lay red at j nine now']
```
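To summarize the target/prediction pairs above with a single number, a word-level accuracy (LipNet itself is evaluated with word error rate) can be computed with a small hedged sketch like this; the two lists are the ones printed above, truncated here to the first two entries.

```
# Hedged sketch: word-level accuracy over aligned target/prediction pairs.
targets = ['lay green with a zero again', 'bin blue with r nine please']
preds   = ['lay green with s zero again', 'bin blue with r nine please']

correct = total = 0
for t, p in zip(targets, preds):
    for tw, pw in zip(t.split(), p.split()):
        correct += (tw == pw)
        total += 1
print('word accuracy: %.3f' % (correct / total))
```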