Function mode for Search & replace in the Editor
================================================

The :guilabel:`Search & replace` tool in the editor supports a *function mode*.
In this mode, you can combine regular expressions (see :doc:`regexp`) with
arbitrarily powerful Python functions to do all sorts of advanced text
processing.

In the standard *regexp* mode for search and replace, you specify both a
regular expression to search for as well as a template that is used to replace
all found matches. In function mode, instead of using a fixed template, you
specify an arbitrary function, in the
`Python programming language <https://docs.python.org>`_. This allows
you to do lots of things that are not possible with simple templates.

Techniques for using function mode and the syntax will be described by means of
examples, showing you how to create functions to perform progressively more
complex tasks.


.. image:: images/function_replace.png
    :alt: The Function mode
    :align: center

Automatically fixing the case of headings in the document
---------------------------------------------------------

Here, we will leverage one of the builtin functions in the editor to
automatically change the case of all text inside heading tags to title case::

    Find expression: <([Hh][1-6])[^>]*>.+?</\1>

For the function, simply choose the :guilabel:`Title-case text (ignore tags)` builtin
function. This will change titles that look like: ``<h1>some TITLE</h1>`` to
``<h1>Some Title</h1>``. It will work even if there are other HTML tags inside
the heading tags.
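To give a feel for what such a function involves, here is a rough sketch of a
custom function with similar behaviour. It is not the implementation of the
builtin function, only an approximation written for illustration: it splits the
matched text into tags, entities and plain text, and capitalizes the words in
the plain text only.

.. code-block:: python

    import re

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # re.split() with a capturing group keeps the separators: even indexes
        # hold plain text, odd indexes hold tags or entities such as &amp;
        parts = re.split(r'(<[^<>]+>|&#?\w+;)', match.group())
        for i in range(0, len(parts), 2):
            # Capitalize each run of letters, keeping an inner apostrophe
            # (as in "don't") inside the same word
            parts[i] = re.sub(r"[^\W\d_]+(?:'[^\W\d_]+)?",
                              lambda m: m.group().capitalize(), parts[i])
        return ''.join(parts)

Unlike a real title-casing routine, this sketch capitalizes every word,
including short words such as "of" and "the".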


Your first custom function - smartening hyphens
-----------------------------------------------

The real power of function mode comes from being able to create your own
functions to process text in arbitrary ways. The Smarten Punctuation tool in
the editor leaves individual hyphens alone, so you can use this function to
replace them with em-dashes.

To create a new function, click the :guilabel:`Create/edit` button and copy in
the Python code from below.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        return match.group().replace('--', '—').replace('-', '—')

Every :guilabel:`Search & replace` custom function must have a unique name and consist of a
Python function named ``replace``, that accepts all the arguments shown above.
For the moment, we won't worry about all the different arguments to the
``replace()`` function. Just focus on the ``match`` argument. It represents a
match when running a search and replace. Its full documentation is available
`here <https://docs.python.org/library/re.html#match-objects>`_.
``match.group()`` simply returns all the matched text and all we do is replace
hyphens in that text with em-dashes, first replacing double hyphens and
then single hyphens.

Use this function with the find regular expression::

    >[^<>]+<

And it will replace all hyphens with em-dashes, but only in actual text and not
inside HTML tag definitions.
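Note that the function above also converts the hyphens inside compound words
such as ``well-known``. If that is too aggressive for your book, a more
conservative variant could look like the sketch below. The rule it applies
(convert runs of hyphens and hyphens surrounded by whitespace, leave all other
hyphens alone) is an assumption made for this example, not something the editor
prescribes.

.. code-block:: python

    import re

    EM_DASH = '\u2014'

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        text = match.group()
        # Two or more consecutive hyphens always become a single em-dash
        text = re.sub(r'-{2,}', EM_DASH, text)
        # A lone hyphen is converted only when it has whitespace on both sides
        return re.sub(r'(?<=\s)-(?=\s)', EM_DASH, text)

It is used with the same find expression, ``>[^<>]+<``.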


The power of function mode - using a spelling dictionary to fix mis-hyphenated words
------------------------------------------------------------------------------------

Often, e-books created from scans of printed books contain mis-hyphenated words
-- words that were split at the end of the line on the printed page. We will
write a simple function to automatically find and fix such words.

.. code-block:: python

    import regex
    from calibre import replace_entities
    from calibre import prepare_string_for_xml

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

        def replace_word(wmatch):
            # Try to remove the hyphen and replace the words if the resulting
            # hyphen free word is recognized by the dictionary
            without_hyphen = wmatch.group(1) + wmatch.group(2)
            if dictionaries.recognized(without_hyphen):
                return without_hyphen
            return wmatch.group()

        # Search for words split by a hyphen
        text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
        corrected = regex.sub(r'(\w+)\s*-\s*(\w+)', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
        return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities

Use this function with the same find expression as before, namely::

    >[^<>]+<

And it will magically fix all mis-hyphenated words in the text of the book. The
main trick is to use one of the useful extra arguments to the replace function,
``dictionaries``. This refers to the dictionaries the editor itself uses to
spell check text in the book. What this function does is look for words
separated by a hyphen, remove the hyphen and check if the dictionary recognizes
the composite word. If it does, the original words are replaced by the
hyphen-free composite word.

Note that one limitation of this technique is it will only work for
mono-lingual books, because, by default, ``dictionaries.recognized()`` uses the
main language of the book.


Auto numbering sections
-----------------------

Now we will see something a little different. Suppose your HTML file has many
sections, each with a heading in an :code:`<h2>` tag that looks like
:code:`<h2>Some text</h2>`. You can create a custom function that will
automatically number these headings with consecutive section numbers, so that
they look like :code:`<h2>1. Some text</h2>`.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        section_number = '%d. ' % number
        return match.group(1) + section_number + match.group(2)

    # Ensure that when running over multiple files, the files are processed
    # in the order in which they appear in the book
    replace.file_order = 'spine'

Use it with the find expression::

    (?s)(<h2[^<>]*>)(.+?</h2>)

Place the cursor at the top of the file and click :guilabel:`Replace all`.

This function uses another of the useful extra arguments to ``replace()``: the
``number`` argument. When doing a :guilabel:`Replace All`, ``number`` is
automatically incremented for every successive match.

Another new feature is the use of ``replace.file_order`` -- setting that to
``'spine'`` means that if this search is run on multiple HTML files, the files
are processed in the order in which they appear in the book. See
:ref:`file_order_replace_all` for details.
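Because ``number`` counts every match, it cannot by itself produce nested
numbers such as ``1.1``. One possible approach, sketched here under the
assumption that only ``<h2>`` and ``<h3>`` headings are numbered, is to keep
your own counters in the ``data`` argument (a ``dict`` that is shared by all
calls during one :guilabel:`Replace all`). The key name ``'counters'`` is an
arbitrary choice for this example.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # data is shared by all calls made during a single Replace all
        counters = data.setdefault('counters', {'h2': 0, 'h3': 0})
        tag_name = match.group(2).lower()
        if tag_name == 'h2':
            counters['h2'] += 1
            counters['h3'] = 0  # restart sub-section numbering
            label = '%d. ' % counters['h2']
        else:
            counters['h3'] += 1
            label = '%d.%d ' % (counters['h2'], counters['h3'])
        return match.group(1) + label + match.group(3)

    replace.file_order = 'spine'

This sketch expects a find expression whose second group captures the tag
name, for example::

    (?s)(<(h[23])[^<>]*>)(.+?</\2>)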


Auto create a Table of Contents
-------------------------------

Finally, let's try something a little more ambitious. Suppose your book has
headings in ``h1`` and ``h2`` tags that look like
``<h1 id="someid">Some Text</h1>``. We will auto-generate an HTML Table of
Contents based on these headings. Create the custom function below:

.. code-block:: python

    from calibre import replace_entities
    from calibre.ebooks.oeb.polish.toc import TOC, toc_to_html
    from calibre.gui2.tweak_book import current_container
    from calibre.ebooks.oeb.base import xml2str

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # All matches found, output the resulting Table of Contents.
            # The argument metadata is the metadata of the book being edited
            if 'toc' in data:
                toc = data['toc']
                root = TOC()
                for (file_name, tag_name, anchor, text) in toc:
                    parent = root.children[-1] if tag_name == 'h2' and root.children else root
                    parent.add(text, file_name, anchor)
                toc = toc_to_html(root, current_container(), 'toc.html', 'Table of Contents for ' + metadata.title, metadata.language)
                print(xml2str(toc))
            else:
                print('No headings to build ToC from found')
        else:
            # Add an entry corresponding to this match to the Table of Contents
            if 'toc' not in data:
                # The entries are stored in the data object, which will persist
                # for all invocations of this function during a 'Replace All' operation
                data['toc'] = []
            tag_name, anchor, text = match.group(1), replace_entities(match.group(2)), replace_entities(match.group(3))
            data['toc'].append((file_name, tag_name, anchor, text))
            return match.group()  # We don't want to make any actual changes, so return the original matched text

    # Ensure that we are called once after the last match is found so we can
    # output the ToC
    replace.call_after_last_match = True
    # Ensure that when running over multiple files, the files are processed
    # in the order in which they appear in the book
    replace.file_order = 'spine'

And use it with the find expression::

    <(h[12]) [^<>]* id=['"]([^'"]+)['"][^<>]*>([^<>]+)

Run the search on :guilabel:`All text files` and at the end of the search, a
window will pop up with "Debug output from your function" which will have the
HTML Table of Contents, ready to be pasted into :file:`toc.html`.

The function above is heavily commented, so it should be easy to follow. The
key new feature is the use of another useful extra argument to the
``replace()`` function, the ``data`` object. The ``data`` object is a Python
*dict* that persists between all successive invocations of ``replace()`` during
a single :guilabel:`Replace All` operation.

Another new feature is the use of ``call_after_last_match`` -- setting that to
``True`` on the ``replace()`` function means that the editor will call
``replace()`` one extra time after all matches have been found. For this extra
call, the match object will be ``None``.

This was just a demonstration to show you the power of function mode.
If you really needed to generate a Table of Contents from headings in your book,
you would be better off using the dedicated Table of Contents tool in
:guilabel:`Tools->Table of Contents`.
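The same two features, ``data`` and ``call_after_last_match``, are also handy
for much smaller jobs. As a minimal sketch, the following function changes
nothing and only reports how many matches were found in each file. The key
name ``'counts'`` is an arbitrary choice made for this illustration.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # Extra call after the last match: print the totals per file
            for name, count in data.get('counts', {}).items():
                print('%s: %d' % (name, count))
        else:
            counts = data.setdefault('counts', {})
            counts[file_name] = counts.get(file_name, 0) + 1
            return match.group()  # leave the matched text unchanged

    replace.call_after_last_match = True
    replace.file_order = 'spine'

It can be used with any find expression, since the matched text is always
returned as it was found.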

The API for the function mode
-----------------------------

All function mode functions must be Python functions named ``replace``, with the
following signature::

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        return a_string

When a find/replace is run, the ``replace()`` function is called for every
match that is found. It must return the replacement string for that match.
If no replacements are to be done, it should return ``match.group()`` which is
the original string. The various arguments to the ``replace()`` function are
documented below.

The ``match`` argument
^^^^^^^^^^^^^^^^^^^^^^

The ``match`` argument represents the currently found match. It is a
`Python Match object <https://docs.python.org/library/re.html#match-objects>`_.
Its most useful method is ``group()`` which can be used to get the matched
text corresponding to individual capture groups in the search regular
expression.
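As a small illustration of ``group()``, the sketch below assumes a find
expression with three capture groups that matches dates written as
day/month/year, and rewrites them in year-month-day order. Both the date format
and the find expression are assumptions made for the sake of the example.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # group() with no arguments is the whole match;
        # group(1), group(2), ... are the capture groups, in order
        day, month, year = match.group(1), match.group(2), match.group(3)
        return '%s-%02d-%02d' % (year, int(month), int(day))

A matching find expression could be::

    \b(\d{1,2})/(\d{1,2})/(\d{4})\b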

The ``number`` argument
^^^^^^^^^^^^^^^^^^^^^^^

The ``number`` argument is the number of the current match. When you run
:guilabel:`Replace All`, every successive match will cause ``replace()`` to be
called with an increasing number. The first match has number 1.

The ``file_name`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the filename of the file in which the current match was found. When
searching inside marked text, the ``file_name`` is empty. The ``file_name`` is
in canonical form, a path relative to the root of the book, using ``/`` as the
path separator.
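For instance, ``file_name`` and ``number`` can be combined to build
identifiers. The following sketch assumes a find expression such as
``<(h[1-6])>`` that matches heading tags without attributes, and gives each
heading an ``id`` made from the base name of the file and the match number.
The ``'marked'`` fallback used for marked text is a made-up value.

.. code-block:: python

    import posixpath

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # file_name uses / as separator, so posixpath works on every platform.
        # It is empty when searching marked text, hence the fallback.
        base = posixpath.splitext(posixpath.basename(file_name))[0] or 'marked'
        return '<%s id="%s-%d">' % (match.group(1), base, number)

The sketch does not check that the file name contains only characters that are
valid in an ``id``, which a real function would probably want to do.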

The ``metadata`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^

This represents the metadata of the current book, such as title, authors,
language, etc. It is an object of class :class:`calibre.ebooks.metadata.book.base.Metadata`.
Useful attributes include ``title``, ``authors`` (a list of authors) and
``language`` (the language code).
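As an illustration, the sketch below replaces a placeholder in the text with a
byline built from the metadata. The placeholder ``[[BYLINE]]`` is hypothetical,
and the standard library function ``html.escape()`` is used here only to keep
the example self-contained.

.. code-block:: python

    from html import escape

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # authors is a list, so join it into a single string
        byline = '%s by %s' % (metadata.title, ' & '.join(metadata.authors))
        # Escape &, < and > so the result can be inserted into HTML text
        return escape(byline, quote=False)

Use it with a find expression that matches the placeholder, for example::

    \[\[BYLINE\]\]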

The ``dictionaries`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This represents the collection of dictionaries used for spell checking the
current book. Its most useful method is ``dictionaries.recognized(word)``
which will return ``True`` if the passed in word is recognized by the dictionary
for the current book's language.
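A function does not have to change the text to make use of the dictionaries.
The sketch below, meant to be used with the find expression ``>[^<>]+<``,
collects every word that is not recognized and prints the sorted list once all
matches have been processed. It relies only on ``recognized()`` as described
above. The word pattern and the key name ``'unknown'`` are simplifications
chosen for this example.

.. code-block:: python

    import re
    from html import unescape

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # Extra call after the last match: print the collected words
            for word in sorted(data.get('unknown', ())):
                print(word)
        else:
            # Drop the surrounding > and <, and decode entities like &amp;
            text = unescape(match.group()[1:-1])
            for word in re.findall(r'[^\W\d_]+', text):
                if not dictionaries.recognized(word):
                    data.setdefault('unknown', set()).add(word)
            return match.group()  # leave the text unchanged

    replace.call_after_last_match = True

The simple word pattern treats an apostrophe as a word boundary, so
contractions are checked in two pieces.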

The ``data`` argument
^^^^^^^^^^^^^^^^^^^^^

This is a simple Python ``dict``. When you run
:guilabel:`Replace all`, every successive match will cause ``replace()`` to be
called with the same ``dict`` as data. You can thus use it to store arbitrary
data between invocations of ``replace()`` during a :guilabel:`Replace all`
operation.

The ``functions`` argument
^^^^^^^^^^^^^^^^^^^^^^^^^^

The ``functions`` argument gives you access to all other user-defined
functions. This is useful for code re-use. You can define utility functions in
one place and re-use them in all your other functions. For example, suppose you
create a function named ``My Function`` like this:

.. code-block:: python

    def utility():
        ...  # do something

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

Then, in another function, you can access the ``utility()`` function like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        utility = functions['My Function']['utility']
        ...

You can also use the functions object to store persistent data that can be
re-used by other functions. For example, you could have one function that when
run with :guilabel:`Replace All` collects some data and another function that
uses it when it is run afterwards. Consider the following two functions:

.. code-block:: python

    # Function One
    persistent_data = {}

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...
        persistent_data['something'] = 'some data'

    # Function Two
    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        persistent_data = functions['Function One']['persistent_data']
        ...

Debugging your functions
^^^^^^^^^^^^^^^^^^^^^^^^

You can debug the functions you create by using the standard ``print()``
function from Python. The output of print will be displayed in a popup window
after the Find/replace has completed. You saw an example of using ``print()``
to output an entire table of contents above.
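A simple way to check a find expression before writing the real function is a
function that only prints what it was called with. This is a minimal sketch;
the output format is an arbitrary choice.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        # Show the match number, the file and the matched text,
        # then return the text unchanged
        print('%d %s %r' % (number, file_name, match.group()))
        return match.group()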

.. _file_order_replace_all:

Choose file order when running on multiple HTML files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When you run a :guilabel:`Replace all` on multiple HTML files, the order in
which the files are processed depends on what files you have open for editing.
You can force the search to process files in the order in which they appear by
setting the ``file_order`` attribute on your function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.file_order = 'spine'

``file_order`` accepts two values, ``spine`` and ``spine-reverse``, which cause
the search to process multiple files in the order they appear in the book,
either forwards or backwards, respectively.

Having your function called an extra time after the last match is found
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sometimes, as in the auto generate table of contents example above, it is
useful to have your function called an extra time after the last match is
found. You can do this by setting the ``call_after_last_match`` attribute on your
function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.call_after_last_match = True


Appending the output from the function to marked text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When running search and replace on marked text, it is sometimes useful to
append some text to the end of the marked text. You can do that by setting
the ``append_final_output_to_marked`` attribute on your function (note that you
also need to set ``call_after_last_match``), like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...
        return 'some text to append'

    replace.call_after_last_match = True
    replace.append_final_output_to_marked = True
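In practice the appended text usually depends on what was found. The following
sketch assumes, as in the Table of Contents example, that the extra call made
after the last match passes ``None`` as ``match``, and that the value returned
from that call is the text that gets appended. It leaves the matches unchanged
and appends an HTML comment with the number of matches. The key name
``'count'`` is chosen only for this illustration.

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        if match is None:
            # Final call: build the text to append from the collected data
            return '<!-- %d matches processed -->' % data.get('count', 0)
        data['count'] = data.get('count', 0) + 1
        return match.group()  # leave the matched text unchanged

    replace.call_after_last_match = True
    replace.append_final_output_to_marked = True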

Suppressing the result dialog when performing searches on marked text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also suppress the result dialog (which can slow down the repeated
application of a search/replace on many blocks of text) by setting
the ``suppress_result_dialog`` attribute on your function, like this:

.. code-block:: python

    def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
        ...

    replace.suppress_result_dialog = True


More examples
----------------

More useful examples, contributed by calibre users, can be found in the
`calibre E-book editor forum <https://www.mobileread.com/forums/showthread.php?t=237181>`_.