:mod:`shlex` --- Simple lexical analysis
========================================

.. module:: shlex
   :synopsis: Simple lexical analysis for Unix shell-like languages.

.. moduleauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. moduleauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>
.. sectionauthor:: Eric S. Raymond <esr@snark.thyrsus.com>
.. sectionauthor:: Gustavo Niemeyer <niemeyer@conectiva.com>

**Source code:** :source:`Lib/shlex.py`

--------------

The :class:`~shlex.shlex` class makes it easy to write lexical analyzers for
simple syntaxes resembling that of the Unix shell. This will often be useful
for writing minilanguages (for example, in run control files for Python
applications) or for parsing quoted strings.

The :mod:`shlex` module defines the following functions:


.. function:: split(s, comments=False, posix=True)

   Split the string *s* using shell-like syntax. If *comments* is :const:`False`
   (the default), the parsing of comments in the given string will be disabled
   (setting the :attr:`~shlex.commenters` attribute of the
   :class:`~shlex.shlex` instance to the empty string). This function operates
   in POSIX mode by default, but uses non-POSIX mode if the *posix* argument is
   false.

   .. note::

      Since the :func:`split` function instantiates a :class:`~shlex.shlex`
      instance, passing ``None`` for *s* will read the string to split from
      standard input.


.. function:: join(split_command)

   Concatenate the tokens of the list *split_command* and return a string.
   This function is the inverse of :func:`split`.

      >>> from shlex import join
      >>> print(join(['echo', '-n', 'Multiple words']))
      echo -n 'Multiple words'

   The returned value is shell-escaped to protect against injection
   vulnerabilities (see :func:`quote`).

   .. versionadded:: 3.8


.. function:: quote(s)

   Return a shell-escaped version of the string *s*.
   The returned value is a
   string that can safely be used as one token in a shell command line, for
   cases where you cannot use a list.

   This idiom would be unsafe:

      >>> filename = 'somefile; rm -rf ~'
      >>> command = 'ls -l {}'.format(filename)
      >>> print(command)  # executed by a shell: boom!
      ls -l somefile; rm -rf ~

   :func:`quote` lets you plug the security hole:

      >>> from shlex import quote
      >>> command = 'ls -l {}'.format(quote(filename))
      >>> print(command)
      ls -l 'somefile; rm -rf ~'
      >>> remote_command = 'ssh home {}'.format(quote(command))
      >>> print(remote_command)
      ssh home 'ls -l '"'"'somefile; rm -rf ~'"'"''

   The quoting is compatible with UNIX shells and with :func:`split`:

      >>> from shlex import split
      >>> remote_command = split(remote_command)
      >>> remote_command
      ['ssh', 'home', "ls -l 'somefile; rm -rf ~'"]
      >>> command = split(remote_command[-1])
      >>> command
      ['ls', '-l', 'somefile; rm -rf ~']

   .. versionadded:: 3.3

The :mod:`shlex` module defines the following class:


.. class:: shlex(instream=None, infile=None, posix=False, punctuation_chars=False)

   A :class:`~shlex.shlex` instance or subclass instance is a lexical analyzer
   object. The initialization argument, if present, specifies where to read
   characters from. It must be a file-/stream-like object with
   :meth:`~io.TextIOBase.read` and :meth:`~io.TextIOBase.readline` methods, or
   a string. If no argument is given, input will be taken from ``sys.stdin``.
   The second optional argument is a filename string, which sets the initial
   value of the :attr:`~shlex.infile` attribute. If the *instream*
   argument is omitted or equal to ``sys.stdin``, this second argument
   defaults to "stdin". The *posix* argument defines the operational mode:
   when *posix* is not true (default), the :class:`~shlex.shlex` instance will
   operate in compatibility mode.
   When operating in POSIX mode,
   :class:`~shlex.shlex` will try to be as close as possible to the POSIX shell
   parsing rules. The *punctuation_chars* argument provides a way to make the
   behaviour even closer to how real shells parse. This can take a number of
   values: the default value, ``False``, preserves the behaviour seen under
   Python 3.5 and earlier. If set to ``True``, then parsing of the characters
   ``();<>|&`` is changed: any run of these characters (considered punctuation
   characters) is returned as a single token. If set to a non-empty string of
   characters, those characters will be used as the punctuation characters. Any
   characters in the :attr:`wordchars` attribute that appear in
   *punctuation_chars* will be removed from :attr:`wordchars`. See
   :ref:`improved-shell-compatibility` for more information. *punctuation_chars*
   can be set only upon :class:`~shlex.shlex` instance creation and can't be
   modified later.

   .. versionchanged:: 3.6
      The *punctuation_chars* parameter was added.

.. seealso::

   Module :mod:`configparser`
      Parser for configuration files similar to the Windows :file:`.ini` files.


.. _shlex-objects:

shlex Objects
-------------

A :class:`~shlex.shlex` instance has the following methods:


.. method:: shlex.get_token()

   Return a token. If tokens have been stacked using :meth:`push_token`, pop a
   token off the stack. Otherwise, read one from the input stream. If reading
   encounters an immediate end-of-file, :attr:`eof` is returned (the empty
   string (``''``) in non-POSIX mode, and ``None`` in POSIX mode).


.. method:: shlex.push_token(str)

   Push the argument onto the token stack.


.. method:: shlex.read_token()

   Read a raw token. Ignore the pushback stack, and do not interpret source
   requests.
   (This is not ordinarily a useful entry point, and is documented here
   only for the sake of completeness.)


.. method:: shlex.sourcehook(filename)

   When :class:`~shlex.shlex` detects a source request (see :attr:`source`
   below) this method is given the following token as argument, and expected
   to return a tuple consisting of a filename and an open file-like object.

   Normally, this method first strips any quotes off the argument. If the result
   is an absolute pathname, or there was no previous source request in effect, or
   the previous source was a stream (such as ``sys.stdin``), the result is left
   alone. Otherwise, if the result is a relative pathname, the directory part of
   the name of the file immediately before it on the source inclusion stack is
   prepended (this behavior is like the way the C preprocessor handles ``#include
   "file.h"``).

   The result of the manipulations is treated as a filename, and returned as the
   first component of the tuple, with :func:`open` called on it to yield the second
   component. (Note: this is the reverse of the order of arguments in instance
   initialization!)

   This hook is exposed so that you can use it to implement directory search paths,
   addition of file extensions, and other namespace hacks. There is no
   corresponding 'close' hook, but a shlex instance will call the
   :meth:`~io.IOBase.close` method of the sourced input stream when it returns
   EOF.

   For more explicit control of source stacking, use the :meth:`push_source` and
   :meth:`pop_source` methods.


.. method:: shlex.push_source(newstream, newfile=None)

   Push an input source stream onto the input stack. If the filename argument is
   specified it will later be available for use in error messages. This is the
   same method used internally by the :meth:`sourcehook` method.


.. method:: shlex.pop_source()

   Pop the last-pushed input source from the input stack. This is the same method
   used internally when the lexer reaches EOF on a stacked input stream.


.. method:: shlex.error_leader(infile=None, lineno=None)

   This method generates an error message leader in the format of a Unix C compiler
   error label; the format is ``'"%s", line %d: '``, where the ``%s`` is replaced
   with the name of the current source file and the ``%d`` with the current input
   line number (the optional arguments can be used to override these).

   This convenience is provided to encourage :mod:`shlex` users to generate error
   messages in the standard, parseable format understood by Emacs and other Unix
   tools.

Instances of :class:`~shlex.shlex` subclasses have some public instance
variables which either control lexical analysis or can be used for debugging:


.. attribute:: shlex.commenters

   The string of characters that are recognized as comment beginners. All
   characters from the comment beginner to end of line are ignored. Includes just
   ``'#'`` by default.


.. attribute:: shlex.wordchars

   The string of characters that will accumulate into multi-character tokens. By
   default, includes all ASCII alphanumerics and underscore. In POSIX mode, the
   accented characters in the Latin-1 set are also included. If
   :attr:`punctuation_chars` is not empty, the characters ``~-./*?=``, which can
   appear in filename specifications and command line parameters, will also be
   included in this attribute, and any characters which appear in
   ``punctuation_chars`` will be removed from ``wordchars`` if they are present
   there. If :attr:`whitespace_split` is set to ``True``, this will have no
   effect.


.. attribute:: shlex.whitespace

   Characters that will be considered whitespace and skipped. Whitespace bounds
   tokens.
   By default, includes space, tab, linefeed and carriage-return.


.. attribute:: shlex.escape

   Characters that will be considered as escape. This is only used in POSIX
   mode, and includes just ``'\'`` by default.


.. attribute:: shlex.quotes

   Characters that will be considered string quotes. The token accumulates until
   the same quote is encountered again (thus, different quote types protect each
   other, as in the shell). By default, includes ASCII single and double quotes.


.. attribute:: shlex.escapedquotes

   Characters in :attr:`quotes` that will interpret escape characters defined in
   :attr:`escape`. This is only used in POSIX mode, and includes just ``'"'`` by
   default.


.. attribute:: shlex.whitespace_split

   If ``True``, tokens will only be split on whitespace. This is useful, for
   example, for parsing command lines with :class:`~shlex.shlex`, getting
   tokens in a similar way to shell arguments. When used in combination with
   :attr:`punctuation_chars`, tokens will be split on whitespace in addition to
   those characters.

   .. versionchanged:: 3.8
      The :attr:`punctuation_chars` attribute was made compatible with the
      :attr:`whitespace_split` attribute.


.. attribute:: shlex.infile

   The name of the current input file, as initially set at class instantiation time
   or stacked by later source requests. It may be useful to examine this when
   constructing error messages.


.. attribute:: shlex.instream

   The input stream from which this :class:`~shlex.shlex` instance is reading
   characters.


.. attribute:: shlex.source

   This attribute is ``None`` by default. If you assign a string to it, that
   string will be recognized as a lexical-level inclusion request similar to the
   ``source`` keyword in various shells.
   That is, the immediately following token
   will be opened as a filename and input will be taken from that stream until
   EOF, at which point the :meth:`~io.IOBase.close` method of that stream will be
   called and the input source will again become the original input stream. Source
   requests may be stacked any number of levels deep.


.. attribute:: shlex.debug

   If this attribute is numeric and ``1`` or more, a :class:`~shlex.shlex`
   instance will print verbose progress output on its behavior. If you need
   to use this, you can read the module source code to learn the details.


.. attribute:: shlex.lineno

   Source line number (count of newlines seen so far plus one).


.. attribute:: shlex.token

   The token buffer. It may be useful to examine this when catching exceptions.


.. attribute:: shlex.eof

   Token used to determine end of file. This will be set to the empty string
   (``''``), in non-POSIX mode, and to ``None`` in POSIX mode.


.. attribute:: shlex.punctuation_chars

   A read-only property. Characters that will be considered punctuation. Runs of
   punctuation characters will be returned as a single token. However, note that no
   semantic validity checking will be performed: for example, '>>>' could be
   returned as a token, even though it may not be recognised as such by shells.

   .. versionadded:: 3.6


.. _shlex-parsing-rules:

Parsing Rules
-------------

When operating in non-POSIX mode, :class:`~shlex.shlex` will try to obey the
following rules.

* Quote characters are not recognized within words (``Do"Not"Separate`` is
  parsed as the single word ``Do"Not"Separate``);

* Escape characters are not recognized;

* Enclosing characters in quotes preserve the literal value of all characters
  within the quotes;

* Closing quotes separate words (``"Do"Separate`` is parsed as ``"Do"`` and
  ``Separate``);

* If :attr:`~shlex.whitespace_split` is ``False``, any character not
  declared to be a word character, whitespace, or a quote will be returned as
  a single-character token. If it is ``True``, :class:`~shlex.shlex` will only
  split words on whitespace;

* EOF is signaled with an empty string (``''``);

* It's not possible to parse empty strings, even if quoted.

When operating in POSIX mode, :class:`~shlex.shlex` will try to obey the
following parsing rules.

* Quotes are stripped out, and do not separate words (``"Do"Not"Separate"`` is
  parsed as the single word ``DoNotSeparate``);

* Non-quoted escape characters (e.g. ``'\'``) preserve the literal value of the
  next character that follows;

* Enclosing characters in quotes which are not part of
  :attr:`~shlex.escapedquotes` (e.g. ``"'"``) preserve the literal value
  of all characters within the quotes;

* Enclosing characters in quotes which are part of
  :attr:`~shlex.escapedquotes` (e.g. ``'"'``) preserve the literal value
  of all characters within the quotes, with the exception of the characters
  mentioned in :attr:`~shlex.escape`. The escape characters retain their
  special meaning only when followed by the quote in use, or the escape
  character itself. Otherwise the escape character will be considered a
  normal character.

* EOF is signaled with a :const:`None` value;

* Quoted empty strings (``''``) are allowed.

.. _improved-shell-compatibility:

Improved Compatibility with Shells
----------------------------------

.. versionadded:: 3.6

The :class:`shlex` class provides compatibility with the parsing performed by
common Unix shells like ``bash``, ``dash``, and ``sh``. To take advantage of
this compatibility, specify the ``punctuation_chars`` argument in the
constructor. This defaults to ``False``, which preserves pre-3.6 behaviour.
However, if it is set to ``True``, then parsing of the characters ``();<>|&``
is changed: any run of these characters is returned as a single token. While
this is short of a full parser for shells (which would be out of scope for the
standard library, given the multiplicity of shells out there), it does allow
you to perform processing of command lines more easily than you could
otherwise. To illustrate, you can see the difference in the following snippet:

.. doctest::
   :options: +NORMALIZE_WHITESPACE

   >>> import shlex
   >>> text = "a && b; c && d || e; f >'abc'; (def \"ghi\")"
   >>> s = shlex.shlex(text, posix=True)
   >>> s.whitespace_split = True
   >>> list(s)
   ['a', '&&', 'b;', 'c', '&&', 'd', '||', 'e;', 'f', '>abc;', '(def', 'ghi)']
   >>> s = shlex.shlex(text, posix=True, punctuation_chars=True)
   >>> s.whitespace_split = True
   >>> list(s)
   ['a', '&&', 'b', ';', 'c', '&&', 'd', '||', 'e', ';', 'f', '>', 'abc', ';',
   '(', 'def', 'ghi', ')']

Of course, tokens will be returned which are not valid for shells, and you'll
need to implement your own error checks on the returned tokens.

Instead of passing ``True`` as the value for the *punctuation_chars* parameter,
you can pass a string with specific characters, which will be used to determine
which characters constitute punctuation.
For example::

    >>> import shlex
    >>> s = shlex.shlex("a && b || c", punctuation_chars="|")
    >>> list(s)
    ['a', '&', '&', 'b', '||', 'c']

.. note:: When ``punctuation_chars`` is specified, the :attr:`~shlex.wordchars`
   attribute is augmented with the characters ``~-./*?=``. That is because these
   characters can appear in file names (including wildcards) and command-line
   arguments (e.g. ``--color=auto``). Hence::

      >>> import shlex
      >>> s = shlex.shlex('~/a && b-c --color=auto || d *.py?',
      ...                 punctuation_chars=True)
      >>> list(s)
      ['~/a', '&&', 'b-c', '--color=auto', '||', 'd', '*.py?']

   However, to match the shell as closely as possible, it is recommended to
   always use ``posix`` and :attr:`~shlex.whitespace_split` when using
   :attr:`~shlex.punctuation_chars`, which will negate
   :attr:`~shlex.wordchars` entirely.

For best effect, ``punctuation_chars`` should be set in conjunction with
``posix=True``. (Note that ``posix=False`` is the default for
:class:`~shlex.shlex`.)
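
As a rough illustration of the recommended combination (``posix=True``,
``punctuation_chars=True`` and :attr:`~shlex.whitespace_split`), the sketch
below groups the tokens of a command line into pipeline stages. The helper name
``split_pipeline`` is hypothetical and not part of :mod:`shlex`; the sketch
assumes Python 3.8 or later, where :attr:`~shlex.whitespace_split` and
:attr:`~shlex.punctuation_chars` can be used together:

```python
import shlex


def split_pipeline(command_line):
    """Group the tokens of command_line into stages separated by '|'.

    Hypothetical helper for illustration only; it is not part of shlex.
    """
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    stages = [[]]
    for token in lexer:
        if token == '|':
            # A lone pipe token starts a new stage. Runs such as '||'
            # come back as one token and are kept as ordinary tokens.
            stages.append([])
        else:
            stages[-1].append(token)
    return stages


print(split_pipeline("grep -i 'hello world' notes.txt | sort | uniq -c"))
# [['grep', '-i', 'hello world', 'notes.txt'], ['sort'], ['uniq', '-c']]
```

Because POSIX mode strips quotes, this sketch cannot tell a quoted ``'|'``
argument from a real pipe operator, so it should be read as a starting point
rather than a shell parser.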