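| Name | Date | Size | #Lines | LOC |
|------|------|------|--------|-----|
| .. | 03-May-2022 | - | | |
| R/ | 24-Jul-2021 | - | 802 | 577 |
| build/ | 03-May-2022 | - | | |
| inst/doc/ | 24-Jul-2021 | - | 478 | 350 |
| man/ | 24-Jul-2021 | - | 517 | 441 |
| src/ | 24-Jul-2021 | - | 37,238 | 30,433 |
| tests/ | 24-Jul-2021 | - | 1,373 | 1,058 |
| vignettes/ | 24-Jul-2021 | - | 478 | 350 |
| DESCRIPTION | 24-Jul-2021 | 1.2 KiB | 35 | 34 |
| LICENSE | 24-Jul-2021 | 11.1 KiB | 203 | 169 |
| MD5 | 24-Jul-2021 | 6.1 KiB | 106 | 105 |
| NAMESPACE | 24-Jul-2021 | 238 | 14 | 12 |
| README.md | 24-Jul-2021 | 10.3 KiB | 108 | 67 |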
README.md

<!-- README.md is generated from README.Rmd. Please edit that file -->

# utf8

<!-- badges: start -->

[![rcc](https://github.com/patperry/r-utf8/workflows/rcc/badge.svg)](https://github.com/patperry/r-utf8/actions) [![Coverage Status](https://codecov.io/github/patperry/r-utf8/coverage.svg?branch=master "Code Coverage")](https://codecov.io/github/patperry/r-utf8?branch=master "Code Coverage") [![CRAN Status](https://www.r-pkg.org/badges/version/utf8 "CRAN Page")](https://cran.r-project.org/package=utf8 "CRAN Page") [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg "Apache License, Version 2.0")](https://www.apache.org/licenses/LICENSE-2.0.html "Apache License, Version 2.0") [![CRAN RStudio Mirror Downloads](https://cranlogs.r-pkg.org/badges/utf8 "CRAN Downloads")](https://cran.r-project.org/package=utf8 "CRAN Page")

<!-- badges: end -->

*utf8* is an R package for manipulating and printing UTF-8 text that fixes [multiple](https://twitter.com/ptrckprry/status/901494853758054401 "Windows enc2utf8 Bug") [bugs](https://twitter.com/ptrckprry/status/887732831161425920 "MacOS Emoji Printing") in R’s UTF-8 handling.

## Installation

### Stable version

*utf8* is [available on CRAN](https://cran.r-project.org/package=utf8 "CRAN Page"). To install the latest released version, run the following command in R:

<pre class='chroma'>
<span class='nf'><a href='https://rdrr.io/r/utils/install.packages.html'>install.packages</a></span><span class='o'>(</span><span class='s'>"utf8"</span><span class='o'>)</span></pre>

### Development version

To install the latest development version, run the following:

<pre class='chroma'>
<span class='nf'>devtools</span><span class='nf'>::</span><span class='nf'><a href='https://devtools.r-lib.org//reference/remote-reexports.html'>install_github</a></span><span class='o'>(</span><span class='s'>"patperry/r-utf8"</span><span class='o'>)</span></pre>

## Usage

<pre class='chroma'>
<span class='kr'><a href='https://rdrr.io/r/base/library.html'>library</a></span><span class='o'>(</span><span class='nv'><a href='https://github.com/patperry/r-utf8'>utf8</a></span><span class='o'>)</span></pre>

### Validate character data and convert to UTF-8

Use [`as_utf8()`](https://rdrr.io/pkg/utf8/man/as_utf8.html) to validate input text and convert to UTF-8 encoding. The function alerts you if the input text has the wrong declared encoding:

<pre class='chroma'>
<span class='c'># second entry is encoded in latin-1, but declared as UTF-8</span>
<span class='nv'>x</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"fa\u00E7ile"</span>, <span class='s'>"fa\xE7ile"</span>, <span class='s'>"fa\xC3\xA7ile"</span><span class='o'>)</span>
<span class='nf'><a href='https://rdrr.io/r/base/Encoding.html'>Encoding</a></span><span class='o'>(</span><span class='nv'>x</span><span class='o'>)</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"UTF-8"</span>, <span class='s'>"UTF-8"</span>, <span class='s'>"bytes"</span><span class='o'>)</span>
<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/as_utf8.html'>as_utf8</a></span><span class='o'>(</span><span class='nv'>x</span><span class='o'>)</span> <span class='c'># fails</span>
<span class='c'>#&gt; Error in as_utf8(x): entry 2 has wrong Encoding; marked as "UTF-8" but leading byte 0xE7 followed by invalid continuation byte (0x69) at position 4</span>

<span class='c'># mark the correct encoding</span>
<span class='nf'><a href='https://rdrr.io/r/base/Encoding.html'>Encoding</a></span><span class='o'>(</span><span class='nv'>x</span><span class='o'>[</span><span class='m'>2</span><span class='o'>]</span><span class='o'>)</span> <span class='o'>&lt;-</span> <span class='s'>"latin1"</span>
<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/as_utf8.html'>as_utf8</a></span><span class='o'>(</span><span class='nv'>x</span><span class='o'>)</span> <span class='c'># succeeds</span>
<span class='c'>#&gt; [1] "façile" "façile" "façile"</span></pre>
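
The same check can be done up front with `utf8_valid()`, which reports on each entry individually rather than raising an error. The following is a minimal sketch: the variable names `ok`, `bad`, and `y` are illustrative, and re-marking every rejected entry as latin-1 is an assumption that holds for this example only, not a general repair strategy.

```r
library(utf8)

# second entry is encoded in latin-1, but declared as UTF-8
x <- c("fa\u00E7ile", "fa\xE7ile", "fa\xC3\xA7ile")
Encoding(x) <- c("UTF-8", "UTF-8", "bytes")

ok <- utf8_valid(x)   # one logical flag per entry
bad <- which(!ok)     # indices of entries whose declared encoding is wrong

# ASSUMPTION (this example only): every rejected entry is really latin-1
Encoding(x[bad]) <- "latin1"

y <- as_utf8(x)       # conversion should now go through
```

In real data the correct encoding for a rejected entry has to come from knowledge of the source; `utf8_valid()` only tells you which entries need attention.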

### Normalize data

Use [`utf8_normalize()`](https://rdrr.io/pkg/utf8/man/utf8_normalize.html) to convert to Unicode composed normal form (NFC). Optionally apply compatibility maps for NFKC normal form or case-fold.

<pre class='chroma'>
<span class='c'># three ways to encode an angstrom character</span>
<span class='o'>(</span><span class='nv'>angstrom</span> <span class='o'>&lt;-</span> <span class='nf'><a href='https://rdrr.io/r/base/c.html'>c</a></span><span class='o'>(</span><span class='s'>"\u00c5"</span>, <span class='s'>"\u0041\u030a"</span>, <span class='s'>"\u212b"</span><span class='o'>)</span><span class='o'>)</span>
<span class='c'>#&gt; [1] "Å" "Å" "Å"</span>
<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/utf8_normalize.html'>utf8_normalize</a></span><span class='o'>(</span><span class='nv'>angstrom</span><span class='o'>)</span> <span class='o'>==</span> <span class='s'>"\u00c5"</span>
<span class='c'>#&gt; [1] TRUE TRUE TRUE</span>

<span class='c'># perform full Unicode case-folding</span>
<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/utf8_normalize.html'>utf8_normalize</a></span><span class='o'>(</span><span class='s'>"Größe"</span>, map_case <span class='o'>=</span> <span class='kc'>TRUE</span><span class='o'>)</span>
<span class='c'>#&gt; [1] "grösse"</span>

<span class='c'># apply compatibility maps to NFKC normal form</span>
<span class='c'># (example from https://twitter.com/aprilarcus/status/367557195186970624)</span>
<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/utf8_normalize.html'>utf8_normalize</a></span><span class='o'>(</span><span class='s'>"𝖸𝗈 𝐔𝐧𝐢𝐜𝐨𝐝𝐞 𝗅 𝗁𝖾𝗋𝖽 𝕌 𝗅𝗂𝗄𝖾 𝑡𝑦𝑝𝑒𝑓𝑎𝑐𝑒𝑠 𝗌𝗈 𝗐𝖾 𝗉𝗎𝗍 𝗌𝗈𝗆𝖾 𝚌𝚘𝚍𝚎𝚙𝚘𝚒𝚗𝚝𝚜 𝗂𝗇 𝗒𝗈𝗎𝗋 𝔖𝔲𝔭𝔭𝔩𝔢𝔪𝔢𝔫𝔱𝔞𝔯𝔶 𝔚𝔲𝔩𝔱𝔦𝔩𝔦𝔫𝔤𝔳𝔞𝔩 𝔓𝔩𝔞𝔫𝔢 𝗌𝗈 𝗒𝗈𝗎 𝖼𝖺𝗇 𝓮𝓷𝓬𝓸𝓭𝓮 𝕗𝕠𝕟𝕥𝕤 𝗂𝗇 𝗒𝗈𝗎𝗋 𝒇𝒐𝒏𝒕𝒔."</span>,
               map_compat <span class='o'>=</span> <span class='kc'>TRUE</span><span class='o'>)</span>
<span class='c'>#&gt; [1] "Yo Unicode l herd U like typefaces so we put some codepoints in your Supplementary Wultilingval Plane so you can encode fonts in your fonts."</span></pre>
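
One common use of normalization is comparing strings that look the same but differ in their underlying code points. The sketch below wraps `utf8_normalize()` in a hypothetical helper, `same_text()`, which is not part of the package. It assumes that NFC normalization plus case folding is an acceptable notion of equality for your data.

```r
library(utf8)

# hypothetical helper (not part of utf8): compare after NFC + case folding
same_text <- function(a, b) {
  utf8_normalize(a, map_case = TRUE) == utf8_normalize(b, map_case = TRUE)
}

# precomposed vs. decomposed spelling of the same word
composed_vs_decomposed <- same_text("\u00c5ngstr\u00f6m", "A\u030angstro\u0308m")

# sharp s vs. upper-case "SS" spelling
folded <- same_text("Gr\u00f6\u00dfe", "GR\u00d6SSE")
```

Applying the same normalization to both sides before comparing, or before using strings as lookup keys, keeps the comparison symmetric.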

### Print emoji

On some platforms (including macOS), the R implementation of [`print()`](https://rdrr.io/r/base/print.html) uses an outdated version of the Unicode standard to determine which characters are printable. Use [`utf8_print()`](https://rdrr.io/pkg/utf8/man/utf8_print.html) for an updated print function:

<pre class='chroma'>
<span class='nf'><a href='https://rdrr.io/r/base/print.html'>print</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/utf8Conversion.html'>intToUtf8</a></span><span class='o'>(</span><span class='m'>0x1F600</span> <span class='o'>+</span> <span class='m'>0</span><span class='o'>:</span><span class='m'>79</span><span class='o'>)</span><span class='o'>)</span> <span class='c'># with default R print function</span>
<span class='c'>#&gt; [1] "😀😁😂😃😄😅😆😇😈😉😊😋😌😍😎😏😐😑😒😓😔😕😖😗😘😙😚😛😜😝😞😟😠😡😢😣😤😥😦😧😨😩😪😫😬😭😮😯😰😱😲😳😴😵😶😷😸😹😺😻😼😽😾😿🙀🙁🙂🙃🙄🙅🙆🙇🙈🙉🙊🙋🙌🙍🙎🙏"</span>

<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/utf8_print.html'>utf8_print</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/utf8Conversion.html'>intToUtf8</a></span><span class='o'>(</span><span class='m'>0x1F600</span> <span class='o'>+</span> <span class='m'>0</span><span class='o'>:</span><span class='m'>79</span><span class='o'>)</span><span class='o'>)</span> <span class='c'># with utf8_print, truncates line</span>
<span class='c'>#&gt; [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​…"</span>

<span class='nf'><a href='https://rdrr.io/pkg/utf8/man/utf8_print.html'>utf8_print</a></span><span class='o'>(</span><span class='nf'><a href='https://rdrr.io/r/base/utf8Conversion.html'>intToUtf8</a></span><span class='o'>(</span><span class='m'>0x1F600</span> <span class='o'>+</span> <span class='m'>0</span><span class='o'>:</span><span class='m'>79</span><span class='o'>)</span>, chars <span class='o'>=</span> <span class='m'>1000</span><span class='o'>)</span> <span class='c'># higher character limit</span>
<span class='c'>#&gt; [1] "😀​😁​😂​😃​😄​😅​😆​😇​😈​😉​😊​😋​😌​😍​😎​😏​😐​😑​😒​😓​😔​😕​😖​😗​😘​😙​😚​😛​😜​😝​😞​😟​😠​😡​😢​😣​😤​😥​😦​😧​😨​😩​😪​😫​😬​😭​😮​😯​😰​😱​😲​😳​😴​😵​😶​😷​😸​😹​😺​😻​😼​😽​😾​😿​🙀​🙁​🙂​🙃​🙄​🙅​🙆​🙇​🙈​🙉​🙊​🙋​🙌​🙍​🙎​🙏​"</span></pre>
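
The `chars` argument can also be lowered to shorten the output. The following sketch uses only the `chars` and `quote` arguments of `utf8_print()`; the variable name `faces` is illustrative. According to the package documentation, wide characters such as emoji count as two character units, so check `?utf8_print` if the exact truncation point matters.

```r
library(utf8)

faces <- intToUtf8(0x1F600 + 0:79)  # a single string of 80 emoji

# limit the output to 20 character units; longer strings get an ellipsis
utf8_print(faces, chars = 20)

# same limit, without the surrounding quotes
utf8_print(faces, chars = 20, quote = FALSE)
```

Leaving `chars` unset lets the output follow the console width, which is usually the right choice for interactive use.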

## Citation

Cite *utf8* with the following BibTeX entry:

    @Manual{,
      title = {utf8: Unicode Text Processing},
      author = {Patrick O. Perry},
      year = {2018},
      note = {R package version 1.1.4},
      url = {https://github.com/patperry/r-utf8},
    }

## Contributing

The project maintainer welcomes contributions in the form of feature requests, bug reports, comments, unit tests, vignettes, or other code. If you’d like to contribute, either

-   fork the repository and submit a pull request;

-   [file an issue](https://github.com/patperry/r-utf8/issues "Issues");

-   or contact the maintainer via e-mail.

This project is released with a [Contributor Code of Conduct](https://github.com/patperry/r-utf8/blob/master/CONDUCT.md "Contributor Code of Conduct"), and if you choose to contribute, you must adhere to its terms.
