Alex Farach | Let's start at the beginning - bits to character encoding in R | RStudio (2022)
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, my name is Alex Farach. Welcome to Let's Start at the Beginning: Bits to Character Encoding in R. I'm a data scientist and analytics manager at Accenture Federal Services. I spend a lot of time thinking about R, NLP, data viz, and statistical learning. My most recent work is on the Hugging Face R package, which brings Hugging Face models and tools to R. So how would you describe the letter A to a computer? Today, I'll discuss how. In doing so, we'll get into the shallow end of a deep pool known as bits to character encoding. And we'll talk a little bit about what that looks like in R.
What is character encoding?
So what do we mean by bits to character encoding, especially in relation to NLP? Well, computers don't speak the same language as we do. If we want to use computers to process language, we need some way to encode language so that computers understand it, and vice versa. The encoding and decoding process needs to be identical on both ends, or the human-computer translation falls apart. The agreed-upon method is called a protocol.
So back to the question: how would you describe the letter A to a computer? We do so by using numbers, of course, to represent letters. So instead of A we could have one, B could be two, and so on. We assign a number to represent each unique character or symbol we want to communicate. Computers store numbers in binary, and all signals inside a computer have two possible values, zero or one. Each of these zero-or-one pieces of information is called a bit. One bit can represent two values, two bits can represent four, three bits can represent eight, and so on. So how many bits are needed to represent 256 values? Using the logic just discussed, it's eight bits.
A brief history of ASCII, Latin-1, and UTF-8
When computers were first developed, the decision was made that a byte was the next standard unit above a bit. A byte is defined as eight bits. In the 1960s, computer developers had one byte to work with. They developed a protocol to fit into this one byte called ASCII. This protocol assigns a unique number to all letters in the English language, plus numbers, symbols, and control characters. Everything fits into this one-byte budget, and even better, only seven of the eight available bits were needed.
Over time, the need to fit more information grew, but luckily there was that extra bit to work with. Things got standardized into Latin-1, which uses the full eight bits available in the single byte. Then, after that, a new standard was developed: UTF-8. Each character in Unicode is identified by a code point, written as the prefix U+ followed by a hexadecimal number. The difference between binary and hexadecimal numbers is out of scope for this brief talk. What's important to note here, though, is that UTF-8 converts a code point, a single character in Unicode, into a set of one to four bytes and can encode all one-million-plus Unicode code points.
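As a quick illustration of that variable-width encoding (my own example, assuming a UTF-8 session as in R 4.2+):

```r
# "A" takes one byte in UTF-8, "é" takes two, and an emoji takes four.
s <- "A\u00e9\U0001F600"
nchar(s, type = "chars")    # 3 characters
nchar(s, type = "bytes")    # 1 + 2 + 4 = 7 bytes

charToRaw("\u00e9")         # c3 a9 -- the two-byte UTF-8 sequence for é
```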
Character encoding in R
Character strings in R can be encoded in Latin-1, UTF-8, or as raw bytes. These declarations can be read or changed using the Encoding() function. The concept of native encoding is important here. Natively encoded strings are strings written in whatever code page the user is using. As of the 4.2.0 release, R uses UTF-8 as the native encoding on Windows. Prior to this release, that was not the case, because Windows versions prior to Windows 10 didn't allow it. Post Windows 10, this was allowed, and R was updated to take advantage of this feature. This is an exciting change for those of us focusing on NLP in R, since it mitigates a lot of potential problems for common NLP tasks.
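Here's a small sketch of reading and declaring encodings with Encoding() (my example, not from the slides; the printed result assumes a UTF-8 locale):

```r
x <- "caf\u00e9"        # created in a UTF-8 session
Encoding(x)             # "UTF-8"

# Encoding()<- changes only the declared encoding, not the underlying bytes:
y <- x
Encoding(y) <- "latin1"
y                       # typically prints "cafÃ©" -- the same bytes, misread as Latin-1
```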
It's also helpful to keep this in mind when using common functions like read.csv(), where the character encoding can be set. On Linux and macOS the native encoding has long been UTF-8, so this isn't an issue there. Native encoding is also why we see "unknown" here. For example, we see "coffee" marked as unknown even though it's clearly ASCII-compliant: ASCII strings will never be marked with a declared encoding, since their representation is the same in all supported encodings. The Sys.getlocale() function will tell you what your native encoding is. The localization information function, l10n_info(), is another helpful one. And lastly, we have iconv(), which uses system facilities to convert a character vector between encodings.
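The functions just mentioned can all be tried at the console (a sketch; the exact locale string varies by system):

```r
# ASCII-only strings stay "unknown" -- they're identical in every encoding.
Encoding("coffee")            # "unknown"

# What is my native encoding?
Sys.getlocale("LC_CTYPE")     # e.g. "en_US.UTF-8"
l10n_info()                   # reports UTF-8 / Latin-1 status of the locale

# Convert a character vector between encodings:
iconv("caf\u00e9", from = "UTF-8", to = "latin1")
```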
Encoding in the tidyverse
So how does the tidyverse tackle encoding? We turn to the stringr library, which is built on top of the extensive stringi package. We can specify the encoding of a string with the str_conv() function, and for fun, let's create a little function here that applies a random encoding via stringi's stri_enc_list() function. And that's it. Thanks for hanging out with me.
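The little function described above isn't reproduced in this transcript, but it might look something like this (my reconstruction: str_conv(), stri_enc_list(), and stri_encode() are real stringr/stringi functions; the helper name is made up, and not every encoding can represent every string):

```r
library(stringr)   # stringr sits on top of stringi
library(stringi)

# Declare/interpret a string's encoding with stringr:
str_conv("caf\u00e9", encoding = "UTF-8")

# A toy helper: pick a random encoding stringi knows about and re-encode into it.
random_encode <- function(s) {
  enc <- sample(stri_enc_list(simplify = TRUE), 1)
  message("Encoding to: ", enc)
  stri_encode(s, from = "UTF-8", to = enc)
}
```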
