Character encodings for beginners (2023)

Question

What is a character encoding, and why should I care?

Answer

First, why should I care?

If you use anything other than the most basic English text, people may not be able to read the content you create unless you say what character encoding you used.

For example, you may intend the text to look like this:

[Image: sample text displayed as intended]

but it may actually display like this:

[Image: the same text displayed as garbled characters]

Not only does lack of character encoding information spoil the readability of displayed text, but it may mean that your data cannot be found by a search engine, or reliably processed by machines in a number of other ways.

So what's a character encoding?

Words and sentences in text are created from characters. Examples of characters include the Latin letter á or the Chinese ideograph 請 or the Devanagari character ह.

Characters that are needed for a specific purpose are grouped into a character set (also called a repertoire). (To refer to characters in an unambiguous way, each character is associated with a number, called a code point.)

The characters are stored in the computer as one or more bytes.

Basically, you can visualise this by assuming that all characters are stored in computers using a special code, like the ciphers used in espionage. A character encoding provides the key to unlock (i.e. crack) the code. It is a set of mappings between the bytes in the computer and the characters in the character set. Without the key, the data looks like garbage.

So, when you input text using a keyboard or in some other way, the character encoding maps the characters you choose to specific bytes in computer memory. To display the text, it maps the bytes back into characters.
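
You can try out the idea of a "key" for yourself. The following is a minimal Python sketch, not a definitive recipe; the sample word and the variable names are only illustrative. It turns text into bytes using one encoding, then reads the bytes back with the right key and with the wrong one.

```python
# Illustrative sample text; any string with a non-ASCII character would do.
text = "café"

# Encoding: characters -> bytes, using UTF-8 as the key.
data = text.encode("utf-8")

# Decoding with the same key recovers the original characters.
right = data.decode("utf-8")

# Decoding with a different key (here ISO-8859-1) raises no error,
# but the result is garbage.
wrong = data.decode("iso-8859-1")

print(data)   # b'caf\xc3\xa9'
print(right)  # café
print(wrong)  # cafÃ©
```

Note that the wrong key does not cause a failure here. The bytes are simply read as different characters, which is why this kind of problem often goes unnoticed until someone looks at the text.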

Unfortunately, there are many different character sets and character encodings, i.e. many different ways of mapping between bytes, code points and characters. The section Additional information provides a little more detail for those who are interested.

Most of the time, however, you will not need to know the details. You will just need to be sure that you consider the advice in the section How does this affect me? below.

How do fonts fit into this?

A font is a collection of glyph definitions, i.e. definitions of the shapes used to display characters.

Once your browser or app has worked out what characters it is dealing with, it will then look in the font for glyphs it can use to display or print those characters. (Of course, if the encoding information was wrong, it will be looking up glyphs for the wrong characters.)

A given font will usually cover a single character set, or in the case of a large character set like Unicode, just a subset of all the characters in the set. If your font doesn't have a glyph for a particular character, some browsers or software applications will look for the missing glyphs in other fonts on your system (which will mean that the glyph will look different from the surrounding text, like a ransom note). Otherwise you will typically see a square box, a question mark or some other character instead. For example:

[Image: text in which characters without a glyph appear as boxes or other substitute characters]

How does this affect me?

As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things. Using Unicode throughout your system also removes the need to track and convert between various character encodings.

Content authors need to find out how to declare the character encoding used for the document format they are working with.

Note that just declaring a different encoding in your page won't change the bytes; you need to save the text in that encoding too.
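
To make the point concrete, here is a small Python sketch, assuming the text was originally saved as ISO-8859-1. The function name transcode is hypothetical and chosen only for this illustration. Converting text means decoding the bytes with the encoding they were really saved in, and then encoding the characters again with the new one.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from one character encoding to another."""
    # Decode with the encoding the bytes were really saved in,
    # then encode the resulting characters with the new one.
    return data.decode(source).encode(target)


# The letter é saved as ISO-8859-1 is the single byte 0xE9.
latin1_bytes = "\u00e9".encode("iso-8859-1")

# Merely relabelling those bytes as UTF-8 does not work:
# 0xE9 on its own is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    relabel_ok = True
except UnicodeDecodeError:
    relabel_ok = False

utf8_bytes = transcode(latin1_bytes, "iso-8859-1")

print(relabel_ok)        # False
print(utf8_bytes.hex())  # c3a9
```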

As a content author, you need to check what encoding your editor or scripts are saving text in, and how to save text in UTF-8. (It's usually the default these days.) You may also need to check that your server is serving documents with the right HTTP declarations.
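
One rough way to check what your editor or script produced is to test whether the saved bytes decode cleanly as UTF-8. The sketch below is only that, a sketch: the function name is_valid_utf8 is made up for this example, and a successful decode is a strong hint rather than proof that the text was meant to be UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word saved in two different encodings.
saved_as_utf8 = "Gr\u00fc\u00dfe".encode("utf-8")
saved_as_latin1 = "Gr\u00fc\u00dfe".encode("iso-8859-1")

print(is_valid_utf8(saved_as_utf8))    # True
print(is_valid_utf8(saved_as_latin1))  # False
```

For a real file you could pass in the result of pathlib.Path(name).read_bytes(), where name is the path to your document.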

Developers need to ensure that the various parts of the system can communicate with each other, understand which character encodings are being used, and support all the necessary encodings and characters. (Ideally, you would use UTF-8 throughout, and be spared this trouble.)

The links below provide some further reading on these topics.

Additional information

This section provides a little additional information on mapping between bytes, code points and characters for those who are interested. Feel free to just skip to the section Further reading.

In the coded character set called ISO8859-1 (also known as Latin1) the decimal code point value for the letter é is 233. However, in ISO8859-5, the same code point represents the Cyrillic character щ.

These character sets contain fewer than 256 characters and map code points to byte values directly, so a code point with the value 233 is represented by a single byte with a value of 233. Note that it is only the context that determines whether that byte represents either é or щ.
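
The following short Python sketch shows the same byte read in the two contexts. It relies on Python's built-in codecs for these two encodings; the variable names are only illustrative.

```python
# A single byte with the value 233.
byte = bytes([233])

# The context (the encoding we assume) decides which character it is.
as_latin1 = byte.decode("iso-8859-1")
as_cyrillic = byte.decode("iso-8859-5")

print(as_latin1)    # é
print(as_cyrillic)  # щ
```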

There are other ways of handling characters from a range of scripts. For example, with the Unicode character set, you can represent both characters in the same set. In fact, Unicode contains, in a single set, probably all the characters you are likely to ever need. While the letter é is still represented by the code point value 233, the Cyrillic character щ now has a code point value of 1097.

On the other hand, 1097 is too large a number to be represented by a single byte. So, if you use the character encoding for Unicode text called UTF-8, щ will be represented by two bytes. However, the code point value is not simply derived from the value of the two bytes spliced together – some more complicated decoding is needed.
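
For the curious, the decoding for a two-byte sequence can be sketched in a few lines of Python. This is an illustration of the two-byte case only, not a full UTF-8 decoder, and the variable names are made up for the example.

```python
encoded = "\u0449".encode("utf-8")  # щ becomes two bytes: 0xD1 0x89
first, second = encoded

# Splicing the two bytes together gives 0xD189, which is not the code point.
spliced = (first << 8) | second

# A two-byte UTF-8 sequence has the bit pattern 110xxxxx 10yyyyyy.
# The code point is formed from the x and y bits only.
code_point = ((first & 0b00011111) << 6) | (second & 0b00111111)

print(encoded.hex())  # d189
print(spliced)        # 53641
print(code_point)     # 1097
```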

Other Unicode characters map to one, three or four bytes in the UTF-8 encoding.

Furthermore, note that the letter é is also represented by two bytes in UTF-8, not the single byte used in ISO8859-1. (Only ASCII characters are encoded with a single byte in UTF-8.)
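
A quick way to see these lengths is to count the bytes Python produces for a few sample characters. The samples below are chosen only for illustration, one for each possible UTF-8 length.

```python
# Illustrative samples: one character for each UTF-8 sequence length.
samples = {
    "A": "A",               # ASCII letter
    "é": "\u00e9",          # Latin letter with an accent
    "क": "\u0915",          # Devanagari letter
    "emoji": "\U0001F600",  # a character with a code point above 65535
}

utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# For comparison, é takes a single byte in ISO-8859-1.
latin1_length = len("\u00e9".encode("iso-8859-1"))

print(utf8_lengths)   # {'A': 1, 'é': 2, 'क': 3, 'emoji': 4}
print(latin1_length)  # 1
```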

UTF-8 is the most widely used way to represent Unicode text in web pages, and you should always use UTF-8 when creating your web pages and databases. But, in principle, UTF-8 is only one of the possible ways of encoding Unicode characters. In other words, a single code point in the Unicode character set can actually be mapped to different byte sequences, depending on which encoding was used for the document. Unicode code points could be mapped to bytes using any one of the encodings called UTF-8, UTF-16 or UTF-32. The Devanagari character क, with code point 2325 (which is 915 in hexadecimal notation), will be represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15).
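
The byte sequences quoted above can be reproduced with a few lines of Python. One assumption to note: the sketch uses the big-endian codec variants ("utf-16-be" and "utf-32-be") so that the bytes come out in the order shown and without a byte order mark.

```python
ka = chr(2325)  # the Devanagari character क, code point 915 in hexadecimal

# The "-be" (big-endian) codecs do not prepend a byte order mark.
utf16 = ka.encode("utf-16-be").hex()
utf8 = ka.encode("utf-8").hex()
utf32 = ka.encode("utf-32-be").hex()

# The little-endian variant stores the same two bytes in the opposite order.
utf16_le = ka.encode("utf-16-le").hex()

print(hex(ord(ka)))        # 0x915
print(utf16, utf8, utf32)  # 0915 e0a495 00000915
print(utf16_le)            # 1509
```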

There can be further complications beyond those described in this section (such as byte order and escape sequences), but the detail described here shows why it is important that the application you are working with knows which character encoding is appropriate for your data, and knows how to handle that encoding.

Further reading

The article Character encodings: Essential concepts provides some gentle introductions to related topics, such as Unicode, UTF-8, Character sets, coded character sets, and encodings, the document character set, character escapes and the HTTP header.

  • Getting started? Introducing Character Sets and Encodings – Points you to other W3C documents related to character sets and encodings

  • Tutorial, Handling character encodings in HTML and CSS – Advice on how to choose an encoding, declare it, and other related topics for HTML and CSS.

  • Setting the HTTP charset parameter – Working with character encoding declarations on the server or in scripting languages. http://www.w3.org/International/O-HTTP-charset

  • Setting encoding in web authoring applications – How to get your editor to save in a different encoding for a list of editing environments.

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – by Joel Spolsky (takes you a little further into character encodings, but still gently)

  • Related links, Authoring HTML & CSS

    • Characters
    • Declaring the character encoding for HTML

FAQs

What character encoding should I use? ›

As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things.

What are the 3 types of character encoding? ›

There are three different Unicode character encodings: UTF-8, UTF-16 and UTF-32. Of these three, only UTF-8 should be used for Web content.

What is character encoding in simple words? ›

Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

Should I use UTF-8 or ISO 8859? ›

Most libraries that don't hold a lot of foreign-language materials will be perfectly fine with the ISO8859-1 encoding (also called Latin-1 or extended ASCII), but if you do have a lot of foreign-language materials you should choose UTF-8, since it provides access to many more characters.

What is the most used character encoding? ›

UTF-8 (Unicode Transformation Format, 8-bit) is now the most widely used character encoding on the web. It is one of the ways of mapping Unicode code points to bytes.

What is the most common encoding? ›

UTF-8 is the most commonly used encoding scheme on today's computer systems and computer networks.

What is encoding very short answer? ›

To encode is to convert something, such as a body of information, from one system of communication into another; especially, to convert a message into code.

What is encoding with example? ›

Encode means to change something into a programming code. For instance, changing a letter into the binary code for that letter or changing an analog sound into a digital file.

Is UTF-8 outdated? ›

No. UTF-8 itself is not outdated. What is deprecated is a MySQL character set name: in MySQL 8.0, utf8 is an alias for utf8mb3, but it is deprecated as such, and utf8 is expected subsequently to become a reference to utf8mb4.

What characters are not allowed in UTF-8? ›

The byte values 0xC0, 0xC1 and 0xF5 through 0xFF (that is, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE and 0xFF) are invalid UTF-8 code units: they never appear anywhere in well-formed UTF-8.
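
If you would like to check this list yourself, one approach is to encode every Unicode code point and record which byte values ever turn up. The Python sketch below does that by brute force, so it takes a moment to run. It skips the surrogate range, which cannot be encoded in UTF-8.

```python
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogate code points cannot be encoded in UTF-8
    used.update(chr(cp).encode("utf-8"))

# Byte values that no encoded code point ever produced.
never_used = sorted(set(range(256)) - used)

print([hex(b) for b in never_used])
```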

Can UTF-8 handle all characters? ›

Yes. UTF-8 can represent every character in Unicode, using between one and four bytes per character. It extends ASCII rather than replacing it: the 128 ASCII characters, printable and non-printable, keep their single-byte values, and all other characters use sequences of two to four bytes.

How do you determine the encoding type? ›

One approach works in stages. First, an encoding is sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII. Failing that, an encoding is sniffed by the chardet library, if you have it installed.
...
The output is the encoding name, for example:
  1. iso-8859-1
  2. us-ascii
  3. utf-8
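
As a rough illustration of looking at the first few bytes, the Python sketch below checks for a byte order mark (BOM) at the start of the data. The function name sniff_bom is hypothetical, and this is not how any particular library does it. Many UTF-8 files have no BOM at all, in which case the function simply returns None.

```python
import codecs

# Longer signatures come first: the UTF-32 little-endian BOM begins with
# the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return an encoding name if the data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8
print(sniff_bom(b"hello"))                    # None
```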

What type of encoding is ASCII? ›

ASCII encoding is based on character encoding used for telegraph data. The American National Standards Institute first published it as a standard for computing in 1963. Characters in ASCII encoding include upper- and lowercase letters A through Z, numerals 0 through 9 and basic punctuation symbols.

What type of encoding is UTF-32? ›

UTF-32 is an encoding of Unicode in which each character is composed of 4 bytes. The IBM® i operating system does not support UTF-32 encoding with a CCSID value. Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts.

Where is encoding used? ›

So, encoding is the method or process of converting a series of characters, i.e, letters, numbers, punctuation, and symbols into a special or unique format for transmission or storage in computers.

Is ASCII same as UTF-8? ›

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round-trip migration. Other Unicode characters are represented in UTF-8 by sequences of two to four bytes (the original design allowed up to six, but the current standard limits sequences to four); most Western European characters require only 2 bytes.

Why does UTF-32 exist? ›

A Unicode code point needs at most 21 bits, which would fit in 24, but working with 32-bit quantities is much more common than working with 24-bit quantities and is usually better supported. It also helps keep each character 4-byte aligned (assuming the entire string is 4-byte aligned).

Is Unicode better than ASCII? ›

The major disadvantage of ASCII is that it can represent only 128 different characters, as it uses only 7 bits. ASCII cannot be used to encode the many types of characters found around the world. Unicode can, and its encodings UTF-8, UTF-16 and UTF-32 store those characters as bytes.

What is difference between Unicode and UTF-8? ›

Unicode is a character set: a list of characters, each with a unique number (its code point). UTF-8 is an encoding: one way of turning those code points into bytes.

Why is it called UTF-16? ›

UTF-16 (16-bit Unicode Transformation Format) is a character encoding capable of encoding all 1,112,064 valid code points of Unicode (in fact this number of code points is dictated by the design of UTF-16). The encoding is variable-length, as code points are encoded with one or two 16-bit code units.

Why does UTF-16 exist? ›

UTF-16 allows all of the basic multilingual plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs. The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level.
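
The arithmetic behind a surrogate pair is short enough to sketch in Python. The function name to_surrogate_pair is made up for this example, and the emoji code point is just a convenient sample of a character beyond U+FFFF.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into two 16-bit UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)

print(hex(high), hex(low))                       # 0xd83d 0xde00
print("\U0001F600".encode("utf-16-be").hex())    # d83dde00
```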

Should I use UTF-8 or UTF-16? ›

There is a simple rule of thumb on which Unicode Transformation Format (UTF) to use: UTF-8 for storage and communication; UTF-16 for data processing; and possibly UTF-32 if most of the platform APIs you use are UTF-32 (common in the UNIX world).

Can ASCII be read as UTF-8? ›

Yes. UTF-8 was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.

Why is UTF-8 so popular? ›

UTF-8 is currently the most popular encoding method on the internet because it can efficiently store text containing any character. UTF-16 is another encoding method, but is less efficient for storing text files (except for those written in certain non-English languages).

Is ASCII or UTF-8 more efficient? ›

There's no difference between ASCII and UTF-8 when storing digits. A tighter packing would be using 4 bits per digit (BCD).

Is UTF-16 better than UTF-8? ›

UTF-16 is only more efficient than UTF-8 on some non-English websites. If a website uses a language with characters farther back in the Unicode repertoire (most Chinese, Japanese and Korean characters, for example), UTF-8 will encode those characters as three bytes each, whereas UTF-16 encodes the same characters as only two bytes.
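
A quick size comparison shows the trade-off. The two sample strings below are only illustrative, and the helper name sizes is made up for this sketch. It uses the "utf-16-le" codec so that no byte order mark is counted.

```python
def sizes(text: str) -> tuple:
    """Return the number of bytes the text needs in UTF-8 and in UTF-16."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


english = "Hello, world"        # 12 ASCII characters
japanese = "こんにちは世界"      # 7 Japanese characters

english_sizes = sizes(english)
japanese_sizes = sizes(japanese)

print(english_sizes)   # (12, 24)
print(japanese_sizes)  # (21, 14)
```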

Why is ASCII not enough for the world? ›

The ASCII character set is barely large enough for US English use, lacks many glyphs common in typesetting, and is far too small for universal use.

What is a disadvantage of ASCII? ›

The main disadvantage is its limited character set. Even with extended ASCII, only 256 distinct characters can be represented. The characters in standard ASCII are enough only for English-language communication.

Which is better ASCII or Unicode? ›

Unicode is the universal character encoding used to process, store and facilitate the interchange of text data in any language while ASCII is used for the representation of text such as symbols, letters, digits, etc.

Is ASCII valid UTF-8? ›

UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.
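
This compatibility is easy to confirm with a small Python sketch (the variable names are only illustrative). It also shows that the reverse does not hold: text containing non-ASCII characters cannot be saved as ASCII.

```python
# Every ASCII byte value, 0 to 127.
ascii_bytes = bytes(range(128))

# Reading those bytes as ASCII or as UTF-8 gives the same characters.
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# Saving ASCII-only text as ASCII or as UTF-8 gives the same bytes.
same_bytes = "Hello".encode("ascii") == "Hello".encode("utf-8")

# The reverse does not hold: é has no ASCII encoding.
try:
    "\u00e9".encode("ascii")
    non_ascii_ok = True
except UnicodeEncodeError:
    non_ascii_ok = False

print(same_text, same_bytes, non_ascii_ok)  # True True False
```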

Article information

Author: Aron Pacocha

Last Updated: 02/10/2023
