<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<title>2.8 Application: Representing Text</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
{ counter-reset: source-line 0; }
pre.numberSource code > span
{ position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
{ content: counter(source-line);
position: relative; left: -1em; text-align: right; vertical-align: baseline;
border: none; display: inline-block;
-webkit-touch-callout: none; -webkit-user-select: none;
-khtml-user-select: none; -moz-user-select: none;
-ms-user-select: none; user-select: none;
padding: 0 4px; width: 4em;
color: #aaaaaa;
}
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
div.sourceCode
{ }
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
</style>
<link rel="stylesheet" href="../tufte.css" />
<script src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js" type="text/javascript"></script>
<!--[if lt IE 9]>
<script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
<![endif]-->
</head>
<body>
<div style="display:none">
\(
\newcommand{\NOT}{\neg}
\newcommand{\AND}{\wedge}
\newcommand{\OR}{\vee}
\newcommand{\XOR}{\oplus}
\newcommand{\IMP}{\Rightarrow}
\newcommand{\IFF}{\Leftrightarrow}
\newcommand{\TRUE}{\text{True}\xspace}
\newcommand{\FALSE}{\text{False}\xspace}
\newcommand{\IN}{\,{\in}\,}
\newcommand{\NOTIN}{\,{\notin}\,}
\newcommand{\TO}{\rightarrow}
\newcommand{\DIV}{\mid}
\newcommand{\NDIV}{\nmid}
\newcommand{\MOD}[1]{\pmod{#1}}
\newcommand{\MODS}[1]{\ (\text{mod}\ #1)}
\newcommand{\N}{\mathbb N}
\newcommand{\Z}{\mathbb Z}
\newcommand{\Q}{\mathbb Q}
\newcommand{\R}{\mathbb R}
\newcommand{\C}{\mathbb C}
\newcommand{\cA}{\mathcal A}
\newcommand{\cB}{\mathcal B}
\newcommand{\cC}{\mathcal C}
\newcommand{\cD}{\mathcal D}
\newcommand{\cE}{\mathcal E}
\newcommand{\cF}{\mathcal F}
\newcommand{\cG}{\mathcal G}
\newcommand{\cH}{\mathcal H}
\newcommand{\cI}{\mathcal I}
\newcommand{\cJ}{\mathcal J}
\newcommand{\cL}{\mathcal L}
\newcommand{\cK}{\mathcal K}
\newcommand{\cN}{\mathcal N}
\newcommand{\cO}{\mathcal O}
\newcommand{\cP}{\mathcal P}
\newcommand{\cQ}{\mathcal Q}
\newcommand{\cS}{\mathcal S}
\newcommand{\cT}{\mathcal T}
\newcommand{\cV}{\mathcal V}
\newcommand{\cW}{\mathcal W}
\newcommand{\cZ}{\mathcal Z}
\newcommand{\emp}{\emptyset}
\newcommand{\bs}{\backslash}
\newcommand{\floor}[1]{\left \lfloor #1 \right \rfloor}
\newcommand{\ceil}[1]{\left \lceil #1 \right \rceil}
\newcommand{\abs}[1]{\left | #1 \right |}
\newcommand{\xspace}{}
\newcommand{\proofheader}[1]{\underline{\textbf{#1}}}
\)
</div>
<header id="title-block-header">
<h1 class="title">2.8 Application: Representing Text</h1>
</header>
<section>
<p>We have mentioned that computers use a series of 0s and 1s to store data. These 0s and 1s represent numbers. So then, how can numbers represent textual data (i.e., a string)? The answer is functions.</p>
<p>Once upon a time, humans interacted with computers through punched paper tape (or simply punched tape). A hole (or the lack of a hole) at a particular location on the tape represented a 0 or a 1 (i.e., binary). Today we would call each 0 or 1 a <strong>bit</strong>. Obviously, this is much more tedious than using our modern input peripherals: keyboards, mice, touch screens, etc. Eventually, a standard for representing characters (e.g., letters, numbers) with holes was settled on. Using only 7 locations on the tape, 128 different characters could be represented (<span class="math inline">\(2^7 = 128\)</span>).</p>
<p>The standard was called ASCII (pronounced ass-key) and it persists to this day. You can think of the ASCII standard as a function with domain <span class="math inline">\(\{0, 1, \dots, 127\}\)</span>, whose codomain is the set of all possible characters. This function is <em>one-to-one</em>, meaning no two numbers map to the same character—this would be redundant for the purpose of encoding the characters. This standard covered all English letters (lowercase and uppercase), digits, punctuation, and various others (e.g., to communicate a new line). For example, the number 65 mapped to the letter <code>'A'</code> and the number 126 mapped to the punctuation mark <code>'~'</code>.</p>
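<p>To make the idea of ASCII as a function concrete, here is a minimal sketch that models a small fragment of the mapping as a Python dictionary. The names <code>ascii_sample</code>, <code>is_one_to_one</code>, and <code>to_seven_bits</code> are hypothetical, chosen only for illustration, and just four of the 128 entries are included.</p>

```python
# A tiny, hand-picked fragment of the ASCII mapping (hypothetical name).
ascii_sample = {65: 'A', 66: 'B', 97: 'a', 126: '~'}


def is_one_to_one(mapping: dict) -> bool:
    """Return whether no two keys of mapping share the same value."""
    return len(set(mapping.values())) == len(mapping)


def to_seven_bits(number: int) -> str:
    """Return the 7-bit pattern (a string of 0s and 1s) for a number in 0..127."""
    return format(number, '07b')


print(is_one_to_one(ascii_sample))   # True
print(to_seven_bits(65))             # 1000001
print(to_seven_bits(126))            # 1111110
```

<p>Each 7-character string corresponds to the holes (or lack of holes) at the 7 locations on the tape described above.</p>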
<p>But what about other languages? Computer scientists extended ASCII from length-7 to length-8 sequences of bits, and hence its domain increased to size 256 (<span class="math inline">\(\{0, 1, \dots, 255\}\)</span>). This allowed “extended ASCII” to support some other characters used in similar Latin-based languages, such as <code>'é'</code> (233), <code>'ö'</code> (246), <code>'€'</code> (128), and other useful symbols like <code>'©'</code> (169) and <code>'½'</code> (189). But what about characters used in very different languages (e.g., Greek, Mandarin, Arabic)?</p>
<p>The latest standard, Unicode, uses <strong>up to 32 bits</strong>, which gives us a domain of <span class="math inline">\(\{0, 1, \dots, 2^{32} - 1\}\)</span>, over 4 billion different numbers. This is in fact far larger than the number of distinct characters in use across all different languages! As a result, there are many <em>unused numbers</em> in the domain of Unicode, and so Unicode is not technically a function defined over all of <span class="math inline">\(\{0, 1, \dots, 2^{32} - 1\}\)</span>.</p>
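<p>One rough way to observe these unused numbers is sketched below, assuming Python 3 and its standard <code>unicodedata</code> module. The helper name <code>is_assigned</code> is hypothetical. Note that Python's <code>chr</code> only accepts numbers up to <code>sys.maxunicode</code> (1,114,111), a much smaller range than \(2^{32}\), and that exactly which numbers are unassigned depends on the Unicode version bundled with your Python.</p>

```python
import sys
import unicodedata


def is_assigned(number: int) -> bool:
    """Return whether number has an assigned character in this Python's Unicode data.

    The Unicode category 'Cn' means "not assigned".
    """
    return unicodedata.category(chr(number)) != 'Cn'


# The largest number that chr accepts in Python 3.
largest = sys.maxunicode

# Count how many numbers in the valid range are actually assigned.
num_assigned = sum(1 for n in range(largest + 1) if is_assigned(n))

print(largest)                        # 1114111
print(num_assigned != largest + 1)    # True: some numbers are unused
```

<p>The count itself is deliberately not shown, since it changes as new characters are added to the standard.</p>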
<p>But with the pervasiveness of the Internet, some of these unused numbers are being used to <a href="https://home.unicode.org/emoji/emoji-frequency/">map to emojis</a>. Of course, this can cause some lost-in-translation issues. The palm tree emoji may appear different on your device than on a friend's. In extreme cases, your friend's device may not show a palm tree at all, or may show a completely different emoji. Adding a new emoji is a two-part process. The first part involves <a href="https://unicode.org/emoji/proposals.html">submitting a proposal for a new emoji</a>. The second part is that computer scientists need to support newly approved emojis by updating their software. And, of course, in order to do that computer scientists need to have a firm understanding of functions!</p>
<h2 id="pythons-unicode-conversion-functions">Python's Unicode conversion functions</h2>
<p>Python has two built-in functions that implement the (partial) mapping between characters and their Unicode number. The first is <code>ord</code>, which takes a single-character string and returns its Unicode number as an <code>int</code>.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">ord</span>(<span class="st">&#39;A&#39;</span>)</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="dv">65</span></span>
<span id="cb1-3"><a href="#cb1-3"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">ord</span>(<span class="st">&#39;é&#39;</span>)</span>
<span id="cb1-4"><a href="#cb1-4"></a><span class="dv">233</span></span>
<span id="cb1-5"><a href="#cb1-5"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">ord</span>(<span class="st">&#39;♥&#39;</span>)</span>
<span id="cb1-6"><a href="#cb1-6"></a><span class="dv">9829</span></span></code></pre></div>
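<p>Since <code>ord</code> accepts only a single character, a longer string has to be converted one character at a time. The following is a minimal sketch of that idea; the function name <code>string_to_numbers</code> is hypothetical.</p>

```python
def string_to_numbers(text: str) -> list:
    """Return the Unicode number of each character in text, in order."""
    return [ord(character) for character in text]


print(string_to_numbers('Ada'))    # [65, 100, 97]
print(string_to_numbers('Hi!'))    # [72, 105, 33]
```

<p>Calling <code>ord</code> directly on a string of length two or more, such as <code>ord('Hi')</code>, raises a <code>TypeError</code>.</p>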
<p>The second is <code>chr</code>, which computes the <em>inverse</em> of <code>ord</code>: given an integer representing a Unicode number, <code>chr</code> returns a string containing the corresponding character.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">chr</span>(<span class="dv">65</span>)</span>
<span id="cb2-2"><a href="#cb2-2"></a><span class="co">&#39;A&#39;</span></span>
<span id="cb2-3"><a href="#cb2-3"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">chr</span>(<span class="dv">233</span>)</span>
<span id="cb2-4"><a href="#cb2-4"></a><span class="co">&#39;é&#39;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">chr</span>(<span class="dv">9829</span>)</span>
<span id="cb2-6"><a href="#cb2-6"></a><span class="co">&#39;♥&#39;</span></span></code></pre></div>
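<p>Because <code>chr</code> is the inverse of <code>ord</code>, converting a character to its number and back should return the original character. The sketch below checks this for a few characters and also rebuilds a string from a list of numbers; the function names <code>round_trips</code> and <code>numbers_to_string</code> are hypothetical.</p>

```python
def round_trips(character: str) -> bool:
    """Return whether converting character to its number and back gives character."""
    return chr(ord(character)) == character


def numbers_to_string(numbers: list) -> str:
    """Return the string whose characters have the given Unicode numbers, in order."""
    return ''.join(chr(number) for number in numbers)


print(all(round_trips(c) for c in 'AZa~'))    # True
print(numbers_to_string([72, 105, 33]))       # Hi!
print(ord(chr(9829)))                         # 9829
```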
<p>Unicode representations are one common source of surprise for Python programmers: string ordering comparisons (<code>&lt;</code>, <code>&gt;</code>) are based on Unicode numeric values! For example, the Unicode value of <code>'Z'</code> is 90 and the Unicode value of <code>'a'</code> is 97, and so the following holds:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="st">&#39;Z&#39;</span> <span class="op">&lt;</span> <span class="st">&#39;a&#39;</span></span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="va">True</span></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="op">&gt;&gt;&gt;</span> <span class="st">&#39;Zebra&#39;</span> <span class="op">&lt;</span> <span class="st">&#39;animal&#39;</span></span>
<span id="cb3-4"><a href="#cb3-4"></a><span class="va">True</span></span></code></pre></div>
<p>This means that sorting a collection of strings can seem alphabetical, but it treats lowercase and uppercase letters differently, since every uppercase English letter has a smaller Unicode value than every lowercase one:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">sorted</span>({<span class="st">&#39;David&#39;</span>, <span class="st">&#39;Mario&#39;</span>, <span class="st">&#39;Jacqueline&#39;</span>})</span>
<span id="cb4-2"><a href="#cb4-2"></a>[<span class="st">&#39;David&#39;</span>, <span class="st">&#39;Jacqueline&#39;</span>, <span class="st">&#39;Mario&#39;</span>]</span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="op">&gt;&gt;&gt;</span> <span class="bu">sorted</span>({<span class="st">&#39;david&#39;</span>, <span class="st">&#39;Mario&#39;</span>, <span class="st">&#39;Jacqueline&#39;</span>})</span>
<span id="cb4-4"><a href="#cb4-4"></a>[<span class="st">&#39;Jacqueline&#39;</span>, <span class="st">&#39;Mario&#39;</span>, <span class="st">&#39;david&#39;</span>]</span></code></pre></div>
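<p>If a case-insensitive ordering is what you want, one common approach (sketched here, not the only option) is to pass <code>sorted</code> a <code>key</code> function so that lowercased copies of the strings are compared. The variable names below are hypothetical, and a list is used in place of a set.</p>

```python
names = ['david', 'Mario', 'Jacqueline']

# Default ordering compares Unicode numbers, so uppercase letters come first.
default_order = sorted(names)

# Comparing lowercased copies ignores case; the original strings are returned.
ignoring_case = sorted(names, key=str.lower)

print(default_order)         # ['Jacqueline', 'Mario', 'david']
print(ignoring_case)         # ['david', 'Jacqueline', 'Mario']
print(ord('Z'), ord('a'))    # 90 97
```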
<!-- Python also provides us with a built-in function to convert an integer value into a binary value.
This lets us see the sequence of 0s and 1s that would have been used decades ago when making holes in punch tape.
The built-in function is called `bin`, and we need to pass it an integer.
```python
>>> bin(0)
'0b0'
>>> bin(1)
'0b1'
>>> bin(2)
'0b10'
```
The result is a string where a sequence of 0s and 1s are prefixed with `'0b'`.
Let us now use this to find out the binary sequence of the letters `A` and `a`.
```python
>>> unicode_A = ord('A')
>>> bin(unicode_A)
'0b1000001'
>>> bin(ord('a'))
'0b1100001'
``` -->
<h2 id="references">References</h2>
<ul>
<li><a href="https://www.ascii-code.com/">ASCII Code: The extended ASCII table</a></li>
<li><a href="https://unicode-table.com/en/">Unicode Character Table</a></li>
</ul>
</section>
<footer>
<a href="https://www.teach.cs.toronto.edu/~csc110y/fall/notes/">CSC110 Course Notes Home</a>
</footer>
</body>
</html>