Missing Semester Lecture 9 - Security and Cryptography
MIT, The Missing Semester of Your CS Education, Lecture 9: Security and Cryptography
Hash functions
A cryptographic hash function maps data of arbitrary size to a fixed size, and has some special properties. A rough specification of a hash function is as follows:
hash(value: array<byte>) -> vector<byte, N>  (for some fixed N)
An example of a hash function is SHA1, which is used in Git. It maps arbitrary-sized inputs to 160-bit outputs (which can be represented as 40 hexadecimal characters). We can try out the SHA1 hash on an input using the sha1sum command:
$ printf 'hello' | sha1sum
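The same digest can also be computed from a program. The following is a minimal sketch, assuming Python 3 and its standard hashlib module (the lecture itself only uses the command-line tool):

```python
import hashlib

# SHA-1 of the 5-byte input b"hello", the same bytes that printf 'hello' emits.
digest = hashlib.sha1(b"hello").hexdigest()

print(digest)           # 40 hexadecimal characters
print(len(digest) * 4)  # each hex character encodes 4 bits, so 160 bits

# Deterministic: hashing the same input again gives the same output.
assert digest == hashlib.sha1(b"hello").hexdigest()
```

The output has the same length no matter how long the input is, which is the fixed-size property from the specification above.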
At a high level, a hash function can be thought of as a hard-to-invert random-looking (but deterministic) function (and this is the ideal model of a hash function). A hash function has the following properties:
- Deterministic: the same input always generates the same output.
- Non-invertible: it is hard to find an input m such that hash(m) = h for some desired output h.
- Target collision resistant: given an input m_1, it’s hard to find a different input m_2 such that hash(m_1) = hash(m_2).
- Collision resistant: it’s hard to find two inputs m_1 and m_2 such that hash(m_1) = hash(m_2) (note that this is a strictly stronger property than target collision resistance).
Note: while it may work for certain purposes, SHA-1 is no longer considered a strong cryptographic hash function. You might find this table of lifetimes of cryptographic hash functions interesting. However, note that recommending specific hash functions is beyond the scope of this lecture. If you are doing work where this matters, you need formal training in security/cryptography.
Applications
- Git, for content-addressed storage. The idea of a hash function is a more general concept (there are non-cryptographic hash functions). Why does Git use a cryptographic hash function?
- A short summary of the contents of a file. Software can often be downloaded from (potentially less trustworthy) mirrors, e.g. Linux ISOs, and it would be nice to not have to trust them. The official sites usually post hashes alongside the download links (that point to third-party mirrors), so that the hash can be checked after downloading a file.
- Commitment schemes. Suppose you want to commit to a particular value, but reveal the value itself later. For example, I want to do a fair coin toss “in my head”, without a trusted shared coin that two parties can see. I could choose a value r = random(), and then share h = sha256(r). Then, you could call heads or tails (we’ll agree that even r means heads, and odd r means tails). After you call, I can reveal my value r, and you can confirm that I haven’t cheated by checking that sha256(r) matches the hash I shared earlier.
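The coin-toss commitment scheme can be sketched in a few lines. This is an illustration only, assuming Python 3; the helper names commit, check_reveal and coin are made up for this example, and r is drawn as 32 random bytes so that it cannot be guessed from h by brute force:

```python
import hashlib
import secrets

def commit(r: bytes) -> str:
    """Return the commitment h = sha256(r) as a hex string."""
    return hashlib.sha256(r).hexdigest()

def check_reveal(r: bytes, h: str) -> bool:
    """Confirm that the revealed value r matches the earlier commitment h."""
    return secrets.compare_digest(commit(r), h)

def coin(r: bytes) -> str:
    """Even r means heads, odd r means tails (parity is decided by the last byte)."""
    return "heads" if r[-1] % 2 == 0 else "tails"

# I pick r at random and share only h.
r = secrets.token_bytes(32)
h = commit(r)

# You call heads or tails; afterwards I reveal r and you check it against h.
call = "heads"
assert check_reveal(r, h)
print("you win" if coin(r) == call else "I win")
```

If r were drawn from a small range (say, a single digit), the other party could hash every candidate and learn r before calling, so the size of the random value matters.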
Key derivation functions
A related concept to cryptographic hashes, key derivation functions (KDFs) are used for a number of applications, including producing fixed-length output for use as keys in other cryptographic algorithms. Usually, KDFs are deliberately slow, in order to slow down offline brute-force attacks.
Applications
- Producing keys from passphrases for use in other cryptographic algorithms (e.g. symmetric cryptography, see below).
- Storing login credentials. Storing plaintext passwords is bad; the right approach is to generate and store a random salt salt = random() for each user, store KDF(password + salt), and verify login attempts by re-computing the KDF given the entered password and the stored salt.
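The credential-storage flow above can be sketched as follows. This assumes Python 3 and uses PBKDF2 from the standard hashlib module as the KDF; the choice of PBKDF2, the iteration count, and the function names are assumptions made for this example, not recommendations from the lecture. Note that PBKDF2 takes the salt as a separate argument rather than concatenating it with the password:

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, chosen only to keep the example quick

def hash_password(password: str, salt: bytes) -> bytes:
    """Deliberately slow KDF over the password and the per-user salt."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)

def register(password: str):
    """Generate a random salt and return (salt, derived key) to store for the user."""
    salt = os.urandom(16)
    return salt, hash_password(password, salt)

def verify(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-compute the KDF with the stored salt and compare in constant time."""
    return hmac.compare_digest(hash_password(password, salt), stored)

salt, stored = register("correct horse")
print(verify("correct horse", salt, stored))  # True
print(verify("wrong guess", salt, stored))    # False
```

Because each user gets a different salt, two users with the same password end up with different stored values.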
Symmetric cryptography
Hiding message contents is probably the first concept you think about when you think about cryptography. Symmetric cryptography accomplishes this with the following set of functionality:
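keygen() -> key  (this function is randomized)
encrypt(plaintext: array<byte>, key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, key) -> array<byte>  (the plaintext)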
The encrypt function has the property that given the output (ciphertext), it’s hard to determine the input (plaintext) without the key. The decrypt function has the obvious correctness property, that decrypt(encrypt(m, k), k) = m.
An example of a symmetric cryptosystem in wide use today is AES.
Applications
- Encrypting files for storage in an untrusted cloud service. This can be combined with KDFs, so you can encrypt a file with a passphrase. Generate key = KDF(passphrase), and then store encrypt(file, key).
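To make the keygen/encrypt/decrypt interface and the passphrase workflow concrete, here is a toy sketch in Python 3. Python's standard library has no AES, so this example swaps in a made-up stream cipher built from SHA-256. It is an assumption for illustration only, it is not AES, and it should not be used to protect real data:

```python
import hashlib
import secrets

def keygen() -> bytes:
    """Randomized: returns a fresh 256-bit key."""
    return secrets.token_bytes(32)

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce, and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def encrypt(plaintext: bytes, key: bytes) -> bytes:
    """XOR the plaintext with the keystream; a random nonce is prepended."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))

def decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """Split off the nonce, regenerate the keystream, and XOR again."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))

# key = KDF(passphrase), then store encrypt(file, key).
salt = secrets.token_bytes(16)  # must be kept next to the ciphertext
key = hashlib.pbkdf2_hmac("sha256", b"my passphrase", salt, 100_000)
file_contents = b"contents of the file"
stored = encrypt(file_contents, key)

# Correctness property: decrypt(encrypt(m, k), k) = m
assert decrypt(stored, key) == file_contents
```

In practice you would reach for a vetted implementation of an established cipher such as AES, for example through the openssl tool used in the exercises.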
Asymmetric cryptography
The term “asymmetric” refers to there being two keys, with two different roles. A private key, as its name implies, is meant to be kept private, while the public key can be publicly shared and it won’t affect security (unlike sharing the key in a symmetric cryptosystem). Asymmetric cryptosystems provide the following set of functionality, to encrypt/decrypt and to sign/verify:
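keygen() -> (public key, private key)  (this function is randomized)
encrypt(plaintext: array<byte>, public key) -> array<byte>  (the ciphertext)
decrypt(ciphertext: array<byte>, private key) -> array<byte>  (the plaintext)
sign(message: array<byte>, private key) -> array<byte>  (the signature)
verify(message: array<byte>, signature: array<byte>, public key) -> bool  (whether or not the signature is valid)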
The encrypt/decrypt functions have properties similar to their analogs from symmetric cryptosystems. A message can be encrypted using the public key. Given the output (ciphertext), it’s hard to determine the input (plaintext) without the private key. The decrypt function has the obvious correctness property, that decrypt(encrypt(m, public key), private key) = m.
Symmetric and asymmetric encryption can be compared to physical locks. A symmetric cryptosystem is like a door lock: anyone with the key can lock and unlock it. Asymmetric encryption is like a padlock with a key. You could give the unlocked lock to someone (the public key), they could put a message in a box and then put the lock on, and after that, only you could open the lock because you kept the key (the private key).
The sign/verify functions have the same properties that you would hope physical signatures would have, in that it’s hard to forge a signature. No matter the message, without the private key, it’s hard to produce a signature such that verify(message, signature, public key) returns true. And of course, the verify function has the obvious correctness property that verify(message, sign(message, private key), public key) = true.
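The two correctness properties can be checked with a toy textbook-RSA sketch in Python 3 (3.8 or later for the modular inverse). Everything here is an assumption made for illustration: the primes are fixed, publicly known Mersenne primes instead of random secret ones, there is no padding, and messages are plain integers, so this shows only the shape of the interface and is not secure:

```python
import hashlib

def keygen():
    """Toy keygen: fixed, well-known primes instead of random secret ones."""
    p, q = 2**61 - 1, 2**89 - 1        # two Mersenne primes
    n, phi = p * q, (p - 1) * (q - 1)
    e = 65537
    d = pow(e, -1, phi)                # modular inverse of e
    return (n, e), (n, d)              # (public key, private key)

def encrypt(m: int, public_key) -> int:
    n, e = public_key
    return pow(m, e, n)

def decrypt(c: int, private_key) -> int:
    n, d = private_key
    return pow(c, d, n)

def _digest(message: bytes, n: int) -> int:
    """Hash the message and reduce it to an integer below n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message: bytes, private_key) -> int:
    n, d = private_key
    return pow(_digest(message, n), d, n)

def verify(message: bytes, signature: int, public_key) -> bool:
    n, e = public_key
    return pow(signature, e, n) == _digest(message, n)

public_key, private_key = keygen()

# decrypt(encrypt(m, public key), private key) = m
m = int.from_bytes(b"hi", "big")
assert decrypt(encrypt(m, public_key), private_key) == m

# verify(message, sign(message, private key), public key) = true
sig = sign(b"release v1.0", private_key)
assert verify(b"release v1.0", sig, public_key)
assert not verify(b"release v1.1", sig, public_key)
```

Anyone holding only public_key can run encrypt and verify, while decrypt and sign need private_key, which is the asymmetry of roles described above.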
Applications
- PGP email encryption. People can have their public keys posted online (e.g. in a PGP keyserver, or on Keybase). Anyone can send them encrypted email.
- Private messaging. Apps like Signal and Keybase use asymmetric keys to establish private communication channels.
- Signing software. Git can have GPG-signed commits and tags. With a posted public key, anyone can verify the authenticity of downloaded software.
Key distribution
Asymmetric-key cryptography is wonderful, but it has a big challenge of distributing public keys / mapping public keys to real-world identities. There are many solutions to this problem. Signal has one simple solution: trust on first use, and support out-of-band public key exchange (you verify your friends’ “safety numbers” in person). PGP has a different solution, which is web of trust. Keybase has yet another solution of social proof (along with other neat ideas). Each model has its merits; we (the instructors) like Keybase’s model.
SSH
We’ve covered the use of SSH and SSH keys in an earlier lecture. Let’s look at the cryptography aspects of this.
When you run ssh-keygen, it generates an asymmetric keypair, public_key, private_key. This is generated randomly, using entropy provided by the operating system (collected from hardware events, etc.). The public key is stored as-is (it’s public, so keeping it a secret is not important), but at rest, the private key should be encrypted on disk. The ssh-keygen program prompts the user for a passphrase, and this is fed through a key derivation function to produce a key, which is then used to encrypt the private key with a symmetric cipher.
In use, once the server knows the client’s public key (stored in the .ssh/authorized_keys file), a connecting client can prove its identity using asymmetric signatures. This is done through challenge-response. At a high level, the server picks a random number and sends it to the client. The client then signs this message and sends the signature back to the server, which checks the signature against the public key on record. This effectively proves that the client is in possession of the private key corresponding to the public key that’s in the server’s .ssh/authorized_keys file, so the server can allow the client to log in.
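The challenge-response exchange can be sketched as below. This is a simplified model in Python 3 (3.8 or later), not the actual SSH protocol: the toy RSA keypair with fixed Mersenne primes, the function names, and the in-memory authorized_keys list are all assumptions made for this example.

```python
import hashlib
import secrets

# Toy client keypair (fixed public primes, for illustration only; not secure).
P, Q, E = 2**61 - 1, 2**89 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent, known only to the client

# The server's record of the client's public key (stands in for authorized_keys).
authorized_keys = [(N, E)]

def digest(message: bytes, n: int) -> int:
    """Hash the message and reduce it to an integer below n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def client_sign(challenge: bytes) -> int:
    """Client: sign the server's challenge with the private key."""
    return pow(digest(challenge, N), D, N)

def server_verify(challenge: bytes, signature: int, keys) -> bool:
    """Server: accept if the signature verifies under any public key on record."""
    return any(pow(signature, e, n) == digest(challenge, n) for n, e in keys)

# 1. The server picks a random challenge and sends it to the client.
challenge = secrets.token_bytes(32)
# 2. The client signs it and sends the signature back.
signature = client_sign(challenge)
# 3. The server checks the signature against the public key on record.
print(server_verify(challenge, signature, authorized_keys))  # True
```

Because the challenge is fresh for every login, a signature captured from an earlier session does not verify against a new challenge.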
Exercises
Entropy.
- Suppose a password is chosen as a concatenation of five lower-case dictionary words, where each word is selected uniformly at random from a dictionary of size 100,000. An example of a password in this style (shown here with only four words) is correcthorsebatterystaple. How many bits of entropy does this have?
  The entropy is equal to log_2(# of possibilities). The total number of possibilities is 100,000 ** 5 (100,000 raised to the power of 5), so the entropy is approximately 83 bits.
- Consider an alternative scheme where a password is chosen as a sequence of 8 random alphanumeric characters (including both lower-case and upper-case letters). An example is rg8Ql34g. How many bits of entropy does this have?
  For an alphanumeric character, there are in total 26+26+10=62 possibilities. Since there are 8 random characters, the total number of possibilities is 62 ** 8, so the entropy is approximately 48 bits.
- Which is the stronger password?
  The first one is stronger since its entropy is higher.
- Suppose an attacker can try guessing 10,000 passwords per second. On average, how long will it take to break each of the passwords?
  In a year, the attacker can try guessing 365*24*3600*10000=315360000000 passwords. Exhausting the whole space takes approximately (2**83)/315360000000=3*10^13 years for the first password and approximately (2**48)/315360000000=893 years for the second. On average the attacker succeeds after searching half of the space, so the expected times are about half of these figures.
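The arithmetic for the entropy exercise can be checked with a short script. This is a minimal sketch assuming Python 3; the function and variable names are made up for this example. It uses the exact number of possibilities rather than the rounded powers of two, so the year counts come out slightly different from the rounded figures in the answer:

```python
import math

GUESSES_PER_SECOND = 10_000
GUESSES_PER_YEAR = 365 * 24 * 3600 * GUESSES_PER_SECOND

def entropy_bits(possibilities: int) -> float:
    """Entropy of a uniformly random choice among the given possibilities."""
    return math.log2(possibilities)

def years_to_exhaust(possibilities: int) -> float:
    """Time to try every possibility at the assumed guessing rate."""
    return possibilities / GUESSES_PER_YEAR

words = 100_000 ** 5   # five words from a 100,000-word dictionary
alnum = 62 ** 8        # eight characters from a 62-symbol alphabet

for name, space in [("five words", words), ("8 alphanumeric", alnum)]:
    print(f"{name}: {entropy_bits(space):.1f} bits, "
          f"{years_to_exhaust(space):.3g} years to exhaust, "
          f"{years_to_exhaust(space) / 2:.3g} years on average")
```

The average case is half the exhaustive case because, for a uniformly chosen password, the attacker expects to hit it halfway through the search.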
Cryptographic hash functions. Download a Debian image from a mirror (e.g. from this Argentinean mirror). Cross-check the hash (e.g. using the sha256sum command) with the hash retrieved from the official Debian site (e.g. this file hosted at debian.org, if you’ve downloaded the linked file from the Argentinean mirror).
$ sha256sum debian-12.7.0-amd64-netinst.iso
8fde79cfc6b20a696200fc5c15219cf6d721e8feb367e9e0e33a79d1cb68fa83 debian-12.7.0-amd64-netinst.iso
$ cat SHA256SUMS.txt | grep debian-12.7.0-amd64-netinst.iso
8fde79cfc6b20a696200fc5c15219cf6d721e8feb367e9e0e33a79d1cb68fa83 debian-12.7.0-amd64-netinst.iso
The hash computed by sha256sum matches the one published on the official site, so the file downloaded from the mirror is the same as the official image.
Symmetric cryptography. Encrypt a file with AES encryption, using OpenSSL:
openssl aes-256-cbc -salt -in {input filename} -out {output filename}. Look at the contents using cat or hexdump. Decrypt it with openssl aes-256-cbc -d -in {input filename} -out {output filename} and confirm that the contents match the original using cmp.
$ touch test_aes
$ vim test_aes # content: test aes
$ openssl aes-256-cbc -salt -in test_aes -out test_aes_out
enter aes-256-cbc encryption password:
Verifying - enter aes-256-cbc encryption password:
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
$ cat test_aes_out
▒9:▒h▒▒2▒#▒R▒▒vn▒
$ openssl aes-256-cbc -d -in test_aes_out -out test_aes_in
enter aes-256-cbc decryption password:
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
$ cat test_aes_in
test aes
$ cmp test_aes test_aes_in
cmp prints nothing, which means the decrypted file is identical to the original.