To the batmobile math-mobile!^{1}
A “probability distribution” describes data associated with an uncertain process. The probability of heads and tails describes the uncertain process of flipping a coin. We distribute the total probability—which must add up to 100%^{2}—between the two possible outcomes. The distribution can also describe the relationship between inputs and outputs. For example, we could describe the process of a leaf falling to the ground with a distribution of starting heights and corresponding landing locations.
These “processes” don’t have to be real—they just have to be useful. We can imagine that there is a fictional “bear process” that produces images of bears. The probability distribution would describe the complex relationship between all the individual pixels that makes them bears versus anything else. This doesn’t exactly mirror a process in nature, but it is useful for modeling the “bearness” of an image.^{3}
For simple, theoretical distributions, like that of a coin flip, we can write a precise mathematical expression for the probability of every possible outcome. However, for complex problems like figuring out how to gauge the “bearness” of every possible set of pixel values, this is impossible.
Instead we rely on the “empirical distribution,” which is a collection of observations produced by the uncertain process we are studying. When we observe this data about the process, we are observing “samples” from the corresponding probability distribution. Importantly, we never observe the true distribution—we only see it implicitly as it appears in the data we collect. The more data we have, the better we can approximate the true distribution.
Modeling a probability distribution from observed samples lets us ask questions about the distribution or process. For example, we might ask, “How densely packed are bear images around this test image?” If the bear images are dense at this point, the test image probably contains a bear. If the bear images are sparse, the test image is probably not a bear. This is called “density estimation.”
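To make density estimation concrete, here is a minimal one-dimensional sketch in Python. Everything in it is invented for illustration (real bear images live in a space with thousands of dimensions, not one): we average a Gaussian bump centered on each observed sample to measure how densely packed the data is around a test point.

```python
import numpy as np

def density_at(point, samples, bandwidth=0.5):
    """Kernel density estimate: average a Gaussian bump centered
    on each observed sample to measure how densely packed the
    data is around `point`."""
    z = (point - samples) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean()

rng = np.random.default_rng(0)
bears = rng.normal(loc=0.0, scale=1.0, size=5000)  # stand-in for "bear images"

# Dense region (near the data) vs. sparse region (far out in the tail):
print(density_at(0.0, bears))  # high density: probably a bear
print(density_at(4.0, bears))  # low density: probably not a bear
```

In high dimensions this brute-force averaging stops being practical, which is exactly why we turn to learned models instead.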
In some cases, we might ask, “What other data could we expect from the underlying process?” We might want to generate more examples of where a leaf could land, or more bear images. This is called “sampling.”^{4}
Finally, we might ask, “Given a partial or incomplete observation, what do we expect to fill in the blanks?” We could have a bear image that is missing pixels and want to find the likely values of those pixels to “fix” the picture. We might have information about where a leaf landed and want to find likely starting heights given those landing locations. Inversely, we might have information about the starting height and want to find likely landing points. This is called inference.
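Inference can also be sketched in a few lines of Python. Here is a toy, entirely made-up leaf process (the heights, landing spots, and “physics” are invented for illustration): given where the leaf landed, we look at which starting heights tend to produce that landing spot.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "leaf process": a leaf dropped from higher up drifts
# farther sideways, plus some random wind.
heights = rng.uniform(1.0, 10.0, size=20_000)   # starting heights (m)
landings = 0.5 * heights + rng.normal(0.0, 0.5, size=heights.size)

# Inference: the leaf landed near x = 4. Which heights likely produced it?
near = np.abs(landings - 4.0) < 0.25
likely_heights = heights[near]
print(likely_heights.mean())  # close to 8, since landing ≈ height / 2
```

Conditioning on the observed landing spot and inspecting the matching samples is the crudest possible form of inference, but it is the same question a learned model answers far more efficiently.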
Here’s a neat little picture (with bears) to explain density estimation, sampling, and inference:
Thinking about things in terms of probability distributions makes you a nerd gives you the tools to understand the world around you in new and powerful ways. If you can describe a process or data in terms of probability distributions (called “modeling”), you can leverage all the associated math to pose and answer complex questions—like “is this a bear?”
Don’t worry, I’m sure it’s just as cool or cooler than the batmobile. ↩
Forcing the total probability to add up to 100% is just a formal way of saying that something has to happen all of the time. We can’t say that a coin flip will be heads 40% of the time and tails 40% of the time, but nothing happens the remaining 20% of the time. That makes no sense. ↩
I suppose whether you think this is useful is a matter of personal conviction. ↩
You may have heard about “generative AI” or “generative models” that can do this fake data generation. ↩
Since I still write the code on my desktop, I use Makefiles to automatically copy the code to the remote cluster and submit the SLURM job. An example looks something like this:
default:
    ssh myusername@thecluster.edu mkdir -p /home/myusername/myproject
    rsync -av . myusername@thecluster.edu:/home/myusername/myproject
    echo "#!/bin/bash" > tmp.sh
    echo "#SBATCH --time 04:00:00" >> tmp.sh
    echo "#SBATCH --job-name=myproject-name" >> tmp.sh
    echo "#SBATCH --nodes=1" >> tmp.sh
    echo "#SBATCH --ntasks=2 # cores" >> tmp.sh
    echo "#SBATCH --partition=V4V32_SKY32M192_L" >> tmp.sh
    echo "#SBATCH -e slurm-%j.err" >> tmp.sh
    echo "#SBATCH -o slurm-%j.out" >> tmp.sh
    echo "#SBATCH -A my-slurm-project-id-for-billing" >> tmp.sh
    echo "#SBATCH --gres=gpu:1" >> tmp.sh
    echo "python myfile.py ..." >> tmp.sh
    scp tmp.sh myusername@thecluster.edu:/home/myusername/myproject/tmp.sh
    rm tmp.sh
    ssh myusername@thecluster.edu "cd myproject && sbatch tmp.sh"
To do this, I need to pick a “queue” or “partition” on the cluster to submit the job to (like V4V32_SKY32M192_L). These queues differ based on the CPU, GPU, or amount of memory. However, I don’t have strict queue requirements; I just want to use any of the open GPU queues. To help me out, I can include a little bash script magic in my Makefile that will remotely pull all the SLURM queues and filter them to find the idle GPU queues. The SLURM sinfo command produces output like this:
SKY32M192_L up 14-00:00:0 6 drng skylake[026-030,045]
SKY32M192_L up 14-00:00:0 32 alloc skylake[001-002,005-006,009,014,017-018,021-022,025,031-034,036-040,042-044,046-054]
SKY32M192_D up 1:00:00 1 alloc skylake056
P4V16_HAS16M128_L up 3-00:00:00 1 comp gpdnode001
P4V16_HAS16M128_L up 3-00:00:00 1 idle gpdnode002
P4V12_SKY32M192_L up 3-00:00:00 1 comp gphnode002
P4V12_SKY32M192_L up 3-00:00:00 1 mix gphnode008
P4V12_SKY32M192_L up 3-00:00:00 4 alloc gphnode[001,004-006]
P4V12_SKY32M192_L up 3-00:00:00 2 idle gphnode[003,009]
P4V12_SKY32M192_D up 1:00:00 1 idle gphnode010
V4V16_SKY32M192_L up 3-00:00:00 2 alloc gvnode[001-002]
V4V32_SKY32M192_L up 3-00:00:00 4 mix gvnode[003-006]
CAL48M192_L up 14-00:00:0 1 mix cascade004
CAL48M192_L up 14-00:00:0 49 alloc cascade[002-003,005-043,045-052]
CAL48M192_L up 14-00:00:0 1 down cascade044
CAL48M192_D up 1:00:00 1 alloc cascade001
CAC48M192_L up 14-00:00:0 1 down* cascadeb014
CAC48M192_L up 14-00:00:0 59 alloc cascadeb[001-013,015-060]
V4V32_CAS40M192_L up 3-00:00:00 12 mix gvnodeb[001-012]
I want to grab the lines that show idle and gvnode (nodes with V100 GPUs). Then I just want to extract the first token/word on the line to get the queue ID (like V4V32_SKY32M192_L). I can do this in my Makefile with the following line at the top before the default line:
PARTITION := $(shell ssh myusername@thecluster.edu sinfo | grep idle | grep _L | grep gv*node | sort -r | head -1 | awk '{print $$1}')
Now the PARTITION variable in the Makefile has the ID of the idle GPU queue! I use the grep commands to filter down to the right lines, then apply sort -r to prioritize queues with more RAM. Finally, I use head -1 to get just the first line and awk to grab the first word on the line.
Then, when I create my SLURM job script in my Makefile, I can just do:
echo "#SBATCH --partition=$(PARTITION)" >> tmp.sh
Now I can easily ensure I use whatever idle queue is available!
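If you would rather prototype (or unit test) that filtering logic before wiring it into a Makefile, the same idea is easy to sketch in Python. The sample lines below are loosely adapted from the sinfo output above, and the function simply mirrors the grep/sort/head/awk chain:

```python
def pick_idle_gpu_partition(sinfo_text):
    """Mirror `grep idle | grep _L | grep gv*node | sort -r | head -1
    | awk '{print $1}'`: keep idle long-queue (_L) lines on gv* nodes,
    then take the lexicographically largest partition name."""
    candidates = [
        line.split()[0]
        for line in sinfo_text.splitlines()
        if "idle" in line and "_L" in line and "gvnode" in line
    ]
    return max(candidates) if candidates else None

sample = """\
P4V16_HAS16M128_L up 3-00:00:00 1 idle gpdnode002
P4V12_SKY32M192_L up 3-00:00:00 2 idle gphnode[003,009]
V4V16_SKY32M192_L up 3-00:00:00 2 idle gvnode[001-002]
V4V32_CAS40M192_L up 3-00:00:00 12 idle gvnodeb[001-012]
"""
print(pick_idle_gpu_partition(sample))  # V4V32_CAS40M192_L
```

Returning None when nothing is idle also gives you an obvious place to fail loudly instead of submitting to an empty partition name.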
A decision tree is essentially a flow chart where each node asks a yes/no question about the data (e.g., “is the exit velocity > 95 mph?”). It could have multiple layers of “follow-up” questions that increase the resolution of our tree, but eventually we will reach a point where the tree gives us a prediction (e.g., a putout probability of 0.78) instead of asking another question. We can “grow” a decision tree for our prediction problem by first figuring out what input variable and value splits the corresponding target values as evenly as possible. For example, we may find that “launch angle > 40 deg?” splits the data into two groups where the “yes” group was a putout 80% of the time, and the “no” group was a putout 20% of the time. For each group, we could add a second question to further narrow down the prediction. Ideally, we want groups as “pure” as possible: either all putouts or all non-putouts. However, each new branch of the tree only looks at the remaining examples. Eventually, there will only be one example left in each branch.
We can fix this by using a series of small trees which look at all the data and “fix” (or boost) the previous trees. We start by having our model predict a constant value. For example, our fielder could have a 0.6 putout rate on all balls his direction. Next, we boost this prediction by creating a tree that corrects this constant output. The tree might notice that for “launch angle > 40 deg” 0.6 tends to be too low and output a correction of +0.2. For “launch angle <= 40 deg” the tree may add a correction of -0.4. We can repeat this process to boost the boosted tree. If “launch angle <= 40 deg,” but “exit velocity > 75 mph,” the predicted rate of 0.6 – 0.4 = 0.2 may be too low, so the tree adds a second correction of 0.3, and the final prediction is 0.6 – 0.4 + 0.3 = 0.5. This is better than a single tree because each successive boosting tree can make its decision by looking at the error across all the data points rather than just those remaining in each branch.
The last step is defining exactly what “error” each tree should correct. While it might seem obvious that the error is just the difference between actual and predicted, this might vary from problem to problem. A more general approach is to define a “loss function” that we want to minimize (i.e., it tells us the “goodness” of a prediction). The negative gradient of the loss tells us how we should change the prediction to reduce the loss. Thus, we want each successive tree to output the negative gradient (scaled, so we don’t correct too much) to move the prediction a little closer to the goal.
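Here is a tiny sketch of that boosting loop in Python. The data is fabricated, and each “tree” is just a single hand-picked threshold rule whose two leaves output the mean remaining error (the negative gradient of squared loss), so this is the idea, not a production implementation:

```python
import numpy as np

# Fabricated batted-ball data: was the play a putout (1) or not (0)?
launch_angle = np.array([50, 45, 30, 20, 35, 55, 10, 25], dtype=float)
exit_velo    = np.array([90, 85, 80, 70, 95, 88, 60, 72], dtype=float)
putout       = np.array([ 1,  1,  0,  0,  1,  1,  0,  0], dtype=float)

pred = np.full_like(putout, putout.mean())  # step 0: a constant prediction

# Each "tree" is a single threshold rule whose two leaves output the
# mean remaining error (the negative gradient of squared loss).
for feature, threshold in [(launch_angle, 40.0), (exit_velo, 75.0)]:
    residual = putout - pred          # what the previous model got wrong
    above = feature > threshold
    pred[above]  += residual[above].mean()
    pred[~above] += residual[~above].mean()

print(np.round(pred, 2))
```

Real gradient boosting libraries grow the splits automatically, run many more rounds, and shrink each correction with a small learning rate so no single tree overcorrects.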
In less technical language:
Decision trees are flow charts that ask yes/no questions about the input data to make predictions. For example, we could predict a fielder’s putout rate by creating a flow chart asking if the launch angle exceeded 40 deg or if the exit velocity was less than 75 mph and making our prediction based on how the fielder performed in that scenario. Just one tree (even a big one) has limits, but we can circumvent this by creating many trees that each slightly correct the prediction of the previous one. If our data has relatively few variables, this approach gives us a straightforward, understandable model for making predictions about the fielder’s putout rate.
Let’s pretend the blockchain is a LEGO tower, and the network of computers is a group of friends. They don’t want one person to be in control of adding new blocks to the tower, because that person could just take over and build whatever they want. To prevent this, each person builds a copy of the tower and adds new blocks to it independently. Every time a person adds a block, that person lets everyone else know and specifies which block the new one should go on top of. Everyone then adds the block to their personal tower to keep things in sync.
What happens if two people add a new block at the same time? Which block should we use? We use both, and the tower temporarily splits. When someone adds the next block, they will have to pick which branch to build on top of. Once they pick a branch, we will discard the other, shorter branch. We always pick the tallest tower.
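The “always pick the tallest tower” rule is simple enough to state in a couple of lines of Python (the block names here are made up):

```python
# Toy fork resolution: every participant keeps the tallest branch.
branch_a = ["genesis", "b1", "b2", "b3"]            # the original branch
branch_b = ["genesis", "b1", "b2x", "b3x", "b4x"]   # a competing, taller branch

canonical = max(branch_a, branch_b, key=len)
print(canonical[-1])  # b4x: the shorter branch gets discarded
```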
However, even with this approach, one person could still monopolize construction. In fact, they could erase and rebuild whole sections of the tower by picking an old block to start from and adding enough new blocks on top of it to make this new branch the tallest one. Everyone else would then discard the original (and now shorter) branch of the tower in favor of the new one. Thus, if one person wanted to replace the last five blocks of the tower, that person would just pick the sixth block from the top as the new starting point and add six new blocks before anyone else added one block to the original tower.
The most popular method of preventing this is called proof of work, which means that we make it difficult and time-consuming to add new blocks. If it takes a long time to make a new block, the chance of a bad actor^{1} making six blocks before anyone else makes just one is extremely low. The bad actor would need more computing power than everyone else combined.
How do cryptocurrencies use the blockchain? For crypto, each block of the tower contains a list of recent transactions. If we follow the transactions from the base of the tower up to the tallest branch, we can trace the flow of all the “money” and figure out how much everyone has in their wallets right now. When someone initiates a new transaction, they add it to a list of pending transactions. Once this list gets long enough, people in the network try to “mine” a new block that contains the transactions and that fits onto the top of the tower. Creating a block takes a long time because the new block must perfectly fit onto the existing top of the tower. Thanks to some math from cryptography (hence the name crypto), it is pretty easy to check if a block fits on the tower but impossible to calculate a block ahead of time that you know will fit. Thus, you have to use trial and error to look for a valid block, which takes a long time and a lot of computing power.
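The trial-and-error mining loop can be sketched with Python’s hashlib. Here a block “fits” when its SHA-256 hash starts with a few zeros (real chains use the same flavor of test with a vastly harder target), and the transaction string is invented:

```python
import hashlib

def mine(prev_hash, transactions, difficulty=4):
    """Search nonces until the block's hash starts with `difficulty`
    zeros: slow to find, but a single hash call to verify."""
    nonce = 0
    while True:
        block = f"{prev_hash}|{transactions}|{nonce}".encode()
        digest = hashlib.sha256(block).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1

nonce, digest = mine("genesis", "alice->bob:5")
print(nonce, digest[:12])

# Verifying that the block "fits" takes exactly one hash call:
check = hashlib.sha256(f"genesis|alice->bob:5|{nonce}".encode()).hexdigest()
assert check == digest
```

The asymmetry is the whole point: with difficulty 4 the miner tries tens of thousands of nonces on average, while every verifier does one hash.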
The difficulty of finding valid blocks means that once a block of transactions is added to the crypto chain, it is virtually impossible to go back and “undo” old transactions to steal back any money. This would require that you go back to the block you want to replace, re-mine it and all of the blocks that come after it before anyone else mines just one more block on top of the original chain. This is absurdly difficult.
Why go through all the trouble of building a blockchain? For many people, the appeal of a blockchain (and crypto, by extension) is that no single person is “in charge” of the chain. It is decentralized. Since no one is in control, you don’t have to trust anyone who participates in the chain. In contrast, if a central bank keeps a record of transactions to decide how much money you have, you have to trust the bank’s accounting. If the bank nefariously decides to undo a transaction (or if someone hacks into it), you’re sunk. Of course, in an economy where people have the freedom of choosing a bank, a bank that occasionally dips into your funds would quickly fold since people would just hop to another bank as quickly as possible. However, if you are particularly paranoid that the government will take control of the financial system or freeze your funds, crypto might seem worth a look.
Clearly, crypto itself has no intrinsic value. The worth is dictated by how valuable people decide this decentralized currency idea is. If you trust the centralized governments that issue and sustain the value of traditional currency, then crypto is a bit ridiculous. Why would you give up the security of a currency backed by the United States of America in favor of digital monopoly money? However, if you don’t trust the governments of the world to regulate the value of currency, then crypto might be appealing. The tradeoff is that the value of your digital assets is now at the mercy of a distributed network that lacks a central authority declaring it legal tender.
In economic terms, if we collectively decide that the idea behind crypto is important, the value will go up because we will be willing to spend more dollars to get the same amount of crypto. However, if we decide crypto isn’t useful, we won’t be willing to convert our dollars into it, and supply and demand dictates that the value of the crypto will have to drop to reenergize demand.
Obviously, a high level overview like this can’t fully capture the nuances of blockchain and crypto. However, understanding the mechanics of the blockchain can clear up the confusion around this trendy topic and remind everyone that getting into crypto doesn’t have much to do with you being the next bold explorer.
Should we tell Matt Damon?
Probably Nicolas Cage. ↩
All three of these elements affect the final result. Just a few good examples could make a huge model perform well, while a ton of bad data could make the same model fail miserably. Changing up the optimization algorithm could let us find a good solution faster or miss it entirely. Using a model designed for the wrong task will put the whole system behind the sticks from the start, regardless of how good the data or optimization is.
One way to visualize this whole process is designing a landscape so that if you randomly drop a marble, it will always end up at the exact spot you want it to go. The marble’s final resting place is the solution that deep learning arrives at, and the target location represents the solution we actually want it to find. The world that we build controls what happens to the marble and how consistently close it can get to the desired destination.
The data defines the mountains and valleys of the landscape. With good, clean data, we might have terrain that directs the marble to the right solution every single time. However, we likely have noisy data that translates into bumpy ground that might even have big holes that could swallow up our marble before it gets close to the goal.
Changing the model might involve putting constraints on what the solutions can look like. This would be like planting a row of trees in some areas of the landscape that physically prevent the marble from reaching those positions. This keeps the marble from exploring regions that we know are pointless. (In technical terms, models have inductive bias that limits the possible solutions they can find.)
Changing the optimization algorithm would be like changing the physics of this imaginary world. Maybe the marble rolls faster, which means it could get to the desired destination quickly but then overshoot the target. Alternatively, the marble might roll slowly enough to always hit the right solution, but so slowly the approach isn’t practical.
Making deep learning models do the right thing means sculpting this imaginary landscape and recognizing that each part of the system influences the overall success. As an example, one area of deep learning focuses on recognizing something even if you only have a few examples of it (few-shot learning). While you could train a model on a bunch of examples of things you know about, that model will not do well when shown examples of things it hasn’t seen before. Instead, we can switch up the approach to the data by showing the network a few examples of a lot of different things during training. The model then learns more about what differentiates things rather than the specific characteristics of a few categories. We didn’t change the model at all, but the resulting solution is much better suited for our desired task.
Visualizing deep learning as a virtual landscape that can be molded and tweaked helps me build intuition for how all these factors work together. Maybe recognizing the interconnectedness of these three aspects of deep learning can help us explore new, unique research ideas that could solve some of the biggest problems in artificial intelligence today.
Or maybe I should just go into landscaping 🌱
Popular candidates include roate, arose, soare, later, and adieu.
Unfortunately, none of them are the best initial guess.
Instead, I would like to propose that the best initial guess for Wordle is, unequivocally, …
… complicated.
Wait, don’t leave, I’ll explain. The whole question hinges on a critically important word but not a five-letter word. It’s the word “best”–a word that is straight-up dangerous at worst and absurdly ambiguous and unhelpful at bes–wait, uh, nevermind.
For Wordle, “best” could mean any number of things. The best starting word could be:
This is not an exhaustive list, either. The point is that it doesn’t make sense to claim that a word is the “best” without carefully defining the problem you are solving. With a clear definition, you will be able to tell whether a particular solution is truly the best you can do. To make matters even worse, even if you can define the problem, that doesn’t mean you can practically find the solution. If that’s the case, you have to make simplifying assumptions so the problem becomes “tractable.”
Of course, the usefulness of your solution depends on how realistic your assumptions are. For example, in the case of Wordle, a reasonable assumption might be that solving the puzzle is nearly equivalent to finding the five letters in the word (regardless of order) since there aren’t many anagrams of the same five letters. Once you know the letters, you can quickly figure out the solution. This rule of thumb that simplifies the problem is known as a heuristic.
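As a toy illustration of that letters-first heuristic, here is a sketch that scores candidate words by how much letter frequency their unique letters cover. The six-word list is mine, purely for demonstration; a real analysis would score against the full Wordle dictionary:

```python
from collections import Counter

words = ["arose", "soare", "later", "adieu", "unity", "fuzzy"]

# Count each letter once per word, matching the "find the five letters"
# heuristic (duplicates within a word add no new information).
letter_freq = Counter()
for w in words:
    letter_freq.update(set(w))

def coverage(word):
    return sum(letter_freq[c] for c in set(word))

print(sorted(words, key=coverage, reverse=True))
```

Notice that this scoring cannot break the tie between anagram-mates like arose and soare, which is exactly the “order doesn’t matter” assumption at work.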
To be fair, most of the in-depth articles about Wordle starting guesses do recognize that there isn’t just one way to approach the question, and it’s possible that a word could be “best” for multiple objectives.
Personally, I like starting with arose or soare and then following it up with unity. Those words cover the vowels + y and some more common consonants.
Obviously, the lesson here isn’t limited to Wordle. We encounter claims of “best” all the time, whether in advertising or decision making. Properly evaluating these claims requires that we ask the question, “Best for what?” or “Best in what sense?” Discerning the answer is critical for understanding the significance of the claims and the action steps that should follow.
Like what word you should guess first in a viral online word game.
The best place would be somewhere on the opposite side of the Earth from the Sun in an orbit with the same period (duration) as the Earth. This would keep the object in the same position relative to the Earth and the Sun, allowing us to predictably point it away from both and always stay in contact.
We need to know a few things to solve this problem. First, since moving objects don’t change direction without an outside force,^{1} if an object moves in a circle (constantly changing direction), there must be a constant force acting on it. We call this the centripetal force. It’s pretty easy to calculate:
\[F_c = \frac{mv^2}{r}\]For objects in orbit, gravity provides the centripetal force. We can calculate the force of gravity with:
\[F_g = \frac{Gm_1 m_2}{r^2}\]where $G$ is the universal gravitational constant, $m_1$ and $m_2$ are the masses of the objects, and $r$ is the distance between them. Since gravity is providing the centripetal force, we can then set these two equations equal to each other, since the forces should be equal. I’ll use $M$ to represent the mass of the body being orbited and $m$ to represent the mass of the thing doing the orbiting.
\[\frac{GMm}{r^2} = \frac{m v^2} {r}\]Before we go further, it isn’t very intuitive to work with velocity $v$ in this expression. It would be more convenient to instead think of the angular velocity of the object, which is directly connected to how quickly the object completes an orbit.
\[\frac { v\text{ meters}}{1 \text{ second}} \times \frac{1 \text{ orbit}}{2\pi r \text{ meters}} \times \frac{360 \text{ degrees}}{1\text{ orbit}} = \frac{180 v}{\pi r}\ \text{degrees}/\text{second} = \omega\]Next, we solve for $v$ so we can substitute that into our force equation:
\[v = \frac{\pi r \omega}{180}\]Substituting we get
\[\begin{aligned} \frac{GMm}{r^2} &= \frac{m (\pi r \omega / 180)^2} {r} \\ &= \frac{m \pi^2 \omega^2 r^2}{180^2 r} \\ &= \frac{m\pi^2\omega^2r}{180^2} \end{aligned}\]We can simplify this even more since $m$ appears in the numerator of both sides–we can divide it out.
\[\frac{GM}{r^2} = \frac{\pi^2\omega^2r}{180^2}\]This is great–it means that our result doesn’t depend on the mass $m$ of the orbiting object and will hold for any spacecraft of any size.
Let’s see how we can use this equation with a simple example. We want to launch a satellite to the right altitude such that it moves at the same speed that the Earth rotates. It will appear to hover over one location–we call this a geostationary orbit. The Earth rotates at a rate of once per day, or $360 \text{ degrees} / 86400 \text{ seconds}$ which is $0.0042\text{ degrees} / \text{second}$. Next, we need to solve for $r$:
\[\begin{aligned} r^3 &= \frac{180^2GM}{\pi^2\omega^2 } \\ \implies r &= \left(\frac{180^2GM}{\pi^2\omega^2 }\right)^{1/3} \end{aligned}\]We know that $M = 5.97 \times 10^{24} \text{ kg}$ and $G = 6.67 \times 10^{-11}\ \text m^3\text{kg}^{-1}\text s^{-2}$,^{2} so we can plug everything in and get:
\[r \approx 42,200 \text{ km}\]Subtracting 6380 km for the radius of the Earth, we find that a geostationary orbit has an altitude of 35,800 km above the Earth’s surface (22,200 miles).
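It’s easy to double-check that arithmetic numerically. This snippet works in SI units and radians, so the 180/π factors drop out of the formula:

```python
import math

G = 6.67e-11   # universal gravitational constant, m^3 kg^-1 s^-2
M = 5.97e24    # mass of the Earth, kg
omega = 2 * math.pi / 86400  # one revolution per day, rad/s

# In radian units the balance condition reduces to r^3 = GM / omega^2.
r = (G * M / omega**2) ** (1 / 3)
altitude_km = (r - 6.38e6) / 1e3

print(round(r / 1e3), "km orbit radius")  # about 42,000 km
print(round(altitude_km), "km altitude")  # about 35,800 km
```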
In the original problem, though, we want to find an orbit around the Sun that matches the Earth’s year, so it always stays in the same relative position with the Earth. At first, this might seem like an impossibility. To keep up with the Earth, wouldn’t the object need to be in the exact spot the Earth is in?
Yes, except we can’t ignore the effect of Earth’s gravity in addition to that of the Sun–there are now three objects to keep track of: the Sun, the Earth, and the spacecraft.
We’ll start by updating our gravity-centripetal force equation to include the Sun on the left-hand (gravity) side:
\[\frac{GM_s}{r_s^2}+\frac{GM_e}{d_e^2} = \frac{\pi^2\omega^2r_s}{180^2}\]$M_s$ is the mass of the Sun, $M_e$ is the mass of the Earth, $r_s$ is the spacecraft’s distance from the Sun, and $d_e$ is the distance of the spacecraft from the Earth. If we want to do 360 degrees around the Sun in one Earth year (the same speed at which the Earth is moving), we end up with $\omega = 1.14 \times 10^{-5} \text{ degrees}/\text{second}$. We also know that $d_e = r_s - r_e$ where $r_e$ is the radius of the Earth’s orbit around the Sun:
\[\frac{GM_s}{r_s^2}+\frac{GM_e}{(r_s - r_e)^2} = \frac{\pi^2\omega^2r_s}{180^2}\]We want to solve for $r_s$, which, it turns out, is not easy. Luckily, with some handy Python code, we find
\[r_s - r_e = 1.4\times 10^6\ \text{km}\]which is about 900,000 miles. Thus, if JWST flies to this point 900,000 miles away from the Earth, it can orbit the Sun with the same period as the Earth and stay in the same relative position. It will always be able to face away from both the Sun and Earth while staying in constant communication contact (nothing will be blocking it).
import numpy as np
from scipy.optimize import fsolve

gravity_const = 6.67e-11  # m^3 kg^-1 s^-2
mass_sun = 1.99e30  # kg
mass_earth = 5.96e24  # kg
radius_earth_orbit = 1.5e11  # meters
angular_speed = 2 * np.pi / (365.25 * 86400)  # rad/s

def func(x):
    # Gravity from the Sun and the Earth minus the required
    # centripetal acceleration; the root is the balancing radius.
    return (
        gravity_const * mass_sun / x ** 2
        + gravity_const * mass_earth / (x - radius_earth_orbit) ** 2
        - (angular_speed ** 2) * x
    )

guess = radius_earth_orbit + 1e3  # start just past Earth's orbit
out = fsolve(func, guess)[0]
distance_from_earth = out - radius_earth_orbit
print(f"{distance_from_earth/1e9:0.1f} ⨉ 10⁹ m")
If you don’t have a deep learning background, everything about this will be confusing if it isn’t already.
Sorry.
Anyway, “attention” in deep learning is a very specific term that refers to a network learning what parts of the input to pay … attention to. I’ll try to explain this more in the rest of this article.
It’s actually quite easy.
(Honestly, the paper is pretty good, so kudos to the authors on that, but you do need a lot of background in machine learning!)
Now that we’ve got that out of the way, let’s get on with it!
Your job is to translate a sentence into a different language. Let’s think through a very structured way to approach this problem in the hopes that we can make a robot that replaces you. Oops, haha – don’t worry about that last part, it’s fine.
The first thing you do is take a good look at each word in the original sentence and make some notes about each one based on the words around it. With your notes in hand, you start on the translation and find a scratchpad so you can jot down things as you go.
To generate each successive word of the translation, you first need to figure out which of your notes you should use. You check your scratchpad to see what topic or concept you’ve been talking about, compare it with all your notes, and then combine the relevant notes into a little cheat sheet. This cuts down on the chances you’ll be distracted by irrelevant information.
Next, you use your cheat sheet to update the topic on the scratchpad before finally using the cheat sheet and the scratchpad to pick the next word of the translation.
The process repeats for the next word. Since you’ve kept some thoughts on the scratchpad, you can make sure the following word makes sense by pulling the relevant notes, updating the scratchpad, and writing the new word. Repeat this until the end of the sentence and boom, you’ve got your translation!
To do this with a computer, we show an AI a ton of sentences and teach it how to pick out relevant notes to build its cheat sheet, update the scratchpad, and pick out the next word based on this info. We give it feedback by comparing its translations to the real ones. Now, we can move on to formalizing the attention mechanism a bit more, but first …
You probably shouldn’t read anymore unless you’ve taken a machine learning class. Really. I won’t think any less of you. This is more like the fine print insurance section anyway so the machine learning hawks don’t accuse me of oversimplifying.^{1} Plus, you’ve already got plenty to pizzazz your relatives with your knowledge of AI at the next family get-together. Stop while you’re ahead. Also, there’s math, so watch out for that.
… aaaaaand if you’re still here, buckle up for some step-by-step instructions and deep learning jargon:
Here’s a quick diagram that shows all the connections. The dotted lines mean the previous value of the variable, and the arrows pointing into a variable indicate the quantities used to calculate the next value.
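To connect the analogy back to actual tensors, here is a minimal numpy sketch of one attention step: the “notes” play the role of the encoder annotations, the “scratchpad” is the decoder state, and the “cheat sheet” is the context vector. The shapes and the dot-product scoring are my simplifications; real models learn a small network to produce the scores.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
notes = rng.normal(size=(6, 8))   # one annotation per source word
scratchpad = rng.normal(size=8)   # current decoder state

# Score every note against the scratchpad, normalize into attention
# weights, then blend the notes into a single context vector:
scores = notes @ scratchpad
weights = softmax(scores)         # sums to 1: attention paid to each note
cheat_sheet = weights @ notes

print(weights.round(2))
print(cheat_sheet.shape)  # (8,)
```

The softmax is what makes this “attention”: it forces the model to spread a fixed budget of focus across the notes instead of using all of them equally.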
I hope this was a gentle introduction that removes some of the shroud of mystery around attention in deep learning, especially in the context of machine translation. I skimmed over the implementation details here, but this should give you the basic idea. For more details, check out the paper I linked to near the top!
It is difficult for the human mind to comprehend how wonderful it is to have all your training runs fully logged and backed up, ready for generating figures or testing trained networks. Take it a step further and find an online service that backs it up remotely and shows you live plots during training. Try neptune.ai if you’re not sure where to start.
Testing locally with a CPU is super helpful, and building in GPU support will ensure you can drop your code on a remote GPU machine (e.g., from your university) and take advantage of the extra power. If you don’t have access to GPU resources, there are some online options you could try. You might be tempted to use notebooks, but don’t. That includes Google Colab, even with the allure of free GPUs. Let us enumerate the reasons:
GPUs. If you are a student or faculty member, your university likely has a computing center with remote GPU access (possibly for free!). You can’t run a notebook on these easily, if at all. Don’t leave a several-thousand-dollar GPU on the table just because you don’t like .py files.
Stares down Google Colab with malice:
Jupyter Notebooks are good for exactly three things:
That’s it.
See previous discussion under my notebook rant. If you could conceivably want to tweak a parameter, make it a command line argument.
See previous discussion under said notebook rant. When used with GitHub, you get code backup and change tracking. Use branches. Ruin your model by trying to change up the architecture? git checkout before-i-messed-it-up.
PyTorch is nice for customizing models and training loops. The simplicity you can get from Keras (and therefore TensorFlow) might seem cool, but why not have PyTorch and that convenience? PyTorch Lightning is awesome and makes logging, training, testing, checkpoint loading, GPU usage, and really just about everything so much easier. Check it out!
I use black, and VS Code formats my file every time I hit save. This does wonders for keeping your code organized, readable, and consistent. It makes me like looking at my code.
Don’t go overboard, since your code is a living document, but if you have some utils or cryptic function names and argument lists, it doesn’t hurt to throw in a Google-style docstring comment:
def my_cool_function(
    out: torch.Tensor,
    target: torch.Tensor,
    flux_factor: float,
    gravity_scaling: float,
):
    """Computes the potential energy difference of a
    semi-massive charged particle on a moon with gravity
    given as a multiple of Europa's with a fudge factor
    to account for the magnetic field flux of who knows what.

    Args:
        out (torch.Tensor): Model prediction.
        target (torch.Tensor): Target values.
        flux_factor (float): Fudge factor to account
            for magnetic field.
        gravity_scaling (float): Ambient gravitational
            acceleration normalized to that of Europa
    """
    raise RuntimeError("This is a useless function")
With a handy IDE, your documentation will pop up whenever you call your function so you don’t accidentally pass in gravity normalized by Titan instead of Europa (as one does).
Use the typing library and Python’s support for type annotations. Besides making code easier to read, this will let the IDE flag any possible problems. Let it help you.
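For instance, a sketch like this (the function and its names are invented) lets the IDE or a type checker catch a bad call before you run anything:

```python
from typing import Optional

def scale(values: list[float], factor: Optional[float] = None) -> list[float]:
    """Multiply each value by `factor`; no scaling if factor is None."""
    if factor is None:
        return list(values)
    return [v * factor for v in values]

print(scale([1.0, 2.0], factor=3.0))  # [3.0, 6.0]
# scale([1.0, 2.0], factor="oops")  # flagged by the type checker, not at runtime
```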
To be fair, I’ve been rather direct in my criticism for effect. However, even if you don’t agree on every point, make sure you continue to explore ways to improve your code organization and management along with workflow efficiency!
Let me reiterate. This is the correct way to write a numeric date. All other ways are wrong. You will also believe that by the time you are done reading this.
First, though, what is this format exactly?
Americans, I have some bad news. There is an international standards organization, creatively called the International Organization for Standardization (ISO), and it is not in the United States. It is in Switzerland,^{1} Geneva to be precise. In 1988, ISO published a specification of an international standard for writing numeric dates. Although there is a little more nuance to what’s allowed, in general, an ISO 8601 date looks like
2021-04-24T17:43:32.123-04:00
Let’s break it down:

2021-04-24: the date, formatted as YYYY-MM-DD. Technically, you can also use YYYYMMDD, but this is much less readable.

T: a delimiter to separate the date from the time.

17:43:32.123: the time, formatted as hh:mm:ss.sss. If you don’t need to be that precise, you can also use hh:mm or hh:mm:ss. You could omit the colons, but again, that seems less readable.

-04:00: the time zone offset from UTC (Greenwich Mean Time, a.k.a. London time). You could also write this as ±hhmm or ±hh, replace it with Z to indicate UTC itself, or leave it off entirely to keep the time local.

The format is flexible, too. I could drop everything after the T to get 2021-04-24, which is a valid ISO date in its own right. Or I could trim down the time precision and stick to local time and get 2021-04-24T17:53.

If you name files starting with YYYY-MM-DD, they will sort themselves by date automatically. Obviously, you won’t need this for everything, but if you want to name blog posts, reports, invoices, letters, etc., it comes in handy.

But most importantly …
I can’t stress this enough. If there is a standard that does what you are trying to do, then please please use it. Here are a few reasons:
Of course, I have a software engineering slant to this post, but the same idea applies to any project. Pick a standard file format for project management reports.^{2} Write personal notes in Markdown. Keep track of metadata in YAML files. Publish digital books with EPUB3 (which supports audio!). All that could seem a bit intimidating at first, so just start small, and remember that today is not 4/24/2021, 2021/04/24, 04242021, or 04.24.21 – it’s 2021-04-24.
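Conveniently, Python’s standard library already speaks ISO 8601, so sticking to the standard costs nothing:

```python
from datetime import datetime, timezone, timedelta

eastern = timezone(timedelta(hours=-4))
stamp = datetime(2021, 4, 24, 17, 43, 32, 123000, tzinfo=eastern)

print(stamp.isoformat())         # 2021-04-24T17:43:32.123000-04:00
print(stamp.date().isoformat())  # 2021-04-24

# ISO-prefixed file names sort chronologically as plain strings:
files = ["2021-04-24-post.md", "2020-12-31-post.md", "2021-01-02-post.md"]
print(sorted(files))
```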