- Energy Based Models
- There is some scalar valued energy function F(x,y)
- Measures the compatibility between x and y
- low values: y is a good prediction for x (x and y are compatible)
- training Energy based models
- parameterize F(x,y)
- get training data
- shape F(x,y) so that:
- F(x[i],y[i]) is strictly smaller than F(x[i], y) for all y different from y[i]
- F is smooth
- Two learning methods
- Contrastive methods: push down on F(x[i], y[i]) and push up on other values of y (see the sketch after this list)
- Architectural methods: build F(x,y) so that the volume of low-energy regions is limited or minimized through regularization
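- A minimal sketch of one contrastive update, assuming a small MLP energy network, a hinge/margin loss, and randomly drawn "other" values of y (all of these are illustrative choices, not prescribed by the notes):

```python
import torch
import torch.nn as nn

# Hypothetical energy network: F(x, y) returns a scalar compatibility score
# (low energy = y is a good prediction for x).
class EnergyNet(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

F = EnergyNet(x_dim=8, y_dim=4)
opt = torch.optim.Adam(F.parameters(), lr=1e-3)
margin = 1.0

# One contrastive step: push down on F(x_i, y_i), push up on F(x_i, y_other).
x, y_good = torch.randn(32, 8), torch.randn(32, 4)   # stand-in training pairs
y_other = torch.randn(32, 4)                         # "other" values of y
loss = torch.clamp(margin + F(x, y_good) - F(x, y_other), min=0).mean()
opt.zero_grad()
loss.backward()
opt.step()
```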
- Auto Encoder
- Auto-encoders are unsupervised models that encode the input data and then decode it back to the original input
- The encode-decode cycle introduces some inaccuracy, known as the reconstruction loss
- A basic auto-encoder lets the model learn a mapping for a whole region rather than a single point; it works over a continuous space
- Some methods, like DrSAE, add a regularization term in the decoding step to prevent overfitting on the input data
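- A minimal sketch of the encode-decode cycle and the reconstruction loss (layer sizes, the MSE loss, and the stand-in data are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        code = self.encoder(x)            # compress the input
        return self.decoder(code), code   # decode back toward the original input

model = AutoEncoder()
x = torch.randn(16, 784)                  # stand-in batch of inputs
x_hat, code = model(x)
reconstruction_loss = nn.functional.mse_loss(x_hat, x)  # the encode-decode inaccuracy
```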
What does regularization mean?
- When you optimize some loss function, you tend to make certain things happen as a side effect; you add a regularization term to the loss to prevent those things from happening.
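- For example, continuing the auto-encoder sketch above, one common regularization term is an L1 sparsity penalty on the code (the weight 1e-3 is an arbitrary illustration, and this is only in the spirit of sparse auto-encoders like DrSAE, not its exact formulation):

```python
# Reconstruction alone lets the code take any shape it likes; the added
# regularization term pushes against that by penalizing non-sparse codes.
sparsity_weight = 1e-3                                    # hypothetical value
loss = reconstruction_loss + sparsity_weight * code.abs().mean()
```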
Generative Adversarial Nets
- Idea
- we want to generate highly realistic samples by sampling from our model
- Problem is we can’t directly pull a sample from a complex distribution
- Use noise to create the sample
- In a GAN, the sampler (which draws the noise) comes before the generator, and the cost is determined by the discriminator
- Variational Auto-encoders have their sampler after the encoder
- GAN has two conflicting systems
- Generator
- Transforms Gaussian noise into a random sample that is hopefully close to the real data
- Mode collapse
- the generator finds fake samples that successfully fool the discriminator, then keeps producing new fakes very close to those previously successful ones
- this makes the generated data narrow and biased for several iterations, until the discriminator's loss catches up and figures out what is going on
- Discriminator
- The discriminator is trained on the real data and, every time new data comes in, predicts whether it is real or generated
- Goals
- Of course the discriminator's job is to prove the generator wrong, but over time, hopefully, the generated data will get closer to the real data and the discriminator will start guessing wrong (see the training sketch at the end of this section)
- Generator
- Downfalls
- Near the real data the cost surface has a useful slope, but everywhere else it is flat and random, which is bad from the larger view: the generator gets almost no gradient signal over most of the space
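- A minimal training-loop sketch of the two conflicting systems (network sizes, the binary cross-entropy losses, and the stand-in "real" data distribution are illustrative assumptions):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))               # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())  # sample -> P(real)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(32, 2) + 5.0      # stand-in "real" data distribution
    noise = torch.randn(32, 16)          # Gaussian noise fed to the generator
    fake = G(noise)

    # Discriminator: predict real vs. generated on every incoming sample.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator: try to make the discriminator guess wrong on the fakes.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```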
Attention
- Sometimes you run a model, and you want to make some feature more prominent. This is called paying attention to a certain factor / feature.
- Soft Attention
- each element is given a weight; the weights are like probabilities (non-negative and summing to 1), and every element contributes to the output in proportion to its weight
- c = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4, where the weights a_i sum to 1
- Hard Attention
- Pick one element and focus solely on that element for that run
- c = a_j x_j = x_j, where the chosen element has weight a_j = 1 and all other weights are 0
- Self Attention
- a = softmax(Xᵀ x) for soft attention, or argmax(Xᵀ x) (a one-hot vector) for hard attention
- The input x is multiplied by the transposed set of inputs (acting as keys), and the resulting scores are passed through softmax (or argmax); this gives the coefficients a used for soft or hard attention (see the sketch below)
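- A small numerical sketch of these formulas, using plain NumPy and storing the set of inputs X as columns (the shapes and values are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, n = 4, 5                              # d-dimensional inputs, n of them
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))              # the set of inputs, one per column
x = rng.normal(size=(d,))                # the query input

scores = X.T @ x                         # "X transpose times x": one score per element

# Soft attention: every element gets a weight in [0, 1], weights sum to 1.
a_soft = softmax(scores)
c_soft = X @ a_soft                      # c = a_1 x_1 + a_2 x_2 + ... + a_n x_n

# Hard attention: a one-hot argmax picks a single element.
a_hard = np.zeros(n)
a_hard[np.argmax(scores)] = 1.0
c_hard = X @ a_hard                      # c = x_j for the selected j
```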