On the optimization and generalization of self-attention models: a stability and implicit bias perspective