Hardware And Software Optimizations For Deep Learning Workloads On Graphics Processing Units