Hardware/software co-design and compiler techniques for efficient hardware acceleration of dense linear algebra kernels and machine learning applications