Change the Ylm and Jl kernels back to being single-atom only
This reduces the amount of faff required for endianness compatibility and makes the functions simpler.
To feed data to alm_add_many_atoms, invoke the Ylm and Jl kernels several times on separate sub-buffers of the same region.
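
A minimal sketch of that calling pattern, under stated assumptions: the names ylm_single_atom, jl_single_atom and fill_many_atoms, their signatures, and the one-sub-buffer-per-atom layout are hypothetical, and the kernel bodies are dummies that only make the buffer layout visible. The real kernels and the alm_add_many_atoms signature are not shown in this message, so that call is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder single-atom kernels (hypothetical signatures). Each writes
 * `n` values for one atom into `out`; the bodies are dummies, not real
 * spherical-harmonic or Bessel evaluations. */
static void ylm_single_atom(size_t atom, double *out, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        out[k] = (double)(atom * 100 + k);
}

static void jl_single_atom(size_t atom, double *out, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        out[k] = -(double)(atom * 100 + k);
}

/* Fill one contiguous region per kernel by calling the single-atom kernel
 * once per atom, each time on that atom's own sub-buffer. */
static void fill_many_atoms(size_t n_atoms,
                            double *ylm_region, size_t n_ylm,
                            double *jl_region, size_t n_jl)
{
    assert(ylm_region != NULL && jl_region != NULL);
    for (size_t a = 0; a < n_atoms; ++a) {
        ylm_single_atom(a, ylm_region + a * n_ylm, n_ylm);
        jl_single_atom(a, jl_region + a * n_jl, n_jl);
    }
    /* The filled regions would then be passed to alm_add_many_atoms;
     * that call is omitted because its signature is not given here. */
}
```

With this layout the kernels never need to know how many atoms share the region; only the caller tracks the per-atom offsets.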