>The only type of labeled training images that I've seen is when the entire image is labeled with an entity (ex: "dog", or "cat").
In the domain of machine learning, this is called "classification".
Detection can be seen as predicting bounding boxes plus classifying the corresponding cropped areas. You can imagine an algorithm that takes every possible crop of an image and feeds each one to a classifier. In theory this would work, but in practice the processing time would be far too long. Some of the first detection neural networks (like R-CNN) had a module whose job was to propose regions to the classifier, which kept the processing time manageable.
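To make the "classify every crop" idea concrete, here is a minimal sketch of that naive sliding-window approach. The window size, stride, and `dummy_classifier` are placeholders I made up for illustration; a real system would use a trained model and far smarter region proposals.

```python
# Illustrative sketch only: detection as "classify every crop".
# dummy_classifier is a stand-in for a trained image classifier.
import numpy as np

def dummy_classifier(crop):
    """Placeholder classifier: returns (label, confidence) for a crop."""
    return "dog", float(crop.mean() > 128)  # arbitrary toy decision rule

def sliding_window_detect(image, window=64, stride=32, threshold=0.5):
    """Slide a fixed-size window over the image and classify each crop.
    Returns a list of (x, y, w, h, label, score) boxes above the threshold."""
    detections = []
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            label, score = dummy_classifier(crop)
            if score >= threshold:
                detections.append((x, y, window, window, label, score))
    return detections

if __name__ == "__main__":
    fake_image = np.random.randint(0, 256, size=(256, 256), dtype=np.uint8)
    boxes = sliding_window_detect(fake_image)
    print(f"{len(boxes)} candidate boxes")  # many crops -> many classifier calls
```

Even on this small toy image you get dozens of classifier calls; scaling the window and stride to cover all plausible object sizes is what makes the brute-force version impractical.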
YOLO is even faster because it does both tasks in a single convolutional neural network. I think it is worth knowing how YOLO works, but the subject is too long to describe here. May I suggest watching the videos from https://www.coursera.org/learn/convolutional-neural-networks (week 3)? I think you have to subscribe to access the videos, but it is free, and it is very well explained.
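To give a flavour of the single-pass idea: one forward pass produces a grid where each cell predicts a box, an objectness score, and class probabilities, and decoding that grid gives you the detections. The grid size, class list, and output layout below are my own simplified assumptions, not the exact YOLO format (which the Coursera videos cover properly).

```python
# Rough sketch of decoding a YOLO-style grid output (simplified layout, not the
# real YOLO tensor format): each cell holds (x, y, w, h, objectness, class probs).
import numpy as np

S, C = 7, 3  # assumed grid size and number of classes
CLASSES = ["dog", "cat", "person"]  # hypothetical class list

def decode_grid(output, threshold=0.5):
    """Turn an (S, S, 5 + C) prediction grid into a list of detections."""
    detections = []
    for row in range(S):
        for col in range(S):
            x, y, w, h, objectness = output[row, col, :5]
            if objectness < threshold:
                continue
            class_probs = output[row, col, 5:]
            label = CLASSES[int(np.argmax(class_probs))]
            detections.append((x, y, w, h, label, float(objectness)))
    return detections

if __name__ == "__main__":
    fake_output = np.random.rand(S, S, 5 + C)  # stands in for a network's output
    print(decode_grid(fake_output)[:3])
```

The point is that the boxes and class labels come out of a single network evaluation, instead of running a classifier once per candidate region.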