So the most basic, tautological answer is that you rotoscope it, which means different things depending on the footage:
If you have arbitrary footage, e.g. Jimi Hendrix archival footage, then you don't have many options for working intelligently: the colors are muted, the quality is low, and the footage was never shot with rotoscoping in mind. For that footage, you go frame by frame, keyframing a mask for each one.
If the source video is good quality, you can do some operations to help. For the guitar, you can try motion tracking it so that you only need to adapt your mask when the guitar tilts and effectively changes shape. You can do the same thing with individual body parts, so that your frame-by-frame work is limited to adapting the mask's shape rather than moving the mask and then adapting it. Breaking the task into a few passes (head, torso, arms, legs) is often faster than trying to mask everything at once, since the change in any one feature will be smaller than the overall change in pose.
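To make the tracking idea concrete, here is a minimal sketch in plain Python. It is not how any particular compositing package does it, and every name in it (`track`, `shape_keys`, `mask_at`) is made up for illustration. It assumes the tracker gives you one (x, y) position per frame, and that you store the mask shape relative to that tracked point, so you only add a shape keyframe when the guitar actually tilts.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def shape_at(shape_keys, frame):
    """Interpolate the mask shape between the hand-set shape keyframes."""
    frames = sorted(shape_keys)
    if frame <= frames[0]:
        return list(shape_keys[frames[0]])
    if frame >= frames[-1]:
        return list(shape_keys[frames[-1]])
    for f0, f1 in zip(frames, frames[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)
            return [
                (lerp(x0, x1, t), lerp(y0, y1, t))
                for (x0, y0), (x1, y1) in zip(shape_keys[f0], shape_keys[f1])
            ]


def mask_at(shape_keys, track, frame):
    """Mask in image coordinates: interpolated shape moved to the tracked position."""
    tx, ty = track[frame]
    return [(x + tx, y + ty) for x, y in shape_at(shape_keys, frame)]


# Hypothetical data: the tracker output for three frames (one point per frame).
track = {0: (100, 200), 1: (104, 198), 2: (110, 195)}

# Mask vertices relative to the tracked point, keyed by hand only on frames 0 and 2.
shape_keys = {
    0: [(-10, -5), (10, -5), (10, 5), (-10, 5)],
    2: [(-10, -7), (10, -3), (10, 7), (-10, 3)],
}

print(mask_at(shape_keys, track, 1))
```

The track handles the movement on every frame, and the two shape keyframes handle the tilt, so frame 1 needs no hand work at all.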
If you have the opportunity to shoot your own footage, then you can save yourself a few weeks of work by shooting intelligently. If you want to isolate the guitar player and the guitar, you can do this by covering the guitar in blue painter's tape and shooting against a solid color background. That way you can do a chroma key plus some rough masking by hand. Certain elements of the guitar, like the strings, will key badly, so I would add a couple of tracking points to the guitar so that you can add a rough mask covering the strings, fretboard, and whammy bar.
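As a rough illustration of "chroma key plus a rough mask", here is a toy version in plain Python. It is only a sketch under some assumptions: the image is a nested list of RGB tuples, the key is a simple RGB distance threshold (real keyers are much smarter), and the hand-drawn rough mask is reduced to a rectangle. The names `chroma_matte`, `tolerance`, and `keep_rect` are invented for the example.

```python
def chroma_matte(image, key, tolerance, keep_rect=None):
    """Build an alpha matte: 0 where a pixel matches the key color, 255 elsewhere.

    keep_rect is an optional (x0, y0, x1, y1) rough mask, with x1 and y1
    exclusive. Pixels inside it are forced to 255 regardless of the key,
    which is how you rescue areas that key badly.
    """
    matte = []
    for y, row in enumerate(image):
        matte_row = []
        for x, pixel in enumerate(row):
            # Squared Euclidean distance in RGB between the pixel and the key color.
            dist_sq = sum((c - k) ** 2 for c, k in zip(pixel, key))
            alpha = 0 if dist_sq <= tolerance ** 2 else 255
            if keep_rect is not None:
                x0, y0, x1, y1 = keep_rect
                if x0 <= x < x1 and y0 <= y < y1:
                    alpha = 255
            matte_row.append(alpha)
        matte.append(matte_row)
    return matte


# Hypothetical 3x2 frame: a solid background color, one clearly different
# foreground pixel, and one pixel that is close enough to the background
# color to be keyed out by mistake.
background = (0, 200, 0)
image = [
    [background, (180, 60, 40), background],
    [background, (10, 190, 20), background],
]

key_only = chroma_matte(image, background, tolerance=60)
with_mask = chroma_matte(image, background, tolerance=60, keep_rect=(1, 1, 2, 2))
print(key_only)
print(with_mask)
```

The key alone drops the problem pixel, and the rough mask puts it back. In practice you would attach that mask to the tracking points on the guitar so it follows along on its own.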