In the paper, Knuth describes how the strategy was chosen:
Table 1 was found by choosing at every stage a test pattern that minimizes the maximum number of remaining possibilities, over all conceivable responses by the codemaker. If this minimum can be achieved by a “valid” pattern (a pattern that makes “four black hits” possible), a valid one should be used. Subject to this condition, the first such test pattern in numeric order was selected. Fortunately this procedure turns out to guarantee a win in five moves.
So it is to some extent a greedy strategy (trying to make the most progress at each step, rather than overall), and moreover there's an ad-hoc tie-breaking strategy. This means that it need not be optimal in expected value, and indeed Knuth says exactly that:
The strategy in Table 1 isn’t optimal from the “expected number of moves” standpoint, but it is probably very close. One line that can be improved [...]
So already at the time the paper was published, Knuth was aware that it's not optimal and even had an explicit example.
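The selection rule in the first quotation can be sketched in a few lines of Python. This is only an illustration of one step under my own assumptions, not Knuth's program: the names `feedback`, `worst_case` and `choose_guess` are made up here, and the six colours are written as the digits 1 to 6 so that "numeric order" is plain string order.

```python
from collections import Counter
from itertools import product

COLORS = "123456"  # assumption: six colours encoded as digits
ALL_CODES = ["".join(p) for p in product(COLORS, repeat=4)]  # 6^4 = 1296 codewords

def feedback(guess, code):
    """Return (black, white) hits for a guess against a hidden code."""
    black = sum(g == c for g, c in zip(guess, code))
    # Colours in common regardless of position; whites are those minus the blacks.
    common = sum(min(guess.count(x), code.count(x)) for x in COLORS)
    return black, common - black

def worst_case(guess, candidates):
    """Largest number of candidates that can remain after any response to guess."""
    return max(Counter(feedback(guess, c) for c in candidates).values())

def choose_guess(candidates):
    """One greedy step: minimize the worst case, prefer valid patterns, then numeric order."""
    valid = set(candidates)
    return min(ALL_CODES, key=lambda g: (worst_case(g, candidates), g not in valid, g))
```

Called on the full list, `choose_guess(ALL_CODES)` should return `1122`, the opening guess of Knuth's Table 1 (worst case 256 remaining codewords). It takes a few seconds in plain Python, since it scores all 1296 guesses against all 1296 codewords.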
When this paper was republished in his collection Selected Papers on Fun and Games (2010), he added a 5-page addendum to the 6-page paper. In this addendum, he mentions randomization in the very first paragraph, and then discusses the question of minimizing the expected number of moves. Measuring each strategy by the total number of moves it makes over all 1296 possible codewords, he gives these results, citing a few papers:
His original algorithm gave 5801 (average of 5801/1296 ≈ 4.47608), and the minor improvement gives 5800 (≈ 4.4753).
Robert W. Irving, “Towards an optimum Mastermind strategy,” Journal of Recreational Mathematics 11 (1978), 81-87 [achieves 5664 ⇒ ≈4.37, while staying within the “at most 5” limit]
E. Neuwirth, “Some strategies for Mastermind,” Zeitschrift für Operations Research 26 (1982), B257-B278 [achieves 5658 ⇒ ≈4.3657]
Kenji Koyama and Tony W. Lai, “An optimal Mastermind strategy,” Journal of Recreational Mathematics 25 (1993), 251-256 [achieves 5626 ⇒ ≈4.34104938]
The last of these is the best possible, as it was found with an exhaustive depth-first search. (Note that all of these papers can do slightly better in the expected number of moves, if you allow them to take 6 moves sometimes... I gave the numbers with the “at most 5” constraint because that's what the question here asks for.)
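The averages quoted above are just each total divided by 6^4 = 1296, since every codeword is taken to be equally likely. A small check, with labels that are my own shorthand for the entries in the list:

```python
NUM_CODES = 6 ** 4  # 1296 equally likely codewords

# Total moves over all codewords, as quoted in the list above.
TOTALS = {
    "Knuth, Table 1": 5801,
    "Knuth, improved line": 5800,
    "Irving": 5664,
    "Neuwirth": 5658,
    "Koyama and Lai": 5626,
}

def average_moves(total):
    """Expected number of moves when the codeword is uniformly random."""
    return total / NUM_CODES

for name, total in TOTALS.items():
    print(f"{name}: {total}/{NUM_CODES} = {average_moves(total):.5f}")
```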
You can make this more general (harder) by assuming the codemaker is adversarial and does not choose uniformly at random among the 1296 possible codewords, but instead according to whichever distribution makes it hardest for the codebreaker. Finally, Knuth mentions a lot of work done by Tom Nestor, which conclusively settles many such questions.
You might have fun trying to follow up or reproduce these results (e.g. write the exhaustive search program). Enjoy!