Step 1 - prepare the data in a format that the GPT series can consume. For question answering, you typically need at least three things: "context", "query", and "response". The trick, however, is how to combine these and convert them into something transformers can work with (see the sketch below).
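Here's a minimal sketch of one such record and how it might pass through the GPT-2 tokenizer. The field names and example text are illustrative assumptions, not a fixed schema:

```python
from transformers import GPT2TokenizerFast

# One raw QA record; field names and text are illustrative assumptions.
record = {
    "context": "GPT-2 was released by OpenAI in 2019 ...",
    "query": "Who released GPT-2?",
    "response": "OpenAI",
}

# Transformers consume token ids, not strings, so every field eventually
# has to go through a tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
context_ids = tokenizer(record["context"])["input_ids"]
print(context_ids[:10])
```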
Step 2 - experiment with combinations like "query" + "context" as INPUT and "response" as TARGET. Note that you may also need to add a SEPARATOR between the query and the context, just so the model understands that these are two different things - at least that's the usual approach for tackling the SQuAD dataset (sketched below). Theoretically, there's one more approach where the "query" is the Q and the "context" is split into multiple lines, each paired with the same "query". The reason I think this can work is that in self-attention, Q and K are projections of the same input, BUT when you think logically, the "query" can be matched against one of many "keys", and those keys all come from the same context. If this is too confusing, please read up on the basics of transformers (Jay Alammar's blog is brilliant).
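For example, here's a hedged sketch of the separator idea, assuming an ad-hoc `<|sep|>` token (any string the tokenizer doesn't already use would do):

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# "<|sep|>" is an arbitrary choice; registering it as a special token keeps
# the tokenizer from splitting it into sub-words. (If you add tokens, remember
# to call model.resize_token_embeddings(len(tokenizer)) on the model side.)
SEP = "<|sep|>"
tokenizer.add_special_tokens({"additional_special_tokens": [SEP]})

def build_pair(query, context, response):
    """INPUT = query + SEP + context, TARGET = response."""
    source_ids = tokenizer(f"{query} {SEP} {context}")["input_ids"]
    target_ids = tokenizer(response)["input_ids"]
    return source_ids, target_ids

src, tgt = build_pair(
    "Who released GPT-2?",
    "GPT-2 was released by OpenAI in 2019 ...",
    "OpenAI",
)
print(len(src), len(tgt))
```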
Step 3 - decide whether you want to copy the weights of a pre-trained GPT model or train from scratch (both options are shown below).
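With the HF transformers API, the two options look like this:

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Option A: start from the pre-trained weights (usually the better choice).
model_pretrained = GPT2LMHeadModel.from_pretrained("gpt2")

# Option B: same architecture, randomly initialised, trained from scratch.
model_scratch = GPT2LMHeadModel(GPT2Config())
```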
Step 4 - GPT-2, at least, whose API is available on HF, produces an output sequence as long as the input sequence (since the original task is next-token prediction, it emits a prediction for every position in the input, so the output length always tracks the input length). This is going to be a major problem for question answering, since the answer is almost NEVER longer than the context. Please do NOT confuse this to mean that the answer can't be longer than the question. Answers can certainly be longer than the query; they just can't be longer than the "context" from which your model is picking the answer.
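You can see the shape issue directly with a quick check (standard HF API, nothing assumed beyond the model name):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Who released GPT-2?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# logits has shape (batch, input_seq_len, vocab_size): one prediction per
# INPUT position, which is exactly the mismatch described above.
print(out.logits.shape)
```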
Having gotten that out of the way, you will need a final layer to reduce the dimensionality after the last linear / dense layer. I have used a Conv1D layer with limited success. The final dimension should equal the expected answer length (e.g., if the total input length / block size / sequence length is 1024 tokens and the answer you are expecting is limited to 256 tokens, then your final layer should output something with a dimensionality of 256 - see the sketch below).
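A rough sketch of that kind of "squeeze" head - treating the sequence axis as the channel axis of a torch `Conv1d` is one of several ways to get from 1024 positions down to 256, and the sizes here are just the example numbers from above, not requirements:

```python
import torch
import torch.nn as nn

BLOCK_SIZE, ANSWER_LEN, HIDDEN = 1024, 256, 768  # example sizes only

class AnswerHead(nn.Module):
    def __init__(self):
        super().__init__()
        # kernel_size=1 means the conv only mixes across positions
        # (the channel axis here), not across hidden features.
        self.squeeze = nn.Conv1d(BLOCK_SIZE, ANSWER_LEN, kernel_size=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, 1024, 768) -> (batch, 256, 768)
        return self.squeeze(hidden_states)

head = AnswerHead()
dummy = torch.randn(2, BLOCK_SIZE, HIDDEN)
print(head(dummy).shape)  # torch.Size([2, 256, 768])
```

From there you'd still project each of the 256 positions to vocabulary logits with the usual LM head.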
Final step - prayers :) ... sadly, there isn't much in the open-source literature that explains exactly how GPT models got so good at answering queries (even the InstructGPT paper is vague on the details).