No, showing the bubble after the list is not possible.
When you add a list to your response, the spoken text will always appear before the list. This is because the spoken/chat part of the conversation is handled separately from the visual part. Even if you add the response after the list in your code, where the rich response is displayed is controlled by Google.
Example:

const { List, Image } = require('actions-on-google');

// The spoken text always renders above the list
conv.ask('This is a list example.');
// Create a list
conv.ask(new List({
  title: 'List Title',
  items: {
    'SELECTION_KEY_ONE': {
      synonyms: [
        'synonym 1',
        'synonym 2',
        'synonym 3',
      ],
      title: 'Title of First List Item',
      description: 'This is a description of a list item.',
      image: new Image({
        url: 'https://storage.googleapis.com/actionsresources/logo_assistant_2x_64dp.png',
        alt: 'Image alternate text',
      }),
    },
    'SELECTION_KEY_TWO': {
      synonyms: [
        'synonym 4',
        'synonym 5',
        'synonym 6',
      ],
      title: 'Title of Second List Item',
      description: 'This is a description of a list item.',
      image: new Image({
        url: 'https://storage.googleapis.com/actionsresources/logo_assistant_2x_64dp.png',
        alt: 'Image alternate text',
      }),
    },
  },
}));
// Even though this is asked after the list, the bubble still appears above it
conv.ask('Please make your selection.');
By the look of your example, it seems as if you are trying to show the user a couple of options on the screen to steer the conversation. Are you sure Suggestion Chips wouldn't be a better fit for this? These chips are intended to give the user options and are far easier to implement than a list.
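For comparison, a minimal sketch of the chip variant using the actions-on-google Suggestions helper (the chip labels are placeholders):

const { Suggestions } = require('actions-on-google');

conv.ask('This is a list example. Please make your selection.');
// Chips render below the bubble, so the prompt and the options stay together
conv.ask(new Suggestions('First option', 'Second option', 'Third option'));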
Delaying the speech, not the bubble
If you don't want to go that way, what you could do is add a delay to the spoken text via SSML, but this would only change the experience for people using your action by voice. It wouldn't change where the speech bubble is displayed when using the Google Assistant on a phone. And for anyone using your action without a screen, this could cause confusion: the speech is being delayed for a list that is never going to show on their device.
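As a sketch, such a delay could look like this, using an SSML break element (the two-second pause is an arbitrary placeholder):

// SSML is passed as an ordinary string; <break> inserts the pause
conv.ask('<speak>' +
  'This is a list example.' +
  '<break time="2s"/>' +
  'Please make your selection.' +
  '</speak>');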
Design for a voice-first experience
In general, it is good practice to design your conversation around the voice-only part of the experience. By making your conversation dependent on a list, you limit the number of platforms you can deploy your action to. A voice-first approach to this problem could be to create an intent for each option your action supports, open your welcome intent with a generic message such as "How can I assist you?", and add a fallback intent that assists the user by speaking out the different options they can use. This can be combined with Suggestion Chips to keep the guiding visuals you want, as in the sketch below.
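A minimal sketch of that setup with the actions-on-google v2 client library (the intent names match Dialogflow's defaults; the option phrases and chip labels are placeholders):

const { dialogflow, Suggestions } = require('actions-on-google');

const app = dialogflow();

app.intent('Default Welcome Intent', (conv) => {
  // Open with a generic prompt instead of a list
  conv.ask('Welcome! How can I assist you?');
});

app.intent('Default Fallback Intent', (conv) => {
  // Speak the supported options out loud for voice-only surfaces...
  conv.ask('You can ask for the first option or the second option. Which would you like?');
  // ...and mirror them as chips on devices that have a screen
  if (conv.screen) {
    conv.ask(new Suggestions('First option', 'Second option'));
  }
});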
It is a bit more work to implement, but it gives your bot a great deal more flexibility in its conversations and in the number of platforms it can support.