To define a view transformation, you need to calculate an orthonormal basis. This is a set of basis vectors that are perpendicular to each other and that have unit length.
The unit length criterion can be easily achieved by normalizing the vectors.
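The cross product is used to make the vectors orthogonal. As you already mentioned, the x-vector is calculated first from the view direction and the up vector, and afterwards the up vector is recalculated with a second cross product so that it is perpendicular to both the view direction and the x-vector.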
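The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any particular library: the helper names (cross, normalize, view_basis) are made up for this example, and the operand order of the first cross product assumes a left-handed convention so that it agrees with up = cross(direction, right). In a right-handed setup the operands are swapped.

```python
import math

def cross(a, b):
    # Standard 3D cross product of two 3-tuples.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    # Scale v to unit length (fails for a zero vector).
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def view_basis(direction, up_hint):
    # Hypothetical helper: builds (right, up, forward).
    forward = normalize(direction)
    # Assumed left-handed operand order: right = up x forward.
    right = normalize(cross(up_hint, forward))
    # Recalculated up vector; already unit length because
    # forward and right are perpendicular unit vectors.
    up = cross(forward, right)
    return right, up, forward

right, up, forward = view_basis((0.0, -1.0, 2.0), (0.0, 1.0, 0.0))
```

Here the camera looks slightly downwards, so the recalculated up vector tilts forward instead of staying at (0, 1, 0), while the right vector stays on the x-axis.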
The x-axis (right direction) always depends on the view direction and the up vector. If we assume an up vector of (0, 1, 0), then the x-axis depends only on the view direction. Here are some examples (the view direction always lies in the same plane):

You see that the right vector flips when the view direction passes through the up vector and ends up on its other side. That's the reason why the calculation fails if the up vector and the view direction are parallel: the cross product is zero, so there is no way to determine where the right vector should point. If you render a scene with these vectors, you will notice that the image flips horizontally when the view direction passes the up vector.
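The flip and the degenerate case can be reproduced numerically. The following sketch is an illustration under assumptions: the function names are made up, the up vector is fixed at (0, 1, 0), the view direction is rotated in the y-z plane, and the operand order cross(up, direction) is one possible convention.

```python
import math

def cross(a, b):
    # Standard 3D cross product of two 3-tuples.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def right_vector(direction, up=(0.0, 1.0, 0.0), eps=1e-9):
    # Returns the normalized right vector, or None if up and
    # direction are (almost) parallel and the cross product vanishes.
    r = cross(up, direction)
    length = math.sqrt(sum(c * c for c in r))
    if length < eps:
        return None
    return tuple(c / length for c in r)

def direction_at(angle_deg):
    # View direction in the y-z plane; 0 degrees is exactly the up vector.
    a = math.radians(angle_deg)
    return (0.0, math.cos(a), math.sin(a))

# Slightly on one side of up, exactly on up, slightly on the other side.
for angle in (10, 0, -10):
    print(angle, right_vector(direction_at(angle)))
```

On one side of the up vector the right vector points along +x, on the other side along -x, and exactly at 0 degrees the function returns None because the cross product has zero length.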
If you want to specify a view direction parallel to the y-axis, you have to provide a non-parallel up vector. With this it is possible to calculate a reasonable right vector.
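One way to provide such a non-parallel up vector automatically is to test for the parallel case and switch to a second axis. This is just a sketch: pick_up_hint is a hypothetical helper, and both the fallback axis (0, 0, 1) and the threshold are arbitrary assumptions. Which fallback makes sense depends on how the camera should be oriented when looking straight up or down.

```python
import math

def cross(a, b):
    # Standard 3D cross product of two 3-tuples.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def length(v):
    return math.sqrt(sum(c * c for c in v))

def pick_up_hint(direction,
                 default_up=(0.0, 1.0, 0.0),
                 fallback_up=(0.0, 0.0, 1.0),  # assumed fallback axis
                 eps=1e-6):                    # assumed threshold
    # For a unit-length default_up, |cross(up, d)| / |d| is the sine of
    # the angle between them, so a tiny value means "almost parallel".
    if length(cross(default_up, direction)) < eps * length(direction):
        return fallback_up
    return default_up

print(pick_up_hint((0.0, -1.0, 0.0)))  # looking straight down
print(pick_up_hint((0.0, 0.0, 1.0)))   # looking along the z-axis
```

Keep in mind that switching the up vector abruptly makes the camera roll jump at the switch, so for interactive cameras it is usually better to keep track of the up vector explicitly.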
Of course, you can calculate the up vector if you want to fix a certain right vector with up = cross(direction, right).
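As a small sketch of this variant, the following Python code builds the basis from a view direction and a desired right vector. The helper names are made up for this example. It also assumes that the given right vector may not be exactly perpendicular to the view direction, so it is recalculated at the end with the matching operand order right = cross(up, direction).

```python
import math

def cross(a, b):
    # Standard 3D cross product of two 3-tuples.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    # Scale v to unit length (fails for a zero vector).
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def basis_from_right(direction, right_hint):
    # Hypothetical helper: right_hint must not be parallel to direction.
    forward = normalize(direction)
    # up = cross(direction, right), as in the formula above.
    up = normalize(cross(forward, right_hint))
    # Recalculate right so that all three vectors are perpendicular.
    right = cross(up, forward)
    return right, up, forward

# Looking straight down the y-axis with the right vector fixed to +x.
right, up, forward = basis_from_right((0.0, -1.0, 0.0), (1.0, 0.0, 0.0))
```

For this input the up vector comes out along the z-axis, which is a well-defined result even though the view direction is parallel to (0, 1, 0).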