FPGAs are effectively a collection of gates, flops, memory, and interfaces, which I will collectively refer to as resources. Depending on the vendor, various architectures and device sizes provide different quantities of those resources at different price points. We are talking about devices that range from as small as a hundred gates or flops and no memory up to devices with hundreds of thousands of flops and megabits of memory.
Resource scale will determine your ultimate parallelization scale. First work out how many resources a single processing instance requires in terms of memory, flops, gates, etc. That single-instance resource count can then be used as the denominator in a ratio with the overall resource counts of a particular device. In practice, FPGAs become harder to synthesize the fuller they get, so leave some headroom in the calculation. So far the calculation looks like this:
    total_parallel_instances = (total_resources - reserved_resources) / single_instance_resources
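As a rough illustration, here is a minimal Python sketch of that sizing calculation. All the resource numbers are made up, and the 0.8 derate factor is an assumption standing in for the synthesis headroom mentioned above:

    # Rough sizing with made-up resource numbers; the 0.8 derate reflects
    # the rule of thumb that very full FPGAs become hard to synthesize.
    def max_parallel_instances(total, reserved, per_instance, derate=0.8):
        """Usable instance count for one resource type (LUTs, flops, BRAM, ...)."""
        usable = (total - reserved) * derate
        return int(usable // per_instance)

    # A design is limited by whichever resource type runs out first.
    luts  = max_parallel_instances(total=100_000, reserved=10_000, per_instance=4_000)
    flops = max_parallel_instances(total=200_000, reserved=15_000, per_instance=6_000)
    brams = max_parallel_instances(total=300,     reserved=20,     per_instance=12)

    print(min(luts, flops, brams))  # the scarcest resource sets the parallelism

The min() at the end reflects that you run this ratio per resource type, and the scarcest one wins.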
There are other limits on parallelization as well, described below, that will affect what level of parallelism can realistically be achieved.
You ask about the number of ANDs that can be run. If an FPGA has 100 logic blocks, and each block has 2 configurable gates that can each be implemented as an AND, then you can have 200 AND gates. If that isn't enough for a particular application, a larger device with more logic blocks can be selected. The real limiting factors here are how much you can spend on the device and how much board real estate you can afford. If cost and space are not a factor, you can have millions of ANDs if need be.
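Turned around, the same arithmetic tells you how big a device you would have to shop for; a toy sketch, assuming the hypothetical 2-gates-per-block architecture above:

    # Toy check: how many logic blocks would a target AND count need,
    # assuming the hypothetical 2-configurable-gates-per-block device above?
    target_ands = 1_000_000
    gates_per_block = 2
    blocks_needed = -(-target_ands // gates_per_block)  # ceiling division
    print(blocks_needed)  # 500000 blocks -- pick a device (and budget) to match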
Approaching your question as an FPGA architect, I would point out that the FPGA device has to interface to your overall system. If it is part of an x86 system, the most traditional mechanism would be a PCIe interface. Your design could be different, but the interface is an important limiting factor: your data rate in and out is bounded by it, so your parallelism doesn't need to be greater than what the interface can support. Additionally, you have to factor in how long your processing takes.
Let's say a single instance of a processing mechanism takes 100 clocks to produce a result, and moving one bundle of data into the processor takes 50 clocks. As a result, 2 bundles of data can be transferred in the time it takes to process one bundle. Discarding any other inefficiencies, only 2 processing instances are required to keep up with the interface. Adding a third processing instance in this situation would leave an idle processing instance hanging around 33% of the time.
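A quick sketch of that arithmetic, using the 100-clock and 50-clock numbers from the example above:

    import math

    # Numbers from the example: 100 clocks to process one bundle,
    # 50 clocks to move one bundle into the FPGA.
    processing_clocks = 100
    transfer_clocks = 50

    # Instances needed so processing keeps pace with the interface.
    instances = math.ceil(processing_clocks / transfer_clocks)
    print(instances)  # 2 -- a third instance would just sit idle

    # Utilization if you over-provision anyway:
    for n in (2, 3):
        busy = min(processing_clocks / (n * transfer_clocks), 1.0)
        print(n, f"instances -> {busy:.0%} busy each")  # 3 -> ~67% busy (33% idle)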
Parallel mechanisms that work with packetized data may use DMA-based technology to move information into and out of an FPGA design. There is overhead associated with the DMA engine and its software, and that overhead reduces the interface's effective data rate relative to data size; it is incurred each time a DMA transfer starts. Bundling up enough data to feed several parallel FPGA processing instances in a single transfer results in only one overhead hit, thus raising the efficiency.
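A rough model of that amortization. The 10 microsecond setup cost and 8 Gbps wire rate are illustrative assumptions, not figures for any particular DMA engine:

    # Effective DMA throughput with a fixed per-transfer setup cost.
    # The 10 us overhead and 8 Gbps wire rate are illustrative assumptions.
    def effective_rate(payload_bytes, bundles, overhead_us, wire_rate_gbps=8.0):
        total_bytes = payload_bytes * bundles
        wire_us = total_bytes * 8 / (wire_rate_gbps * 1_000)  # bits over bits-per-us
        return total_bytes / (wire_us + overhead_us)          # bytes per microsecond

    one_per_transfer = effective_rate(payload_bytes=4096, bundles=1,  overhead_us=10)
    bundled          = effective_rate(payload_bytes=4096, bundles=16, overhead_us=10)
    print(f"{one_per_transfer:.0f} vs {bundled:.0f} bytes/us")  # ~291 vs ~868

Paying the setup cost once for 16 bundles nearly triples the effective rate in this toy example.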
That all said, if the data you are moving is very small, the overhead high enough, and the processing time short enough, you will find that doing the job in software is actually faster. But when the processing time is long relative to the transfer overhead, that is when scaling up the parallelism has more benefit.
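A simple break-even sketch of that tradeoff, with all timings hypothetical:

    # Offload pays off only when the software time saved exceeds the cost
    # of getting data in and out. All timings here are hypothetical.
    def faster_in_hardware(sw_us, hw_us, transfer_us, overhead_us):
        return (transfer_us + overhead_us + hw_us) < sw_us

    print(faster_in_hardware(sw_us=5,   hw_us=1,  transfer_us=2,  overhead_us=10))  # False: tiny job, stay in software
    print(faster_in_hardware(sw_us=500, hw_us=50, transfer_us=20, overhead_us=10))  # True: long job amortizes the overhead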