TCPListener (Server) not getting accept request from Client for all clients before the server instance in ip list, when running in threads

Question

I have 4 EC2 instances and I plan to run a distributed network on them, so every instance sends data to every instance (including itself).

I first read the ip addresses from a file to a variable ip_address_clone.

Say the list is like this:

A.A.A.A
B.B.B.B
C.C.C.C
D.D.D.D

Then I try to run a server and a client for all of them in threads, so that each instance has a sender and a receiver worker active for every instance in the list (again, including itself).

thread::scope(|s| {
    s.spawn(|| {
        for _ip in ip_address_clone.clone() {
            let _result = newserver::handle_server(INITIAL_PORT + port_count);
        }
    });

    s.spawn(|| {
        let three_millis = time::Duration::from_millis(3);
        thread::sleep(three_millis);

        for ip in ip_address_clone.clone() {
            let self_ip_clone = self_ip.clone();

            let _result = newclient::match_tcp_client(
                [ip.to_string(), (INITIAL_PORT + port_count).to_string()].join(":"),
                self_ip_clone,
            );
        }
    });
});

The server code is:

use std::error::Error;
use tokio::io::{AsyncBufReadExt, AsyncWriteExt, BufReader};
use tokio::net::tcp::ReadHalf;
use tokio::net::TcpListener;

#[tokio::main]
pub async fn handle_server(port: u32) -> Result<(), Box<dyn Error>> {
    let listener = TcpListener::bind(["0.0.0.0".to_string(), port.to_string()].join(":"))
        .await
        .unwrap(); // open connection

    let (mut socket, _) = listener.accept().await.unwrap(); // starts listening
    println!("---continue---");

    let (reader, mut writer) = socket.split(); // tokio socket split to read and write concurrently

    let mut reader: BufReader<ReadHalf> = BufReader::new(reader);
    let mut line: String = String::new();

    loop {
        //loop to get all the data from client until EOF is reached

        let _bytes_read: usize = reader.read_line(&mut line).await.unwrap();

        if line.contains("EOF")
        //REACTOR to be used here
        {
            println!("EOF Reached");

            writer.write_all(line.as_bytes()).await.unwrap();
            println!("{}", line);

            line.clear();

            break;
        }
    }

    Ok(())
}

And client code is:

use std::error::Error;
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

#[tokio::main]
pub async fn match_tcp_client(address: String, self_ip: String) -> Result<(), Box<dyn Error>> {
    // Connect to a peer
    let mut stream = TcpStream::connect(address.clone()).await?;
    // Write some data.
    stream.write_all(self_ip.as_bytes()).await?;
    stream.write_all(b"hello world!EOF").await?;
    // stream.shutdown().await?;
    Ok(())
}

The problem is that I am not getting the communication I expect. The first instance I start (over ssh) receives all the data, the second one receives all data except from the first one, the third one receives all data except from the first and second ones, and so on.

Here's a log of the first instance:

Starting
execution type
nok
launched
---continue---
EOF Reached
A.A.A.Ahello world!EOF
---continue---
EOF Reached
B.B.B.Bhello world!EOF
---continue---
EOF Reached
C.C.C.Chello world!EOF
---continue---
EOF Reached
D.D.D.Dhello world!EOF

And log of second instance:

Starting
execution type
nok
launched
---continue---
EOF Reached
B.B.B.Bhello world!EOF
---continue---
EOF Reached
C.C.C.Chello world!EOF
---continue---
EOF Reached
D.D.D.Dhello world!EOF

It seems that even though I am using threads, the communication remains synchronous, and a particular instance can only get data from itself and from the IPs that come after it in ip_address_clone. You can see this from the number of times ---continue--- occurs in the second instance's log: its listener doesn't seem to accept the request from the first instance.

Jishan Shaikh · Accepted Answer · 2023-05-06T18:39:59.347

1

I think the evidence that "a node is only getting data from itself" strongly suggests that it is sending data only to its own port and not to the other nodes' ports (which are exactly the same). I believe that unique ports should solve your problem.

You are using the same port number for all the instances. This can cause conflicts when multiple listeners try to bind to the same port on the same host. Instead, you could use a unique port number for each instance, for example by adding an offset to a base port number (3000, 3001, ...). Unique ports per instance also make dev testing easier.
You are creating a new thread for each instance, but each thread handles only one connection. This can be inefficient and limits the number of connections your program can handle. Instead, you can use Tokio's spawn function to spawn a task for each connection, which lets you handle multiple connections concurrently.
Also, the loop does not wait for the threads to finish before moving on to the next IP address. This can cause synchronization issues and lead to unexpected behavior.
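
A minimal sketch of the one-worker-per-connection idea, written against the standard library so that it stands alone: the listener is bound once, and every accepted connection gets its own thread. The names `serve`, `handle_connection`, and `expected_peers` are hypothetical and not part of the question's code. Under Tokio, `tokio::net::TcpListener` and `tokio::spawn` would play the roles that `std::net::TcpListener` and `thread::spawn` play here.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Hypothetical handler: read until a line contains "EOF" or the peer closes,
/// then echo everything that was received back to the peer.
fn handle_connection(stream: TcpStream) -> std::io::Result<String> {
    let mut writer = stream.try_clone()?;
    let mut reader = BufReader::new(stream);
    let mut received = String::new();
    loop {
        let bytes_read = reader.read_line(&mut received)?;
        // bytes_read == 0 means the peer closed the connection.
        if bytes_read == 0 || received.contains("EOF") {
            break;
        }
    }
    writer.write_all(received.as_bytes())?;
    Ok(received)
}

/// Hypothetical server loop: the listener is bound once by the caller, then
/// `expected_peers` connections are accepted, each handled on its own thread.
fn serve(listener: TcpListener, expected_peers: usize) -> Vec<String> {
    let mut handles = Vec::new();
    for stream in listener.incoming().take(expected_peers) {
        match stream {
            Ok(stream) => handles.push(thread::spawn(move || handle_connection(stream))),
            Err(e) => eprintln!("accept failed: {}", e),
        }
    }
    // Collect the messages of all connections that completed without error.
    handles
        .into_iter()
        .filter_map(|h| h.join().ok())
        .filter_map(|r| r.ok())
        .collect()
}
```

With four nodes, each instance would bind a single listener and call `serve(listener, 4)`, so the order in which peers connect no longer matters as long as the listener is still up when they do.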

On a personal note, testing asynchronous communication between distributed nodes is hard; especially when we have multiple threads, and they don't work.

edited May 06 '23 at 18:39

answered May 06 '23 at 18:33

Jishan Shaikh


Hi @Jishan. Thanks for the answer. I did some testing. The reason node A only sends to itself is that I start (ssh) node A first. So while node A's server starts and remains active, its client doesn't find any live server except its own and so doesn't send any data to the others. On the other hand, node B, which I start after node A, sends data to both itself and node A, since node A's server is already listening by then. Can you please tell me a way for the client to stay active and keep checking for the server (just like the server stays active)? If that's possible, then even if I start A first, its client will wait. – Zubayr May 06 '23 at 18:48
This is why you can see node A has got all 4 messages (as its server stays active for all nodes including itself). But node B (which starts after node A and before node C and node D) only gets data from nodes B, C, D – Zubayr May 06 '23 at 18:51
"You're only starting one server per instance, and all clients are connecting to that one server. This means that the first instance will receive all the data because all other instances are connecting to its server. " Is this what you meant? – Jishan Shaikh May 06 '23 at 18:52
In that case, you can modify your client code to keep retrying until it successfully connects to the server. One way to do this is to wrap the client code in a loop and wait for a short delay between retries. That way you can start the instances in any order you want. – Jishan Shaikh May 06 '23 at 18:53
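
The retry loop suggested in the last comment could look roughly like the sketch below. It uses the blocking standard library API so that it stands alone. `connect_with_retry`, `send_hello`, the attempt count, and the delay are all assumptions made for illustration. In the question's async client, `tokio::net::TcpStream::connect` and `tokio::time::sleep` would take the place of the blocking calls.

```rust
use std::io::{self, Write};
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: keep trying to connect until the peer's listener is up
/// or `max_attempts` attempts have failed.
fn connect_with_retry(address: &str, max_attempts: u32, delay: Duration) -> io::Result<TcpStream> {
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match TcpStream::connect(address) {
            Ok(stream) => return Ok(stream),
            Err(e) => {
                eprintln!("attempt {}/{} to {} failed: {}", attempt, max_attempts, address, e);
                last_err = Some(e);
                // Wait before the next attempt so the peer has time to start.
                thread::sleep(delay);
            }
        }
    }
    Err(last_err.unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "no attempts made")))
}

/// Hypothetical client: same payload as in the question, sent once the
/// connection succeeds. The 60 x 1 s retry budget is an arbitrary choice.
fn send_hello(address: &str, self_ip: &str) -> io::Result<()> {
    let mut stream = connect_with_retry(address, 60, Duration::from_secs(1))?;
    stream.write_all(self_ip.as_bytes())?;
    // The trailing newline lets a read_line-based server return without
    // waiting for the connection to close.
    stream.write_all(b"hello world!EOF\n")?;
    Ok(())
}
```

With this in place, node A's client would keep polling B, C, and D until their servers come up instead of failing on the first refused connection.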
