You are getting this error because the hostIp column in your Parquet file has 549 rows, but read_parquet() expects it to have 548 rows.
The code you have provided iterates over the REQUIRED_COLUMNS list and calls read_parquet() for each column individually. This works because each column is read on its own, so its row count is never checked against the other columns. However, when you call read_parquet() with the whole REQUIRED_COLUMNS list as the columns argument, it tries to read all of the listed columns together, including hostIp with its 549 rows against 548 for the rest. This is why you are getting the error.
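To confirm which column is out of step, a rough sketch like the following may help. It reads each column on its own, as your loop already does, and flags any column whose row count differs from the most common one. The helper names column_lengths and mismatched_columns are made up for illustration, and the sketch assumes that reading a single column succeeds, as described above.

```python
from collections import Counter


def column_lengths(path, columns):
    """Read each column on its own and record how many rows it yields."""
    # Imported here so mismatched_columns() stays usable without pandas installed.
    import pandas as pd

    return {col: len(pd.read_parquet(path, columns=[col])) for col in columns}


def mismatched_columns(lengths):
    """Return the columns whose row count differs from the most common count."""
    if not lengths:
        return {}
    # Treat the most frequent row count as the expected one (an assumption).
    expected, _ = Counter(lengths.values()).most_common(1)[0]
    return {col: n for col, n in lengths.items() if n != expected}
```

Calling mismatched_columns(column_lengths(path, REQUIRED_COLUMNS)) should then point at hostIp if it really is the odd one out.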
To solve this problem, you can either:
- Change the read_parquet() call to only read the first 548 rows of the hostIp column.
- Remove the hostIp column from the REQUIRED_COLUMNS list.
Here is an example of how to keep only the first 548 rows of the hostIp column after reading it:
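import pandas as pd

def read_parquet_with_limited_hostIp(path, max_rows=548):
    # Read hostIp on its own, since reading it together with the other columns raises the error.
    data = pd.read_parquet(path, columns=["hostIp"])
    # Keep only the first max_rows rows so the length matches the other columns.
    return data["hostIp"].iloc[:max_rows]

hostIp_data = read_parquet_with_limited_hostIp(path)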
Here is an example of how to remove the hostIp column from the REQUIRED_COLUMNS list:
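# Drop hostIp by name rather than by position, so the order of the list does not matter.
REQUIRED_COLUMNS = [col for col in REQUIRED_COLUMNS if col != "hostIp"]
data = pd.read_parquet(path, columns=REQUIRED_COLUMNS)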