I am reading a text file with Python.

import numpy as np

with open("Input2010_5a.txt", "r") as file:
    for line in file:
        date, long, lat, depth, temp, sal = line.split("\t")
        line_data = []
        line_data.append(float(date))
        line_data.append(float(long))
        line_data.append(float(lat))
        line_data.append(float(depth))
        line_data.append(float(temp))
        line_data.append(float(sal))

As a result I get 41 lists, each of this form:

[2010.36, 23.2628, 59.7768, 1.0, 4.1, 6.04] #it's one of them

Now I need to make a covariance matrix using them. I'm not sure how to make it.

Narendra

3 Answers

Extracting your lists from your txt file

I would first extract your lists from your text file into some sort of dictionary structure, something along the lines of:

d = {}
with open("Input2010_5a.txt", "r") as file:
    counter = 0
    for line in file:
        date, long, lat, depth, temp, sal = line.split("\t")
        line_data = []
        line_data.append(float(date))
        line_data.append(float(long))
        line_data.append(float(lat))
        line_data.append(float(depth))
        line_data.append(float(temp))
        line_data.append(float(sal))
        d['list'+str(counter)] = line_data
        counter += 1

And d will be a dictionary looking something like this:

{'list0': [2010.36, 23.2628, 59.7768, 1.0, 4.1, 6.04],
 'list1': [more, list, values, here],
 ...
}

Covariance matrix method 1: numpy

You can stack your 41 lists contained in your dictionary d and then use np.cov.

import numpy as np

# wrap in list(): np.vstack expects a sequence such as a list or tuple
all_ls = np.vstack(list(d.values()))

cov_mat = np.cov(all_ls)

This returns your covariance matrix. Note that np.cov treats each row as a variable by default, so stacking 41 lists gives a 41x41 matrix.

Covariance matrix method 2: pandas

You can also use pandas' DataFrame.cov to get the same covariance matrix, if you prefer to have it in pandas tabular format for later use:

import pandas as pd

df=pd.DataFrame(d)

cov_mat = df.cov()

Minimal example

If you had a txt file that looked like:

2010.36 23.2628 59.7768 1.0 4.1 6.04
2018.36 29.2    84  2.0 8.1 6.24
2022.36 33.8    99  3.0 16.2    6.5

The result of method 1 would give you:

array([[ 661506.97804414,  662002.706604  ,  661506.6953528 ],
       [ 662002.706604  ,  662576.37510667,  662123.94745333],
       [ 661506.6953528 ,  662123.94745333,  661701.07526667]])

and method 2 would give you:

               list0          list1          list2
list0  661506.978044  662002.706604  661506.695353
list1  662002.706604  662576.375107  662123.947453
list2  661506.695353  662123.947453  661701.075267
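
As a rough sketch, the minimal example can be reproduced without a file on disk. The snippet assumes tab-separated values and uses io.StringIO as a stand-in for the text file (the name sample is made up for illustration). With three lines of input, np.cov with its default settings treats each line as one variable, which is why the result is 3x3:

```python
import io
import numpy as np

# Hypothetical stand-in for the text file: three tab-separated lines.
sample = io.StringIO(
    "2010.36\t23.2628\t59.7768\t1.0\t4.1\t6.04\n"
    "2018.36\t29.2\t84\t2.0\t8.1\t6.24\n"
    "2022.36\t33.8\t99\t3.0\t16.2\t6.5\n"
)

# np.loadtxt parses every line into one row of a 2-D float array.
all_ls = np.loadtxt(sample, delimiter="\t")

# Default rowvar=True: each row (each line of the file) is one variable.
cov_mat = np.cov(all_ls)
print(all_ls.shape, cov_mat.shape)
print(cov_mat)
```

Loading the whole file into one array this way also avoids building the intermediate dictionary, if you do not need the list names.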
sacuL

I found it a bit tricky to understand how np.cov calculates the covariance matrix. By the Wikipedia definition, the element at position (i, j) is the covariance between the ith and the jth features. As an example:

the variation in a collection of random points in two-dimensional space cannot be characterized fully by a single number, nor would the variances in the x and y directions contain all of the necessary information; a 2×2 matrix would be necessary to fully characterize the two-dimensional variation.

That said, since you have 6 dimensions, you should have a 6x6 matrix.

Following that, I did some research and found this question, which uses rowvar=False as shown below:

import numpy as np
l1 = [2010.36, 23.2628, 59.7768, 1.0, 4.1, 6.04]
l2 = [2018.36, 29.2, 84, 2.0, 8.1, 6.24]
all_ls = np.vstack((l1,l2))
np.cov(all_ls, rowvar=False)

You can build all_ls by stacking as many lists as you have, and the covariance matrix will still be a 6x6 matrix.

Additionally, note that np.cov calculates the covariance for all pairs of variables passed to it. For a better understanding, I recommend this question, which shows how np.cov gets a 2x2 matrix from your input when you don't set rowvar=False.
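
To make the effect of rowvar concrete, here is a small sketch using random data of the same shape as yours, (41, 6). The data and the names (rng, data, manual) are placeholders, not your measurements. It also checks the 6x6 result against the textbook formula, centering each column and dividing by n - 1:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(41, 6))  # 41 observations (rows), 6 variables (columns)

cov_rows = np.cov(data)                # default: rows are variables -> 41x41
cov_cols = np.cov(data, rowvar=False)  # columns are variables -> 6x6

# Same 6x6 matrix by hand: center each column, then X^T X / (n - 1).
centered = data - data.mean(axis=0)
manual = centered.T @ centered / (data.shape[0] - 1)

print(cov_rows.shape, cov_cols.shape)
print(np.allclose(cov_cols, manual))
```

Passing data.T with the default rowvar gives the same 6x6 matrix, so either form works as long as the variables end up where np.cov expects them.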

leoschet

I believe the most pythonic way could be the following, using pandas:

import pandas as pd

file_path = "Input2010_5a.txt"
# header=None: the file has no header row, so the first line is kept as data
cov = pd.read_csv(file_path, sep='\t', header=None).cov()
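
If the file has no header row, as the parsing loop in the question suggests, read_csv would otherwise consume the first line as column names. A minimal sketch of reading it with explicit names follows. The column names are taken from the variables in the question, and io.StringIO with three sample lines stands in for the real file, so treat both as assumptions:

```python
import io
import pandas as pd

# Column names borrowed from the question's variables; adjust as needed.
names = ["date", "long", "lat", "depth", "temp", "sal"]

# Hypothetical stand-in for the text file: three tab-separated lines.
sample = io.StringIO(
    "2010.36\t23.2628\t59.7768\t1.0\t4.1\t6.04\n"
    "2018.36\t29.2\t84\t2.0\t8.1\t6.24\n"
    "2022.36\t33.8\t99\t3.0\t16.2\t6.5\n"
)

# header=None keeps the first line as data; names labels the columns.
df = pd.read_csv(sample, sep="\t", header=None, names=names)

# DataFrame.cov computes pairwise covariance between columns -> 6x6.
cov = df.cov()
print(cov.shape)
print(cov.loc["temp", "depth"])
```

With labeled columns, individual entries can be looked up by name, which tends to be easier to read than positional indexing.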

In addition, if you'd like to visualise the matrix, you could use seaborn.heatmap:

from matplotlib import pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = [40, 40]
plt.axis('scaled')
sns.heatmap(cov, 
        annot=True,
        cbar = False,
        fmt="0.2f",
        cmap="YlGnBu",
        xticklabels=range(len(cov)),
        yticklabels=range(len(cov)))
plt.title("Covariance matrix")

The following heatmap was generated from a random matrix of the same shape as your data, (41, 6):

[Figure: Covariance matrix]

Luca Cappelletti