0

I am trying find mean of the values in two columns.

Input:

tweetcricscore  34 #afgvssco   51
tweetcricscore  23 #afgvszim   46
tweetcricscore  24 #banvsire   12
tweetcricscore  456 #banvsned  46
tweetcricscore  653 #canvsnk   1
tweetcricscore  789 #cricket   178
tweetcricscore  625 #engvswi   46
tweetcricscore  86 #hkvssco    23
tweetcricscore  3 #indvsban    1
tweetcricscore  87 #sausvsvic  8
tweetcricscore  98 #wt20       56

I am trying to find the mean value of col[1] and col[3] horizontally. I.e.

34 + 51 / 2 = 42.5

Code:

data = np.genfromtxt('keyword.csv', delimiter=',', comments=None)

d0=data[:,1]
d1=data[;,3]

and write the output d2 in col[4] in same input file.

I can find many samples for the vertical calculation of mean but not able to find horizontal. The array is NOT to be sorted as both the values are related to each other with respect to col[0] and col[2]. Any suggestion is appreciated.

Thanks in advance.

Sitz Blogz
  • 1,061
  • 6
  • 30
  • 54
  • are you always going to have two static columns that your looking to average? Or will then number of columns be dynamic? –  May 10 '16 at 15:56
  • @pyNoob As of now only 2 columns. Cannot assure about the future. – Sitz Blogz May 10 '16 at 16:21

2 Answers2

3

You can easily do this in Pandas:

import pandas as pd

df = pd.read_csv('keyword.csv', header = None)

df.columns = ['col1','col2','col3','col4']

df['avgCol'] = (df['col2'] + df['col4'])/2
flyingmeatball
  • 7,457
  • 7
  • 44
  • 62
1

In numpy you can accomplish this with np.mean (see docs here. You can run this on either axis; 0 being the vertical and 1 being the horizontal The trick is that you need to remove your columns without text first. After you run your line data = np.genfromtxt('keyword.csv', delimiter=',', comments=None) next delete the columns with text. In this case column 0 and 2, so you can just delete every other row starting with 0 using data2 = np.delete(data, np.s_[::2], 1). Once you have this you can then use d3 = np.mean(data2, axis=1) and will get an array of the desired means. To get this as the forth column in the original data then use data = np.insert(data, data.shape[1], d3, axis=1)

All together:

data = np.genfromtxt('keyword.csv', delimiter=',', comments=None)
data2 = np.delete(data, np.s_[::2], 1)
d3 = np.mean(data2, axis=1)
data = np.insert(data, data.shape[1], d3, axis=1)

My biggest caveat with this is the data2 line, it will only work with a table structured the way you displayed. If you have more columns or columns that aren't ordered as text, float, text, float, ... you will have to rewrite it to correctly remove the columns with text. If necessary you could write up a script to search for columns with NaN values once loaded into a numpy array. eg np.all(np.isnan(data), axis=0) will return a 1-d array with True, False values for whether all values in a column are nan or not. You could then use this to remove columns at index with values of true.

Functionally:

indices = np.all(np.isnan(data), axis=0)
indices = np.where(indices == True)
data2 = np.delete(data, np.s_[indices[0]],1)

Pandas is great because it will automatically ignore columns with text for the mean value. It makes this calc much simpler:

import pandas as pd
df = pd.read_csv('keywords.csv', header=None)
df[df.shape[1]] = df.mean(axis=1)

This creates a new column at the end of the dataframe containing the mean value of integer/float values in that row.

Grr
  • 15,553
  • 7
  • 65
  • 85
  • Thank you for the solution and the detail explanation. I am a bit confused here. What if I have more than 2 columns of numerical values ? and I have to stick to particular columns only ? After the mean calculation I will take that output value to plot a cdf . – Sitz Blogz May 11 '16 at 06:40
  • I tired using the `df[df.shape[1]]` as input for plotting a graph but I am getting error. Any simple solution ? – Sitz Blogz May 11 '16 at 10:23
  • @SitzBlogz whats the error? And how are you trying to plot it? – Grr May 11 '16 at 13:33
  • I'm assuming you want to make a bar chart with some sort of text value from the table associated with the means. `df[df.shape[1]]` is used in the code i provided to notate the number of columns in the table and thus was used to insert a new column after the last column. in your example there are 5 columns so inserting a column at index 5 places it after the last column. For plotting purposes you couldn't use `df[df.shape[1]]` as it is outside of the index. To access the means you could say df[df.shape[1]-1] or if you know the index (in this case 4), you could say `df[4]` – Grr May 11 '16 at 13:50
  • I will be plotting ccdf of the calculated mean value – Sitz Blogz May 11 '16 at 13:51
  • I am trying multiple ways of visualization for a multivariate data points. The data points happen to be related to each other. First I want to clarify the distribution and later post classification I want to visualize the classes in better way. I am thinking of a scatter plot in this case. http://stackoverflow.com/questions/37147592/multiple-inputs-multivariate-data-visualisation/37152745?noredirect=1#comment61863987_37152745 – Sitz Blogz May 11 '16 at 13:56
  • Your way of explanation is really great for someone like me who is at beginning level of programming. I could use some help with the other question and would like to accept your ans. But here the answer I accepted has solved my purpose. – Sitz Blogz May 11 '16 at 13:58