Multiple inputs multivariate data visualisation

Question

I am trying to visualise a multivariate data model by reading it from multiple input files. I am looking for a simple way to visualise multi-category data read from multiple input CSV files. The number of rows ranges from 1 to tens of thousands per file. All the inputs share the same format: CSV files with 4 columns.

Input 1

tweetcricscore 34  51 high

Input 2

tweetcricscore 23 46 low
tweetcricscore 24  12 low
tweetcricscore 456 46 low

Input 3

tweetcricscore 653  1 medium 
tweetcricscore 789 178 medium

Input 4

tweetcricscore 625  46 part
tweetcricscore 86  23 part
tweetcricscore 3  1 part
tweetcricscore 87 8 part
tweetcricscore 98 56 part

Each of the four inputs belongs to a different category, and col[1] and col[2] are paired results of some kind of classification. All the inputs here are outputs of the same classification. I want to visualise them so that all the categories appear in a single plot. I am looking for a Python or pandas solution, using a scatter plot or whatever approach works best.

I have already posted this query in the Data Science section of Stack Exchange without any luck, hence trying here. https://datascience.stackexchange.com/questions/11440/multi-model-data-set-visualization-python

Maybe something like the image below, where every class has its own marker and colour and can be categorised, or any better way to show the pair values together.

code: Edit 1: I am trying to plot a scatter plot with above input files.

import numpy as np
import matplotlib.pyplot as plt
from pylab import*
import math
from matplotlib.ticker import LogLocator
import pandas as pd

df1 = pd.read_csv('input_1.csv', header = None)

df1.columns = ['col1','col2','col3','col4']
plt.df1(kind='scatter', x='col2', y='col3', s=120, c='b', label='Highly')

plt.legend(loc='upper right')
plt.xlabel('Freq (x)')
plt.ylabel('Freq(y)')
#plt.gca().set_xscale("log")
#plt.gca().set_yscale("log")
plt.show()

Error:

Traceback (most recent call last):
  File "00_scatter_plot.py", line 12, in <module>
    plt.scatter(x='col2', y='col3', s=120, c='b', label='High')
  File "/usr/lib/pymodules/python2.7/matplotlib/pyplot.py", line 3087, in scatter
    linewidths=linewidths, verts=verts, **kwargs)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 6337, in scatter
    self.add_collection(collection)
  File "/usr/lib/pymodules/python2.7/matplotlib/axes.py", line 1481, in add_collection
    self.update_datalim(collection.get_datalim(self.transData))
  File "/usr/lib/pymodules/python2.7/matplotlib/collections.py", line 185, in get_datalim
    offsets = np.asanyarray(offsets, np.float_)
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 514, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: could not convert string to float: col2

Expected Output Plotting- Pandas

@MaxU That is just expected output from Pandas Doc. Instead of `Group 1` and `Group 2` I will have `high` `low` `medium` `part` — Sitz Blogz, May 11 '16 at 17:34

MaxU - stand with Ukraine · Accepted Answer · 2016-05-11T18:08:07.293

2

UPDATE:

with different colors:

colors = dict(low='DarkBlue', high='red', part='yellow', medium='DarkGreen')

fig, ax = plt.subplots()

for grp, vals in df.groupby('col4'):
    color = colors[grp]
    vals[['col2','col3']].plot.scatter(x='col2', y='col3', ax=ax,
                                       s=120, label=grp, color=color)

PS: make sure that every group in col4 has an entry in the colors dictionary.
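Since the question also asks for a distinct marker per group, here is a minimal self-contained sketch of the same loop with both a colour and a marker lookup. The rows are typed in from the question's inputs; the `markers` dictionary, the grey/`x` fallbacks for unlisted groups, and the output filename are assumptions for illustration, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the question's inputs, already combined into one frame.
rows = [
    ("tweetcricscore", 34, 51, "high"),
    ("tweetcricscore", 23, 46, "low"),
    ("tweetcricscore", 24, 12, "low"),
    ("tweetcricscore", 456, 46, "low"),
    ("tweetcricscore", 653, 1, "medium"),
    ("tweetcricscore", 789, 178, "medium"),
    ("tweetcricscore", 625, 46, "part"),
    ("tweetcricscore", 86, 23, "part"),
]
df = pd.DataFrame(rows, columns=["col1", "col2", "col3", "col4"])

colors = dict(low="DarkBlue", high="red", part="yellow", medium="DarkGreen")
markers = dict(low="o", high="^", part="s", medium="D")  # assumed marker choices

fig, ax = plt.subplots()
for grp, vals in df.groupby("col4"):
    ax.scatter(
        vals["col2"], vals["col3"], s=120,
        color=colors.get(grp, "gray"),   # fallback colour for unlisted groups
        marker=markers.get(grp, "x"),    # fallback marker for unlisted groups
        label=grp,
    )
ax.set_xlabel("Freq (x)")
ax.set_ylabel("Freq (y)")
ax.legend(loc="upper right")
fig.savefig("scatter_by_group.png")  # hypothetical output file name
```

Using `dict.get` with a fallback means a group missing from either dictionary still gets plotted instead of raising a `KeyError`.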

OLD answer:

Assuming that you've concatenated/merged/joined your files into a single DF, we can do the following:

fig, ax = plt.subplots()
[vals[['col2','col3']].plot.scatter(x='col2', y='col3', ax=ax, label=grp)
 for grp, vals in df.groupby('col4')]

PS as a homework - you can play with colors ;)
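The concatenation step assumed above could look roughly like this. It is only a sketch: the `io.StringIO` objects stand in for the real input files, and comma-separated content is an assumption based on the question calling them CSV files (for whitespace-separated files, `sep=r'\s+'` would be needed).

```python
import io
import pandas as pd

# Stand-ins for the input files; in practice pass file paths to read_csv.
# Comma-separated content is an assumption based on the question's "csv".
inputs = [
    io.StringIO("tweetcricscore,34,51,high\n"),
    io.StringIO("tweetcricscore,23,46,low\ntweetcricscore,24,12,low\n"),
    io.StringIO("tweetcricscore,653,1,medium\n"),
]

names = ["col1", "col2", "col3", "col4"]
# Read every input with the same column names, then stack them row-wise.
frames = [pd.read_csv(src, header=None, names=names) for src in inputs]
df = pd.concat(frames, ignore_index=True)

# Quick check of how many rows each category contributed.
print(df.groupby("col4").size())
```

`ignore_index=True` gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.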

edited May 11 '16 at 18:08

answered May 11 '16 at 17:51


So far the inputs are independent, but I can merge them, no problem with that part. I am keen on having different markers and colours for the groups. May I request the full code so that I won't get confused anymore? – Sitz Blogz May 11 '16 at 17:54
Thank you! Seriously appreciate the help.. Helped me big time. – Sitz Blogz May 11 '16 at 17:57

@SitzBlogz, always glad to help! :) – MaxU - stand with Ukraine May 11 '16 at 18:04
I am getting this error `vals[['col2','col3']].plot.scatter(x='col2', y='col3', ax=ax,AttributeError: 'function' object has no attribute 'scatter'` By the way I am using python 2.7 in case if this makes any difference. – Sitz Blogz May 12 '16 at 01:53
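One likely cause of that AttributeError, assuming an older pandas installation: the `.plot.scatter` accessor form was introduced in pandas 0.17, and before that `DataFrame.plot` was an ordinary method. The `kind='scatter'` spelling is available in both old and new versions, so a hedged workaround is sketched below with a few rows from the question typed inline.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few rows from the question, typed inline for illustration.
df = pd.DataFrame({
    "col2": [34, 23, 24, 653],
    "col3": [51, 46, 12, 1],
    "col4": ["high", "low", "low", "medium"],
})
colors = dict(low="DarkBlue", high="red", medium="DarkGreen")

fig, ax = plt.subplots()
for grp, vals in df.groupby("col4"):
    # kind='scatter' avoids the .plot.scatter accessor entirely
    vals.plot(kind="scatter", x="col2", y="col3", ax=ax,
              s=120, label=grp, color=colors[grp])
```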

score 2 · Answer 2 · answered May 12 '16 at 06:25

While trying @MaxU's solution, which is great, I ran into a few errors. In the process of fixing them I came across an alternative, Bokeh, which looks similar to Seaborn. I am sharing the code as an alternative for beginners' reference.

Code:

import pandas as pd
# bokeh.charts shipped with the Bokeh releases current in 2016;
# later Bokeh releases no longer include it
from bokeh.charts import Scatter, output_file, show

df = pd.read_csv('input.csv', header = None)

df.columns = ['col1','col2','col3','col4']

scatter = Scatter( df, x='col2', y='col3', color='col4', marker='col4', title='plot', legend=True)

output_file('output.html', title='output')

show(scatter)

Output:

score 1 · Answer 3 · answered May 11 '16 at 03:45

Consider plotting a pivot_table of a pandas df which concatenates the many .txt files. Below runs two types of pivots with Type grouping and Class2 grouping. Gaps are due to NaN in pivoted data:

import pandas as pd
import numpy as np
from matplotlib import rc, pyplot as plt
import seaborn

# IMPORT .TXT DATA
df = pd.concat([pd.read_table('TweetCricScore1.txt', header=None, sep='\\s+'),
                pd.read_table('TweetCricScore2.txt', header=None, sep='\\s+'),
                pd.read_table('TweetCricScore3.txt', header=None, sep='\\s+'),
                pd.read_table('TweetCricScore4.txt', header=None, sep='\\s+')])    
df.columns = ['Class1', 'Class2', 'Score', 'Type']

# PLOT SETTINGS
font = {'family' : 'arial', 'weight' : 'bold', 'size'   : 10}    
rc('font', **font); rc("figure", facecolor="white"); rc('axes', edgecolor='darkgray')

seaborn.set()      # FOR MODERN COLOR DESIGN

def runplot(pvtdf):
    pvtdf.plot(kind='bar', edgecolor='w',figsize=(10,5), width=0.9, fontsize = 10)    
    locs, labels = plt.xticks()
    plt.title('Tweet Cric Score', weight='bold', size=14)
    plt.legend(loc=1, prop={'size':10}, shadow=True)
    plt.xlabel('Classification', weight='bold', size=12)
    plt.ylabel('Score', weight='bold', size=12)
    plt.tick_params(axis='x', bottom='off', top='off')
    plt.tick_params(axis='y', left='off', right='off')
    plt.ylim([0,100])
    plt.grid(b=False)
    plt.setp(labels, rotation=45, rotation_mode="anchor", ha="right")
    plt.tight_layout()

# PIVOT DATA
sumtable = df.pivot_table(values='Score', index=['Class2'],
                          columns=['Type'], aggfunc=sum)
runplot(sumtable)
sumtable = df.pivot_table(values='Score', index=['Type'],
                          columns=['Class2'], aggfunc=sum)
runplot(sumtable)

Thank you so much. This is a great representation; the first plot, with respect to class, is what I am looking for. But the values in col[1] and col[2] are pair values and are to be considered as one pair. They are to be plotted together. None of the columns have headers either. — Sitz Blogz, May 11 '16 at 04:16
To work effectively with data frames, especially for plotting and pivot tables, headers help, which is why I added them. You can combine both columns in the index of pivot_table: `index=['Class1', 'Class2']`. Or concatenate the first two columns into one: `df['newcol'] = df['Class1'] + df['Class2'].astype(str)` — Parfait, May 11 '16 at 13:34

Grr · Answer 4 · 2016-05-11T17:58:35.053

First off, your plotting code has a couple of errors, one of which looks like a typo given the traceback you included. After renaming the columns you call plt.df1(...). This should be plt.scatter(...), and the traceback suggests that is what you actually ran. The error is telling you that passing x='col2' makes matplotlib treat the string 'col2' itself as the value to plot. I realise you meant the 'col2' column of df1, but that is not what the call does. Instead, call plt.scatter(df1.col2, df1.col3, ...), where df1.col2 and df1.col3 are the Series holding your x and y values respectively. Fixing this will give you the following output (I used input4 as it has the most data points):
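A minimal sketch of that corrected call, with the Input 4 rows from the question typed inline (an assumption standing in for reading the actual file) and a hypothetical output filename:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Input 4 from the question, typed inline instead of read from disk.
df1 = pd.DataFrame(
    [
        ("tweetcricscore", 625, 46, "part"),
        ("tweetcricscore", 86, 23, "part"),
        ("tweetcricscore", 3, 1, "part"),
        ("tweetcricscore", 87, 8, "part"),
        ("tweetcricscore", 98, 56, "part"),
    ],
    columns=["col1", "col2", "col3", "col4"],
)

# Pass the Series themselves rather than the column names as strings.
points = plt.scatter(df1.col2, df1.col3, s=120, c="b", label="part")
plt.legend(loc="upper right")
plt.xlabel("Freq (x)")
plt.ylabel("Freq (y)")
plt.savefig("input4_scatter.png")  # hypothetical output file name
```

Matplotlib also accepts column names as strings when the frame is passed through the `data=` keyword, e.g. `plt.scatter('col2', 'col3', data=df1)`, which is closer to what the original call was trying to do.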

As far as plotting several categories onto one chart you have several options. You could change the plotting code to something like:

fig, ax = plt.subplots()
ax.plot(df1.col2, df1.col3, 'bo', label='Highly')
ax.plot(df2.col2, df2.col3, 'go', label='Moderately')  # y comes from col3, not col2
ax.legend()
ax.set_xlabel('Freq (x)')  # Axes objects use set_xlabel/set_ylabel
ax.set_ylabel('Freq (y)')
plt.show()

However this is rather clunky. Better would be to have all of the data in one dataframe and add a column titled label that takes the label value you want based on how you categorize the data. That way you could then use something like:

fig, ax = plt.subplots()
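# groupby yields (name, group) pairs; group.x and group.y stand for your x/y columns
for name, group in df.groupby('label'):
    # linestyle='' draws markers only, without lines joining the points
    ax.plot(group.x, group.y, marker='o', linestyle='', label=name)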
ax.legend()
plt.show()

Thank you so much.. :) you guys are going to teach me python real soon.. Seriously appreciate the help — Sitz Blogz, May 11 '16 at 17:56
MaxU's for loop will do the same thing as my final suggestion (except mine will by default give you different colors), for further reading look up comprehensions (list, dictionary, etc.) as that is how his is written. I prefer these but find they can be a little confusing to newer python users. — Grr, May 11 '16 at 18:01
