Correlation between dependent and independent variables in non-normal distribution

Question

Edit & Update:

I am trying to use Python or SPSS to measure the effect of some factors on one or more metrics. My dataset contains 100 records of patients who have been treated at several time points (e.g., over three months). The dataset looks like this:

     a1  a2  a3  b1  b2  b3  metric1 metric2 metric3
1    1.2 2.3 3.5 90  58  29  2.1     3.2     1.2  
2    3.2 3.4 1.5 58  54  39  3.1     4.2     3.2  
...
100  3.1 1.3 2.5 36  63  45  5.1     4.2     3.2

As you can see, factor a (let's say glucose, with a non-normal distribution) and factor b (let's say a treatment or drug, with a normal distribution) have each been recorded three times per patient. At each visit, a metric (for example, a health metric) has been recorded as well. Now I want to know how factor b influences the metric across the three visits. For example, is there any correlation between factor b and the metric in this dataset? If so, how significant is it?

I tried several approaches, including one-way ANOVA and correlating the sample means, but without success. I know that this kind of data should be analyzed with a repeated-measures method, but since I have multiple independent variables, some with a non-normal distribution, I am a bit confused. Which statistical method should I use?

Any help is appreciated!

score -1 · Answer 1 · answered Oct 22 '18 at 13:42

Your data is currently in wide format. I haven't done statistics in Python, but in R you need long format for most functions.

Convert your dataframe to long format. I think you can do that with pd.melt():

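import pandas as pd
df["Patient"] = df.index + 1  # patient id derived from the 0-based row index
# Glucose is factor a, so melt a1..a3 (the b columns hold the treatment)
pd.melt(df, id_vars=["Patient"], value_vars=['a1', 'a2', 'a3'], var_name='Repeated', value_name='Glucose')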

This is incomplete, because you need to do the same for your treatments. I'm not sure how to melt two groups of columns in one call, but you can do it by splitting the DataFrame, melting each part, and then merging the parts again.
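The split, melt, and merge idea can be sketched as below. This is only an illustration under some assumptions: the toy `df` holds the three rows shown in the question (renumbered as patients 1 to 3), the helper `melt_group` and the `Visit` column are hypothetical names, and the result has one row per patient per visit, which may not be exactly the layout intended here.

```python
import pandas as pd

# Toy data in the question's wide layout (the three rows shown there).
df = pd.DataFrame({
    "a1": [1.2, 3.2, 3.1], "a2": [2.3, 3.4, 1.3], "a3": [3.5, 1.5, 2.5],
    "b1": [90, 58, 36], "b2": [58, 54, 63], "b3": [29, 39, 45],
    "metric1": [2.1, 3.1, 5.1], "metric2": [3.2, 4.2, 4.2],
    "metric3": [1.2, 3.2, 3.2],
})
df["Patient"] = df.index + 1


def melt_group(wide, prefix, value_name):
    """Melt prefix1..prefix3 into one row per patient and visit (hypothetical helper)."""
    cols = [f"{prefix}{i}" for i in (1, 2, 3)]
    long = pd.melt(wide, id_vars=["Patient"], value_vars=cols,
                   var_name="Visit", value_name=value_name)
    # Strip the prefix so 'a1' / 'b2' / 'metric3' become visit numbers 1, 2, 3.
    long["Visit"] = long["Visit"].str[len(prefix):].astype(int)
    return long


# Melt each group of columns separately ...
glucose = melt_group(df, "a", "Glucose")
treatment = melt_group(df, "b", "Treatment")
metric = melt_group(df, "metric", "Metric")

# ... then merge the pieces back together on patient and visit.
long_df = (glucose.merge(treatment, on=["Patient", "Visit"])
                  .merge(metric, on=["Patient", "Visit"])
                  .sort_values(["Patient", "Visit"])
                  .reset_index(drop=True))
print(long_df)
```

Merging on both `Patient` and `Visit` keeps each visit's glucose, treatment, and metric values on the same row, which is what repeated-measures functions that expect long format typically need.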

Your goal dataframe should look like this:

Patient  Glucose  GRepeated  Treatment  TRepeated  Metric  MRepeated
1        1.2      a1         90         b1         3.2     metric1
2        3.2      a2         54         b2         4.2     metric2
...
100      3.1      a3         45         b3         3.2     metric3

answered Oct 22 '18 at 13:42

Alexis Drakopoulos

I do not understand how this answers the question that was asked. – James Phillips Oct 22 '18 at 14:05
Because the reason his ANOVA wasn't working is most likely that the data is in the wrong shape. His data seems to be a simple hierarchical nested design, or perhaps a repeated-measures ANOVA. For this he will need his data to be in long format, not wide as he currently has it. – Alexis Drakopoulos Oct 22 '18 at 14:24
@JamesPhillips Thanks for your answer, I do not see how you are converting the data to a long format. Each patient has three values for each factor and three values for each metric, while in the data you created, factors were distributed across the patients, no? – Enayat Oct 22 '18 at 14:43
@AlexisDrakopoulos I think the above comment was for you and not me, as you had answered the question. – James Phillips Oct 22 '18 at 14:51
@EnayatRajabi My example was for 1 factor. Your goal should be to have values per patient, since you are attempting to find the differences in treatments. Then you can make the different treatments and repeated measures into factors which should allow for analysis. – Alexis Drakopoulos Oct 22 '18 at 15:40
@AlexisDrakopoulos Thanks for the comment. Still I don't have any idea what you are trying to achieve. I have treatment per patient and I want to find the correlation of variables considering all three treatments. – Enayat Oct 22 '18 at 18:06
@AlexisDrakopoulos Moreover, as far as I know, we do not need to convert wide format to long format for any analysis. It depends on the type of analysis we want to perform. Beyond that, my question refers to the type of statistical method we need for this type of data. – Enayat Oct 22 '18 at 19:12
I didn't mean that we have to. I meant that R packages usually require long format to function properly (linear models such as aov, lme). – Alexis Drakopoulos Oct 23 '18 at 09:14
