I need to create a parallel coordinate plot (preferably with GGally's ggparcoord function). However, plain geom_line could also work with some extra transformation.

Background (if TL;DR just skip to 'Problem')

The data relates to the generational pedigree of cattle: for each unique cow/bull breed, the pedigree codes and a value for that breed are given. To make this simpler, here is a sample of the data:

df <- read.csv("git_df.csv")
print(df)

   Cow.Value      Breed          Level.A    Level.B        Level.AA     Level.AB         Level.BA   Level.BB
1         99  GT 190089  AG 130024 HH(c)  AG 060290       AG 090751   HJB 030230        AG 010258  AG 010013
2        110  GT 190064  AG 130024 HH(c) HCO 070086       AG 090751   HJB 030230        VV 030346  EI 980066
3         95  GT 190104      AG 150769 P  GT 150012 AG 130115 PP(c)    AG 130268       HCO 090066 HCO 060156
4        118  GT 190085 CEF 140311 HH(c) NFS 070087      CEF 120367   CEF 080047       MMJ 000174 NFS 000255
5        104  GT 190067 LAR 150117 HH(c) NFS 170004       BP 100017   LAR 120317 ABB 140515 HH(c) NFS 140017
6        104  AG 200016        AG 140311  AG 120380       AG 110038 JMP 080335 P        AG 090082  AG 070232
7         95  GT 200035  AG 130024 HH(c)  GT 140013       AG 090751   HJB 030230        JL 070050 HCO 060020
8         91  GT 200012        AG 160401  AG 080720       AG 100080    AG 090489        AG 050085  AG 040247
9         79  GT 200010       JRP 120081  GT 160005      LAR 070055   JRP 010030       HCO 090066 DBP 090204
10        87  GT 200019        GT 160025 TOR 070207      GZV 140022   GZV 130082       TOR 050162 TOR 050012
11       108  GT 190059  AG 130024 HH(c) NFS 140198       AG 090751   HJB 030230       NFS 090081 NFS 070324
12        90  AG 200024        AG 130147  AG 150871       AG 080210    AG 100156        AG 100080  AG 110728
13        97  GT 200032      AG 150769 P  GT 170036 AG 130115 PP(c)    AG 130268       FCT 120053 HKB 140073
14        95  GT 200020       LMR 130094  AG 060192       AG 100008   LMR 070093       LAR 000084 HJB 970042
15        78 LMR 200004        AG 090751 LMR 150145      CEF 040431    AG 980250       LMR 110036 LMR 070193
16        92  GT 190123       LMR 130094 CRV 150080       AG 100008   LMR 070093       CRV 120095 CRV 120075
17        81  GT 190119       JRP 120081  GT 160007      LAR 070055   JRP 010030       HCO 090066 DBP 090187
18        89  GT 190100        GT 160025  GT 170020      GZV 140022   GZV 130082       AEJ 130013 GZV 120177
19        90  GT 200013      AG 150769 P HDE 150273 AG 130115 PP(c)    AG 130268        AG 110726 HDE 130099
20       100  GT 200043      AG 150769 P TOR 170065 AG 130115 PP(c)    AG 130268       TOR 130174 TOR 090131

So for example, in the first row the cattle Breed 'GT 190089' is the offspring of 'AG 130024 HH(c)'(Level.A) and 'AG 060290'(Level.B). Similarly, going down a level in the generation, the cow 'AG 130024 HH(c)'(Level.A) is the offspring of 'AG 090751' (Level.AA) and 'HJB 030230' (Level.AB). Coupled with the pedigree info I have the breed Cow.Value for the breed in question, which is basically just a scoring value.

Problem

  • Place all codes (Breed and all levels) on the y-axis with text labels.
  • Place the columns (Breed and all levels) on the x-axis in column order.
  • Map a colour scale to the lines to show the Cow.Value (so each line keeps the same colour all the way through).

What I want to do is something like this:

plt <- GGally::ggparcoord(df,
                          columns=c(2:8),groupColumn = 1,scale="globalminmax")
plt

Everything in the plot is how I want it to be, except that I need all the unique codes to be displayed on the y-axis as text labels, rather than the numeric values currently shown. This way I can trace the lineage. Furthermore, I'd like to add minor y-axis grid lines to the plot so the codes can be looked up more accurately. Finally, if possible (not very important), I want to arrange the Breed codes according to their value, i.e., the Breed with the highest value starting at the top of the y-axis.

EDIT

I found another way around the problem without using ggparcoord by using geom_point with geom_path.

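library(ggplot2)
library(magrittr)  # provides %>%

# Melt every column except Cow.Value into long format
df_r <- reshape::melt(df, id.vars = "Cow.Value")
plt <- df_r %>% ggplot(aes(x = variable, y = value, col = Cow.Value)) +
  geom_point() + geom_path()
plt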

This produces the right points, but the paths (lines) are missing.
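
One likely reason the paths are missing: when both axes are discrete, ggplot2 groups rows by the combination of discrete variables in the plot, so no group spans more than one column and geom_path has nothing to connect. Below is a minimal sketch of one way around this. It assumes a hypothetical helper column id (not in the original data) that tags each original row, and it uses only the first three rows and three code columns of the sample to stay short. The reshape is done in base R here rather than with reshape::melt.

```r
library(ggplot2)

# Small inline sample (first three rows, three code columns of the question's data)
df <- data.frame(
  Cow.Value = c(99, 110, 95),
  Breed     = c("GT 190089", "GT 190064", "GT 190104"),
  Level.A   = c("AG 130024 HH(c)", "AG 130024 HH(c)", "AG 150769 P"),
  Level.B   = c("AG 060290", "HCO 070086", "GT 150012"),
  stringsAsFactors = FALSE
)

# 'id' is a hypothetical helper column: one id per original row (one lineage)
df$id <- seq_len(nrow(df))

# Reshape to long format: one row per (lineage, column) pair.
# unlist() stacks the columns one after another, which matches rep(..., each = nrow(df)).
cols <- c("Breed", "Level.A", "Level.B")
df_long <- data.frame(
  id        = rep(df$id, times = length(cols)),
  Cow.Value = rep(df$Cow.Value, times = length(cols)),
  variable  = factor(rep(cols, each = nrow(df)), levels = cols),
  value     = unlist(df[cols], use.names = FALSE)
)

# group = id tells geom_path which points belong to the same line
plt <- ggplot(df_long,
              aes(x = variable, y = value, colour = Cow.Value, group = id)) +
  geom_point() +
  geom_path()
plt
```

Because y is discrete in this sketch, every code from every column gets its own labelled position on the y-axis. With the full data, df would come from read.csv("git_df.csv") and cols would list all seven code columns.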

Please find the csv below for reproducibility. Thanks!

"Cow.Value","Breed","Level.A","Level.B","Level.AA","Level.AB","Level.BA","Level.BB"
99,"GT 190089","AG 130024 HH(c)","AG 060290","AG 090751","HJB 030230","AG 010258","AG 010013"
110,"GT 190064","AG 130024 HH(c)","HCO 070086","AG 090751","HJB 030230","VV 030346","EI 980066"
95,"GT 190104","AG 150769 P","GT 150012","AG 130115 PP(c)","AG 130268","HCO 090066","HCO 060156"
118,"GT 190085","CEF 140311 HH(c)","NFS 070087","CEF 120367","CEF 080047","MMJ 000174","NFS 000255"
104,"GT 190067","LAR 150117 HH(c)","NFS 170004","BP 100017","LAR 120317","ABB 140515 HH(c)","NFS 140017"
104,"AG 200016","AG 140311","AG 120380","AG 110038","JMP 080335 P","AG 090082","AG 070232"
95,"GT 200035","AG 130024 HH(c)","GT 140013","AG 090751","HJB 030230","JL 070050","HCO 060020"
91,"GT 200012","AG 160401","AG 080720","AG 100080","AG 090489","AG 050085","AG 040247"
79,"GT 200010","JRP 120081","GT 160005","LAR 070055","JRP 010030","HCO 090066","DBP 090204"
87,"GT 200019","GT 160025","TOR 070207","GZV 140022","GZV 130082","TOR 050162","TOR 050012"
108,"GT 190059","AG 130024 HH(c)","NFS 140198","AG 090751","HJB 030230","NFS 090081","NFS 070324"
90,"AG 200024","AG 130147","AG 150871","AG 080210","AG 100156","AG 100080","AG 110728"
97,"GT 200032","AG 150769 P","GT 170036","AG 130115 PP(c)","AG 130268","FCT 120053","HKB 140073"
95,"GT 200020","LMR 130094","AG 060192","AG 100008","LMR 070093","LAR 000084","HJB 970042"
78,"LMR 200004","AG 090751","LMR 150145","CEF 040431","AG 980250","LMR 110036","LMR 070193"
92,"GT 190123","LMR 130094","CRV 150080","AG 100008","LMR 070093","CRV 120095","CRV 120075"
81,"GT 190119","JRP 120081","GT 160007","LAR 070055","JRP 010030","HCO 090066","DBP 090187"
89,"GT 190100","GT 160025","GT 170020","GZV 140022","GZV 130082","AEJ 130013","GZV 120177"
90,"GT 200013","AG 150769 P","HDE 150273","AG 130115 PP(c)","AG 130268","AG 110726","HDE 130099"
100,"GT 200043","AG 150769 P","TOR 170065","AG 130115 PP(c)","AG 130268","TOR 130174","TOR 090131"

1 Answer

You could use the breaks argument of scale_y_continuous to give each row a number, with the labels taken from the Breed column, like this:

library(GGally)
plt <- GGally::ggparcoord(df,
                          columns=c(2:8),groupColumn = 1,scale="globalminmax") +
  scale_y_continuous(breaks = 1:nrow(df), labels = df$Breed)
plt

Created on 2023-01-18 with reprex v2.0.2

Quinten
  • Thanks. But the big problem here is that the labels on the y-axis are codes from the 'Breed' column only. I need the y-axis to map all the codes (including the codes from the levels) for the plot to make sense. Take the top one for example (GT 200043): at 'Level.B' it has a point. This can't be right, because there is no code 'GT 200043' in 'Level.B'. If the y-axis included all codes, then the point would be mapped correctly (not on the same horizontal line as 'GT 200043'). Hope this makes sense. – Chris Oosthuizen Jan 18 '23 at 14:06
  • @Quinten can you please see my 'EDIT', which should make things clearer. – Chris Oosthuizen Jan 18 '23 at 14:44