9

I have my data in table as:

id  Author_ID   Research_Area       Category_ID  Paper_Count   Paper_Year   Rank  
---------------------------------------------------------------------------------
1   677         feature extraction  8            1             2005         1
2   677         image annotation    11           1             2005         2
3   677         probabilistic model 12           1             2005         3
4   677         semantic            19           1             2007         1
5   677         feature extraction  8            1             2009         1
6   677         image annotation    11           1             2011         1  
7   677         semantic            19           1             2012         1  
8   677         video sequence      5            2             2013         1  
9   1359        adversary model     1            2             2005         1
10  1359        ensemble method     14           2             2005         2
11  1359        image represent     11           2             2005         3
12  1359        adversary model     1            7             2006         1
13  1359        concurrency control 17           5             2006         2
14  1359        information system  12           2             2006         3  
15  ...         
16  ...  

Whereas I want to have an output of query as:

id  Author_ID   Category_ID  Paper_Count   Category_Prob   Paper_Year   Rank  
---------------------------------------------------------------------------------
1   677         8            1             0.333           2005         1
2   677         11           1             0.333           2005         2
3   677         12           1             0.333           2005         3
4   677         19           1             1.0             2007         1
5   677         8            1             1.0             2009         1
6   677         11           1             1.0             2011         1  
7   677         19           1             1.0             2012         1  
8   677         5            2             1.0             2013         1  
9   1359        1            2             0.333           2005         1
10  1359        14           2             0.333           2005         2
11  1359        11           2             0.333           2005         3
12  1359        1            7             0.5             2006         1
13  1359        17           5             0.357           2006         2
14  1359        12           2             0.142           2006         3  
15  ...         
16  ...  

Whereas Category_Prob is a calculated column which is calculated in two steps as:

Step First, we have to have a SUM of Paper_Count in each Paper_Year for instance i.e. Paper_Year = 2005 and Author_ID = 677, the SUM(Paper_Count) = 3

Step Second, then for each Category_ID, we have to divide Paper_Count with value of SUM(Paper_Count) in that Paper_Year which will be 1/3 i.e. 0.333 and so on...

Moreover, I have tried this query:

SELECT 
    Author_ID, Abstract_Category, Paper_Count,
    [Category_Prob] = Paper_Count / SUM(Paper_Count),
    Paper_Year, Rank
FROM 
    Author_Areas
GROUP BY 
    Author_ID, Abstract_Category, Paper_Year, Paper_Count, Rank
ORDER BY 
    Author_ID, Paper_Year

But it returns just 1 in the column Category_Prob for all of the rows in the table.

halfer
  • 19,824
  • 17
  • 99
  • 186
maliks
  • 1,102
  • 3
  • 18
  • 42

2 Answers2

6

The problem with your query is that you are not grouping by Paper_Year, but also by Author_ID, Abstract_Category, Paper_Count, Rank. Hence SUM(Paper_Count) is equal to Paper_Count for each group.

You can use SUM OVER for this:

SELECT      id, Author_ID, Abstract_Category [Category_ID],  
            Paper_Count, 
            Paper_Count * 1.0 / SUM(Paper_Count)  
            OVER (PARTITION BY Author_ID, Paper_Year) AS [Category_Prob],
            Paper_Year, Rank
FROM        Author_Areas
ORDER BY    Author_ID, Paper_Year

Note: You have to multiply by 1.0 so as to avoid integer division. Note 2: Perhaps you have to add Author_ID field in the PARTITION BY clause as well, if your actual requirement is to group by author, year.

maliks
  • 1,102
  • 3
  • 18
  • 42
Giorgos Betsos
  • 71,379
  • 9
  • 63
  • 98
  • @Giorgos--The query executed successfully but values in the `calculated_column` are not as desired – maliks Apr 18 '16 at 08:27
  • @Bridge I think not. `SUM` with `OVER` clause is available from SQL Server 2005 onwards (not 100% sure though). Sql Server 2012 added the `ORDER BY` in the `OVER` clause so as to calculate running totals. – Giorgos Betsos Apr 18 '16 at 08:27
  • @maliks What results do you get? – Giorgos Betsos Apr 18 '16 at 08:29
  • @GiorgosBetsos Oh yeah! I just added `(PARTITION BY Author_ID, Paper_Year)` and I got desired results. Thanks! – maliks Apr 18 '16 at 08:34
0

I suspect (please confirm) that the datatype of all fields involved are integers. When you calculate with int the return type is also int. You should convert the fields to decimal before calculation.

SELECT Author_ID, Abstract_Category, Paper_Count,
[Category_Prob] = convert(decimal(10,3), Paper_Count) / convert(decimal(10, 3), SUM(Paper_Count)),
Paper_Year, Rank
FROM Author_Areas
GROUP BY Author_ID, Abstract_Category, Paper_Year, Paper_Count, Rank
ORDER BY Author_ID, Paper_Year
HoneyBadger
  • 14,750
  • 3
  • 34
  • 48