Wednesday, April 18, 2012

Dynamic Pivoting in SQL Server

Introduction

Pivoting data is sometimes necessary to create BI data sets. It can be easy to do, and it also can be a hard task to complete, especially if you do not know how wide your data is going to be when it pivots out. Using T-SQL, you can dynamically create the pivot queries instead of spending hours trying to find the values that need to be pivoted, or writing out a long aggregate query.

The code

I have two examples: One pivots data from system tables and counts the number of columns it has with a certain data type. The second one finds the average number of days between dates on a table that is used for activity tracking. I'll break down the code and explain what is going on. If you are not familiar with the pivot and unpivot syntaxes, read up on those before going on.

First example

In this example, we are using the following system tables: sys.columns, sys.objects, and sys.types. We are going to find the number of columns that the tables have with the data types which they use. We could hard-code all of the data-types, then have a bunch of unused columns with 0 in them for every table, or we could dynamically make a list of the data-types that are used, then pivot from there.
First, we'll select the data into a global temporary table (##Tables).
if object_id('tempdb..##Tables') is not null 
    drop table ##Tables

select 
    c.Name As ColumnName
    ,t.Name As DataType
    ,o.Name As TableName
Into ##Tables
from sys.columns c
join sys.objects o on 
    o.object_id = c.object_id
    And o.type_desc = 'USER_TABLE'
join sys.types t on
    t.system_type_id = c.system_type_id
What we have in ##Tables is a list of all of the columns with their data type and parent table.
Next, a cursor is used and goes through the distinct data-types that we have in ##Table. We add each data-type that is in the table into @DtSql, which will be used later.
Declare @sql varchar(max)
,@DtSql varchar(max)
,@DataType varchar(100)
,@c bit

Set @DtSql = ''
Set @c = 0

Declare data_type cursor 
for Select Distinct DataType 
From ##Tables
Order By 1

Open data_type
fetch next from data_type
into @DataType

While @@Fetch_Status = 0
Begin
    if(@c = 1)
        Set @dtSql = @DtSql + ',[' + @DataType + ']'    
    else
    Begin
        Set @DtSql = @DtSql + '[' + @DataType + ']'
        Set @c = 1
    End

    Fetch next from data_type
    into @DataType
End

close data_type
deallocate data_type
Finally, we write out @Sql with the pivot syntax, and pivot on the list of values in @DtSql. We execute @Sql and get the results.
Set @Sql = 'Select TableName, ' + @DtSql
Set @Sql = @Sql + '
From
(Select ColumnName, DataType, TableName From ##Tables) p
Pivot
(
    Count(ColumnName)
    For DataType in (' + @DtSql + ' )
) as pvt'

print @sql
exec(@Sql)
The final results will be similar to the table below:
TableNameBitVarcharIntDateTime
Dates01042
Test3020

Example 2

The scenario for this next example is: You have a table that tracks activities when accounts log on. You are asked to find the average number of days between logins for the accounts (and you ignorantly nod your head, not knowing the mess you are about to get into).
The problem: There is no way to hard code the values to pivot on, making the pivot syntax difficult, if not unusable. The complexity of finding the average number of days between activities can sound easy, not realizing that your data is stored vertically, in this situation, and you need to have it pivot horizontally, then unpivot once you have the days between activities. Sounds confusing, eh?
Here is a small break down of what the tables will be doing:
They will start out vertically, with one activity per row:
IDDate
120080701
120080705
120080710
220080615
220080627
220080702
220080711
220080714
Next, the activities will be laid out horizontally:
ID54321
1200807012008070520080710NULLNULL
22008061520080627200807022008071120080714
After laying the table out horizontally, we will use the DateDiff function to find the difference in days:
IDDateDiff5DateDiff4DateDiff3DateDiff2
14500
212593
Finally, we unpivot and get the averages.
Now, let's get into the code. For this example, we're going to be using a small data set. The following code will create a global temporary table and populate it:
--temp is the table we're going to use
if object_id('tempdb..##Temp') is not null 
    drop table ##Temp

Create Table ##Temp
(
    Id int
    ,Date datetime
)

--Insert test values
Insert Into ##Temp
select 1, '20080701'
union all select 1, '20080705'
union all select 1, '20080710'
union all select 2, '20080615'
union all select 2, '20080627' 
union all select 2, '20080702' 
union all select 2, '20080711' 
union all select 2, '20080714'
Next, we define the parameters that will be used, then find the maximum number of activities that happened for a single ID. This number is how wide the table will be. We find that number by using a simple subquery.
--Declare variables
Declare @Sql varchar(max)
,@Counter int
,@Max int

--Set counter to the most activity
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Max = @Counter
After we get the maximum width, the table will be created. A while loop is used to create the columns with the counter number as the column name. The SQL statement is dynamically created, then executed.
--Create a table with the width of the most activity
Set @Sql = 'Create Table ##Activity ( id int '

While @Counter > 0
Begin
    Set @Sql = @Sql + ',[' + cast(@counter as varchar) + '] datetime'
    Set @Counter = @Counter - 1
End

Set @Sql = @Sql + ')'
--Make sure table doesn't exist
if object_id('tempdb..##Activity') is not null
    Drop Table ##Activity
Exec (@sql)
Once the table is created, it will now get populated by the following code. The code will look into the table with the data (##temp), and get the minimum date that has not been inserted yet. It will insert all values for the IDs at once, which is done by sub-querying. The queries are executed in the while loop. After this code runs, the ##Activity table will look like the second one in the breakdown. Notice that we reset the counter back to the maximum width. This will be done before any step that loops.
--Insert id's into the table that was created
Insert Into ##Activity (id)
Select distinct Id From ##Temp

--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Loop through and insert dates
While @Counter > 0
Begin
    If (@Counter = @Max)
    Begin
        Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar)+ '] '
        Set @Sql = @Sql + '=FirstActivity from ##Activity a '
        Set @Sql = @Sql + 'Left Join (Select a.Id, min(Date) as FirstActivity From 
                    ##Activity a '
        Set @Sql = @Sql + 'Left Join ##Temp t on a.Id = t.Id '
        Set @Sql = @Sql + 'Group By a.id) t on a.id = t.Id'
        
        Exec(@Sql)
        Set @Counter = @Counter - 1
    End
    Else
    Begin
        Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar) + '] '
        Set @Sql = @Sql + '= Activity from ##Activity a '
        Set @Sql = @Sql + 'Left Join (Select a.Id, Min(Date) as Activity '
        Set @Sql = @Sql + 'From ##Activity a Left Join ##Temp t on '
        Set @Sql = @Sql + 'a.id = t.id where Date > '
        Set @Sql = @Sql + '[' + Cast(@Counter + 1 as varchar) + '] '
        Set @Sql = @Sql + 'group by a.id) t on a.id = t.id'
        
        Exec(@Sql)
        Set @Counter = @Counter - 1
    End
End
Next, we use the DateDiff function to find the difference between days in the horizontal table. The Coalescefunction is used if the date is null (indicating that the width is shorter for that ID) and sets the difference to 0. Coalescing the nulls to 0 will be accounted for when we do the final average. As usual, a loop is used to loop through all of the columns that have been dynamically created. Since the Insert Into syntax is being used, there will be another table created.
--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Create Table with days between activity
Set @Sql = 'Select id '
While @Counter > 1
Begin
    Set @Sql = @Sql + ',Coalesce(DateDiff(dd,'
    Set @Sql = @Sql + '[' + Cast(@Counter as varchar) + '], '
    Set @Sql = @Sql + '[' + Cast(@Counter - 1 as varchar) + ']),0) '
    Set @Sql = @Sql + 'As [DateDiff' + Cast(@Counter as varchar) + ']'
    
    Set @Counter = @Counter - 1
End

Set @Sql = @Sql + ' into ##Averages from ##Activity'

if object_id('tempdb..##Averages') is not null
    drop table ##Averages
Exec(@sql)
The final part to this puzzle is to unpivot the data. Since we have a table that is in the right format, we can use the unpivot syntax. while loops are used twice to go through the columns: one time for the upper unpivot syntax, and one time for the lower unpivot syntax. The unpivot query is a sub-query of the Avg query. In the query that gets the average, a where clause is used to average only the columns that have a value greater than 0.
--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Unpivot
Set @Sql = 'Select id, Avg(Cast(avgs as float))' 
Set @sql = @Sql + ' from (select id, DiffDays,avgs '
Set @Sql = @Sql + ' from (select id'

while @Counter > 1
Begin
    Set @Sql = @Sql + ',[datediff' + Cast(@Counter as varchar) + ']'
    Set @Counter = @counter - 1
End

--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

Set @Sql = @Sql + ' from ##averages) a unpivot (avgs for diffdays in ('

While @Counter > 1
Begin
    if(@counter = @Max)
    Begin
        Set @Sql = @Sql + '[datediff' + Cast(@counter as varchar) + ']'
        Set @Counter = @Counter - 1
    End
    Else
    Begin
        Set @Sql = @Sql + ',[datediff' + cast(@Counter as varchar) + ']'
        Set @Counter = @Counter - 1
    End
End

Set @Sql = @Sql + ')) as unpvt) f where avgs > 0 group by id'

Exec (@Sql)
After all of that has executed, the results will turn out to be:
IDAverageDays
14.5
27.5
If you haven't noticed yet, we haven't used the pivot syntax. Why is that? Since we were building the table horizontally as we went on, we didn't need to use the pivot statement. Also, the pivot syntax wouldn't have helped much because we would have ended up doing the same steps with the population and the finding the difference in days. The unpivot statement was used because we have a table that was in a good format for that statement.

Ending notes

The evil software vendors don't always make data sets that can be used for quick and easy BI analysis that your evil boss wants.
Thanks for reading this far (I know it's quite a big article). Good luck!

The code as a whole

Sample 1:

if object_id('tempdb..##Tables') is not null 
    drop table ##Tables

select 
    c.Name As ColumnName
    ,t.Name As DataType
    ,o.Name As TableName
Into ##Tables
from sys.columns c
join sys.objects o on 
    o.object_id = c.object_id
    And o.type_desc = 'USER_TABLE'
join sys.types t on
    t.system_type_id = c.system_type_id

Declare @sql varchar(max)
,@DtSql varchar(max)
,@DataType varchar(100)
,@c bit

Set @DtSql = ''
Set @c = 0

Declare data_type cursor 
for Select Distinct DataType 
From ##Tables
Order By 1

Open data_type
fetch next from data_type
into @DataType

While @@Fetch_Status = 0
Begin
    if(@c = 1)
        Set @dtSql = @DtSql + ',[' + @DataType + ']'    
    else
    Begin
        Set @DtSql = @DtSql + '[' + @DataType + ']'
        Set @c = 1
    End

    Fetch next from data_type
    into @DataType
End

close data_type
deallocate data_type

print @DtSql

Set @Sql = 'Select TableName, ' + @DtSql
Set @Sql = @Sql + '
From
(Select ColumnName, DataType, TableName From ##Tables) p
Pivot
(
    Count(ColumnName)
    For DataType in (' + @DtSql + ' )
) as pvt'

print @sql
exec(@Sql)

Sample 2

--temp is the table we're going to use
if object_id('tempdb..##Temp') is not null 
    drop table ##Temp

Create Table ##Temp
(
    Id int
    ,Date datetime
)

--Insert test values
Insert Into ##Temp
select 1, '20080701'
union all select 1, '20080705'
union all select 1, '20080710'
union all select 2, '20080615'
union all select 2, '20080627' 
union all select 2, '20080702' 
union all select 2, '20080711' 
union all select 2, '20080714' 

--Declare variables
Declare @Sql varchar(max)
,@Counter int
,@Max int

--Set counter to the most activity
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Max = @Counter

--Create a table with the width of the most activity
Set @Sql = 'Create Table ##Activity ( id int '

While @Counter > 0
Begin
    Set @Sql = @Sql + ',[' + cast(@counter as varchar) + '] datetime'
    Set @Counter = @Counter - 1
End

Set @Sql = @Sql + ')'
--Make sure table doesn't exist
if object_id('tempdb..##Activity') is not null
    Drop Table ##Activity
Exec (@sql)

--Insert id's into the table that was created
Insert Into ##Activity (id)
Select distinct Id From ##Temp

--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Loop through and insert dates
While @Counter > 0
Begin
    If (@Counter = @Max)
    Begin
        Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar)+ '] '
        Set @Sql = @Sql + '=FirstActivity from ##Activity a '
        Set @Sql = @Sql + 'Left Join (Select a.Id, min(Date) as FirstActivity From 
                    ##Activity a '
        Set @Sql = @Sql + 'Left Join ##Temp t on a.Id = t.Id '
        Set @Sql = @Sql + 'Group By a.id) t on a.id = t.Id'
        
        Exec(@Sql)
        Set @Counter = @Counter - 1
    End
    Else
    Begin
        Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar) + '] '
        Set @Sql = @Sql + '= Activity from ##Activity a '
        Set @Sql = @Sql + 'Left Join (Select a.Id, Min(Date) as Activity '
        Set @Sql = @Sql + 'From ##Activity a Left Join ##Temp t on '
        Set @Sql = @Sql + 'a.id = t.id where Date > '
        Set @Sql = @Sql + '[' + Cast(@Counter + 1 as varchar) + '] '
        Set @Sql = @Sql + 'group by a.id) t on a.id = t.id'
        
        Exec(@Sql)
        Set @Counter = @Counter - 1
    End
End

--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Create Table with days between activity
Set @Sql = 'Select id '
While @Counter > 1
Begin
    Set @Sql = @Sql + ',Coalesce(DateDiff(dd,'
    Set @Sql = @Sql + '[' + Cast(@Counter as varchar) + '], '
    Set @Sql = @Sql + '[' + Cast(@Counter - 1 as varchar) + ']),0) '
    Set @Sql = @Sql + 'As [DateDiff' + Cast(@Counter as varchar) + ']'
    
    Set @Counter = @Counter - 1
End

Set @Sql = @Sql + ' into ##Averages from ##Activity'

if object_id('tempdb..##Averages') is not null
    drop table ##Averages
Exec(@sql)


--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

--Unpivot
Set @Sql = 'Select id, Avg(Cast(avgs as float))' 
Set @sql = @Sql + ' from (select id, DiffDays,avgs '
Set @Sql = @Sql + ' from (select id'

while @Counter > 1
Begin
    Set @Sql = @Sql + ',[datediff' + Cast(@Counter as varchar) + ']'
    Set @Counter = @counter - 1
End

--Reset Counter
Set @Counter = (Select Max(cnt) 
    From (Select Count(*) as cnt From ##temp group by Id) t)

Set @Sql = @Sql + ' from ##averages) a unpivot (avgs for diffdays in ('

While @Counter > 1
Begin
    if(@counter = @Max)
    Begin
        Set @Sql = @Sql + '[datediff' + Cast(@counter as varchar) + ']'
        Set @Counter = @Counter - 1
    End
    Else
    Begin
        Set @Sql = @Sql + ',[datediff' + cast(@Counter as varchar) + ']'
        Set @Counter = @Counter - 1
    End
End

Set @Sql = @Sql + ')) as unpvt) f where avgs > 0 group by id'

Exec (@Sql)

History

  • 7/14/2008 - Published article.

No comments:

Post a Comment