Introduction
Pivoting data is sometimes necessary to create BI data sets. It can be easy to do, and it also can be a hard task to complete, especially if you do not know how wide your data is going to be when it pivots out. Using T-SQL, you can dynamically create the pivot queries instead of spending hours trying to find the values that need to be pivoted, or writing out a long aggregate query.
The code
I have two examples: One pivots data from system tables and counts the number of columns it has with a certain data type. The second one finds the average number of days between dates on a table that is used for activity tracking. I'll break down the code and explain what is going on. If you are not familiar with the pivot and unpivot syntaxes, read up on those before going on.
First example
In this example, we are using the following system tables: sys.columns, sys.objects, and sys.types. We are going to find the number of columns that the tables have with the data types which they use. We could hard-code all of the data-types, then have a bunch of unused columns with 0 in them for every table, or we could dynamically make a list of the data-types that are used, then pivot from there.
First, we'll select the data into a global temporary table (##Tables).
Collapse | Copy Code
if object_id('tempdb..##Tables') is not null
drop table ##Tables
select
c.Name As ColumnName
,t.Name As DataType
,o.Name As TableName
Into ##Tables
from sys.columns c
join sys.objects o on
o.object_id = c.object_id
And o.type_desc = 'USER_TABLE'
join sys.types t on
t.system_type_id = c.system_type_id
What we have in ##Tables is a list of all of the columns with their data type and parent table.
Next, a cursor is used and goes through the distinct data-types that we have in ##Table. We add each data-type that is in the table into
@DtSql
, which will be used later. Collapse | Copy Code
Declare @sql varchar(max)
,@DtSql varchar(max)
,@DataType varchar(100)
,@c bit
Set @DtSql = ''
Set @c = 0
Declare data_type cursor
for Select Distinct DataType
From ##Tables
Order By 1
Open data_type
fetch next from data_type
into @DataType
While @@Fetch_Status = 0
Begin
if(@c = 1)
Set @dtSql = @DtSql + ',[' + @DataType + ']'
else
Begin
Set @DtSql = @DtSql + '[' + @DataType + ']'
Set @c = 1
End
Fetch next from data_type
into @DataType
End
close data_type
deallocate data_type
Finally, we write out
@Sql
with the pivot syntax, and pivot on the list of values in @DtSql
. We execute @Sql
and get the results. Collapse | Copy Code
Set @Sql = 'Select TableName, ' + @DtSql
Set @Sql = @Sql + '
From
(Select ColumnName, DataType, TableName From ##Tables) p
Pivot
(
Count(ColumnName)
For DataType in (' + @DtSql + ' )
) as pvt'
print @sql
exec(@Sql)
The final results will be similar to the table below:
TableName | Bit | Varchar | Int | DateTime |
Dates | 0 | 10 | 4 | 2 |
Test | 3 | 0 | 2 | 0 |
Example 2
The scenario for this next example is: You have a table that tracks activities when accounts log on. You are asked to find the average number of days between logins for the accounts (and you ignorantly nod your head, not knowing the mess you are about to get into).
The problem: There is no way to hard code the values to pivot on, making the pivot syntax difficult, if not unusable. The complexity of finding the average number of days between activities can sound easy, not realizing that your data is stored vertically, in this situation, and you need to have it pivot horizontally, then unpivot once you have the days between activities. Sounds confusing, eh?
Here is a small break down of what the tables will be doing:
They will start out vertically, with one activity per row:
ID | Date |
1 | 20080701 |
1 | 20080705 |
1 | 20080710 |
2 | 20080615 |
2 | 20080627 |
2 | 20080702 |
2 | 20080711 |
2 | 20080714 |
Next, the activities will be laid out horizontally:
ID | 5 | 4 | 3 | 2 | 1 |
1 | 20080701 | 20080705 | 20080710 | NULL | NULL |
2 | 20080615 | 20080627 | 20080702 | 20080711 | 20080714 |
After laying the table out horizontally, we will use the
DateDiff
function to find the difference in days:ID | DateDiff5 | DateDiff4 | DateDiff3 | DateDiff2 |
1 | 4 | 5 | 0 | 0 |
2 | 12 | 5 | 9 | 3 |
Finally, we unpivot and get the averages.
Now, let's get into the code. For this example, we're going to be using a small data set. The following code will create a global temporary table and populate it:
Collapse | Copy Code
--temp is the table we're going to use
if object_id('tempdb..##Temp') is not null
drop table ##Temp
Create Table ##Temp
(
Id int
,Date datetime
)
--Insert test values
Insert Into ##Temp
select 1, '20080701'
union all select 1, '20080705'
union all select 1, '20080710'
union all select 2, '20080615'
union all select 2, '20080627'
union all select 2, '20080702'
union all select 2, '20080711'
union all select 2, '20080714'
Next, we define the parameters that will be used, then find the maximum number of activities that happened for a single ID. This number is how wide the table will be. We find that number by using a simple subquery.
Collapse | Copy Code
--Declare variables
Declare @Sql varchar(max)
,@Counter int
,@Max int
--Set counter to the most activity
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Max = @Counter
After we get the maximum width, the table will be created. A
while
loop is used to create the columns with the counter number as the column name. The SQL statement is dynamically created, then executed. Collapse | Copy Code
--Create a table with the width of the most activity
Set @Sql = 'Create Table ##Activity ( id int '
While @Counter > 0
Begin
Set @Sql = @Sql + ',[' + cast(@counter as varchar) + '] datetime'
Set @Counter = @Counter - 1
End
Set @Sql = @Sql + ')'
--Make sure table doesn't exist
if object_id('tempdb..##Activity') is not null
Drop Table ##Activity
Exec (@sql)
Once the table is created, it will now get populated by the following code. The code will look into the table with the data (##temp), and get the minimum date that has not been inserted yet. It will insert all values for the IDs at once, which is done by sub-querying. The queries are executed in the
while
loop. After this code runs, the ##Activity table will look like the second one in the breakdown. Notice that we reset the counter back to the maximum width. This will be done before any step that loops. Collapse | Copy Code
--Insert id's into the table that was created
Insert Into ##Activity (id)
Select distinct Id From ##Temp
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Loop through and insert dates
While @Counter > 0
Begin
If (@Counter = @Max)
Begin
Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar)+ '] '
Set @Sql = @Sql + '=FirstActivity from ##Activity a '
Set @Sql = @Sql + 'Left Join (Select a.Id, min(Date) as FirstActivity From
##Activity a '
Set @Sql = @Sql + 'Left Join ##Temp t on a.Id = t.Id '
Set @Sql = @Sql + 'Group By a.id) t on a.id = t.Id'
Exec(@Sql)
Set @Counter = @Counter - 1
End
Else
Begin
Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar) + '] '
Set @Sql = @Sql + '= Activity from ##Activity a '
Set @Sql = @Sql + 'Left Join (Select a.Id, Min(Date) as Activity '
Set @Sql = @Sql + 'From ##Activity a Left Join ##Temp t on '
Set @Sql = @Sql + 'a.id = t.id where Date > '
Set @Sql = @Sql + '[' + Cast(@Counter + 1 as varchar) + '] '
Set @Sql = @Sql + 'group by a.id) t on a.id = t.id'
Exec(@Sql)
Set @Counter = @Counter - 1
End
End
Next, we use the
DateDiff
function to find the difference between days in the horizontal table. The Coalesce
function is used if the date is null
(indicating that the width is shorter for that ID) and sets the difference to 0. Coalescing the null
s to 0 will be accounted for when we do the final average. As usual, a loop is used to loop through all of the columns that have been dynamically created. Since the Insert Into
syntax is being used, there will be another table created. Collapse | Copy Code
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Create Table with days between activity
Set @Sql = 'Select id '
While @Counter > 1
Begin
Set @Sql = @Sql + ',Coalesce(DateDiff(dd,'
Set @Sql = @Sql + '[' + Cast(@Counter as varchar) + '], '
Set @Sql = @Sql + '[' + Cast(@Counter - 1 as varchar) + ']),0) '
Set @Sql = @Sql + 'As [DateDiff' + Cast(@Counter as varchar) + ']'
Set @Counter = @Counter - 1
End
Set @Sql = @Sql + ' into ##Averages from ##Activity'
if object_id('tempdb..##Averages') is not null
drop table ##Averages
Exec(@sql)
The final part to this puzzle is to unpivot the data. Since we have a table that is in the right format, we can use the unpivot syntax.
while
loops are used twice to go through the columns: one time for the upper unpivot syntax, and one time for the lower unpivot syntax. The unpivot query is a sub-query of the Avg query. In the query that gets the average, a where
clause is used to average only the columns that have a value greater than 0. Collapse | Copy Code
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Unpivot
Set @Sql = 'Select id, Avg(Cast(avgs as float))'
Set @sql = @Sql + ' from (select id, DiffDays,avgs '
Set @Sql = @Sql + ' from (select id'
while @Counter > 1
Begin
Set @Sql = @Sql + ',[datediff' + Cast(@Counter as varchar) + ']'
Set @Counter = @counter - 1
End
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Sql = @Sql + ' from ##averages) a unpivot (avgs for diffdays in ('
While @Counter > 1
Begin
if(@counter = @Max)
Begin
Set @Sql = @Sql + '[datediff' + Cast(@counter as varchar) + ']'
Set @Counter = @Counter - 1
End
Else
Begin
Set @Sql = @Sql + ',[datediff' + cast(@Counter as varchar) + ']'
Set @Counter = @Counter - 1
End
End
Set @Sql = @Sql + ')) as unpvt) f where avgs > 0 group by id'
Exec (@Sql)
After all of that has executed, the results will turn out to be:
ID | AverageDays |
1 | 4.5 |
2 | 7.5 |
If you haven't noticed yet, we haven't used the pivot syntax. Why is that? Since we were building the table horizontally as we went on, we didn't need to use the pivot statement. Also, the pivot syntax wouldn't have helped much because we would have ended up doing the same steps with the population and the finding the difference in days. The unpivot statement was used because we have a table that was in a good format for that statement.
Ending notes
The evil software vendors don't always make data sets that can be used for quick and easy BI analysis that your evil boss wants.
Thanks for reading this far (I know it's quite a big article). Good luck!
The code as a whole
Sample 1:
Collapse | Copy Code
if object_id('tempdb..##Tables') is not null
drop table ##Tables
select
c.Name As ColumnName
,t.Name As DataType
,o.Name As TableName
Into ##Tables
from sys.columns c
join sys.objects o on
o.object_id = c.object_id
And o.type_desc = 'USER_TABLE'
join sys.types t on
t.system_type_id = c.system_type_id
Declare @sql varchar(max)
,@DtSql varchar(max)
,@DataType varchar(100)
,@c bit
Set @DtSql = ''
Set @c = 0
Declare data_type cursor
for Select Distinct DataType
From ##Tables
Order By 1
Open data_type
fetch next from data_type
into @DataType
While @@Fetch_Status = 0
Begin
if(@c = 1)
Set @dtSql = @DtSql + ',[' + @DataType + ']'
else
Begin
Set @DtSql = @DtSql + '[' + @DataType + ']'
Set @c = 1
End
Fetch next from data_type
into @DataType
End
close data_type
deallocate data_type
print @DtSql
Set @Sql = 'Select TableName, ' + @DtSql
Set @Sql = @Sql + '
From
(Select ColumnName, DataType, TableName From ##Tables) p
Pivot
(
Count(ColumnName)
For DataType in (' + @DtSql + ' )
) as pvt'
print @sql
exec(@Sql)
Sample 2
Collapse | Copy Code
--temp is the table we're going to use
if object_id('tempdb..##Temp') is not null
drop table ##Temp
Create Table ##Temp
(
Id int
,Date datetime
)
--Insert test values
Insert Into ##Temp
select 1, '20080701'
union all select 1, '20080705'
union all select 1, '20080710'
union all select 2, '20080615'
union all select 2, '20080627'
union all select 2, '20080702'
union all select 2, '20080711'
union all select 2, '20080714'
--Declare variables
Declare @Sql varchar(max)
,@Counter int
,@Max int
--Set counter to the most activity
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Max = @Counter
--Create a table with the width of the most activity
Set @Sql = 'Create Table ##Activity ( id int '
While @Counter > 0
Begin
Set @Sql = @Sql + ',[' + cast(@counter as varchar) + '] datetime'
Set @Counter = @Counter - 1
End
Set @Sql = @Sql + ')'
--Make sure table doesn't exist
if object_id('tempdb..##Activity') is not null
Drop Table ##Activity
Exec (@sql)
--Insert id's into the table that was created
Insert Into ##Activity (id)
Select distinct Id From ##Temp
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Loop through and insert dates
While @Counter > 0
Begin
If (@Counter = @Max)
Begin
Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar)+ '] '
Set @Sql = @Sql + '=FirstActivity from ##Activity a '
Set @Sql = @Sql + 'Left Join (Select a.Id, min(Date) as FirstActivity From
##Activity a '
Set @Sql = @Sql + 'Left Join ##Temp t on a.Id = t.Id '
Set @Sql = @Sql + 'Group By a.id) t on a.id = t.Id'
Exec(@Sql)
Set @Counter = @Counter - 1
End
Else
Begin
Set @Sql = 'Update ##Activity Set [' + Cast(@Counter as varchar) + '] '
Set @Sql = @Sql + '= Activity from ##Activity a '
Set @Sql = @Sql + 'Left Join (Select a.Id, Min(Date) as Activity '
Set @Sql = @Sql + 'From ##Activity a Left Join ##Temp t on '
Set @Sql = @Sql + 'a.id = t.id where Date > '
Set @Sql = @Sql + '[' + Cast(@Counter + 1 as varchar) + '] '
Set @Sql = @Sql + 'group by a.id) t on a.id = t.id'
Exec(@Sql)
Set @Counter = @Counter - 1
End
End
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Create Table with days between activity
Set @Sql = 'Select id '
While @Counter > 1
Begin
Set @Sql = @Sql + ',Coalesce(DateDiff(dd,'
Set @Sql = @Sql + '[' + Cast(@Counter as varchar) + '], '
Set @Sql = @Sql + '[' + Cast(@Counter - 1 as varchar) + ']),0) '
Set @Sql = @Sql + 'As [DateDiff' + Cast(@Counter as varchar) + ']'
Set @Counter = @Counter - 1
End
Set @Sql = @Sql + ' into ##Averages from ##Activity'
if object_id('tempdb..##Averages') is not null
drop table ##Averages
Exec(@sql)
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
--Unpivot
Set @Sql = 'Select id, Avg(Cast(avgs as float))'
Set @sql = @Sql + ' from (select id, DiffDays,avgs '
Set @Sql = @Sql + ' from (select id'
while @Counter > 1
Begin
Set @Sql = @Sql + ',[datediff' + Cast(@Counter as varchar) + ']'
Set @Counter = @counter - 1
End
--Reset Counter
Set @Counter = (Select Max(cnt)
From (Select Count(*) as cnt From ##temp group by Id) t)
Set @Sql = @Sql + ' from ##averages) a unpivot (avgs for diffdays in ('
While @Counter > 1
Begin
if(@counter = @Max)
Begin
Set @Sql = @Sql + '[datediff' + Cast(@counter as varchar) + ']'
Set @Counter = @Counter - 1
End
Else
Begin
Set @Sql = @Sql + ',[datediff' + cast(@Counter as varchar) + ']'
Set @Counter = @Counter - 1
End
End
Set @Sql = @Sql + ')) as unpvt) f where avgs > 0 group by id'
Exec (@Sql)
History
- 7/14/2008 - Published article.
No comments:
Post a Comment