Thursday, September 26, 2013

SQL 2008 Series: SQL Partitions / partitioning

Refer to the source video: SQL 2008 Series: SQL Partitions / partitioning
--------------------------------------------------------------------------------------------------------------
~
i.  A table is created on a partition scheme.
ii. The files where the data will be stored belong to filegroups.
iii.The partition function created is assigned to the scheme.
iv. Create table - the empid column is passed as the partitioning column.

~

Database properties:

                File: .mdf, .ldf
                File Group:

example:

           i. A table's empid column has values 1 - 5000
               a. Requirement: store empid 1 - 999     in Partition1
                                           1000 - 1999 in Partition2

Putting the data into buckets makes searches faster.

To do that:

1. Define a Partition Function ( specify the range of data to store)
2. Define a Partition Scheme ( on which the table can be created)
     a.Attach the Partition Function to the Scheme
     b.In the Scheme the File Groups are mentioned ( these are created in the Database first, before running the Scheme statement)
     c.In the Database properties, create files and specify the filegroup for each ( a T-SQL sketch follows this list)
     d.Create the table ( on the Scheme) - pass the empid column
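
The notes above create the filegroups and files through the Database properties UI; the same can be done with T-SQL. A minimal sketch, assuming a database named PartitionDB, filegroups fgp1..fgp6 and a hypothetical file path - adjust names and paths to your environment:

-- create one filegroup and one file per range (repeat for fgp2..fgp6)
ALTER DATABASE PartitionDB ADD FILEGROUP fgp1;
ALTER DATABASE PartitionDB ADD FILE
(
    NAME = 'fg1',                                -- logical file name
    FILENAME = 'C:\SQLData\PartitionDB_fg1.ndf', -- hypothetical path
    SIZE = 5MB
) TO FILEGROUP fgp1;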

--------------------------------------------------------------------------------------------------------------

create partition function
        UI:
        Create file groups in DB properties 
        Create file in DB properties ( associate with the above file groups)
create partition scheme
create table & associate with the scheme.
         populate table data

select $PARTITION.cust_part_func(empid)  - shows the partition number for each row, using the partition function created earlier

Alternate (UI): right click on the Table - choose "Storage" - "Create Partition", define the function & scheme, and it is done.
There is also a "Manage Partition" option, which allows you to modify existing partitions ( a T-SQL sketch of adding a new range follows).
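
A minimal sketch of that kind of modification in T-SQL - adding a new boundary to the existing function - assuming the cust_part_func / cust_part_scheme names used below and a spare filegroup fgp7 already added to the database:

-- tell the scheme which filegroup the next new partition should use
ALTER PARTITION SCHEME cust_part_scheme NEXT USED fgp7;
-- split the last range, creating a new partition for values >= 6000
ALTER PARTITION FUNCTION cust_part_func() SPLIT RANGE (6000);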

Syntax:
create partition function cust_part_func(int) as range right for values(1000, 2000,3000,4000,5000)

create partition scheme cust_part_scheme as partition cust_part_func to (fgp1,fgp2,fgp3,fgp4,fgp5,fgp6)

SQL Server Management Studio: right click the database ("PartitionDB"), Properties, choose "Filegroups" and enter the above names (fgp1..fgp6). Then choose "Files", create a file (say fg1) and associate it with the filegroup just created (fgp1 above), and so on.

create table [partition] (EMPID int identity(1,1) not null, empdate datetime null) on cust_part_scheme (EMPID);

declare @i int
set @i = 0
while @i < 10000
begin
    insert into [partition] (empdate) values (GETDATE());   -- EMPID is filled by the identity
    set @i = @i + 1
end

select $PARTITION.cust_part_func(EMPID) as 'partition number', * from [partition]


-- EMPIDs up to 999 show partition 1, then 1000 - 1999 show partition 2, and so on (with range right, the boundary value 1000 itself goes to partition 2)
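
A quick way to verify the distribution - a sketch reusing the same function and table names:

-- count the rows that landed in each partition
select $PARTITION.cust_part_func(EMPID) as partition_number, count(*) as row_count
from [partition]
group by $PARTITION.cust_part_func(EMPID)
order by partition_number;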

--------------------------------------------------------------------------------------------------------------
The table is split into ranges - different buckets on separate filegroups - so queries are faster.

Sql Server Management Studio:
1.Right click on the Database, choose Properties, "Files" - one can see the .mdf and .ldf file definitions.
Partitioning just places the table's data into different filegroups instead of only the primary .mdf file.
(When additional filegroups are defined, their .ndf files show up here too.)

2.steps to create partition.

a.Define a partition function ( specify how we are going to divide the data.)
b.Define a partition scheme ( associate the partition function to this scheme.)
c.Attach a table to it.

i.Create a Database by right clicking on Databases and choosing New Database, named [ PartitionDB ].
ii.In the new Database "PartitionDB", right click and choose "New Query".
iii.create partition function cust_part_func(int) as range right for values(1000, 2000,3000,4000,5000)

create partition scheme cust_part_scheme as partition cust_part_func to (fgp1,fgp2,fgp3,fgp4,fgp5,fgp6)

--fgp.. are filegroups (yet to be created). [note: run the scheme statement after the step below]
iv.Right click on "PartitionDB", Properties, choose "Filegroups" and enter the above names.
Then choose "Files", create a file (say fg1) and associate it with the filegroup just created (fgp1 above), and so on.

v. Create the table & associate it with the scheme.
create table [partition] (EMPID int identity(1,1) not null, empdate datetime null) on cust_part_scheme (EMPID);
vi.Populate data.

declare @i int
set @i = 0
while @i < 10000
begin
    insert into [partition] (empdate) values (GETDATE());
    set @i = @i + 1
end

select $PARTITION.cust_part_func(EMPID) as 'partition number', * from [partition]


-- EMPIDs up to 999 show partition 1, 1000 - 1999 show partition 2, and so on.


Partitioning in SQL Server 2012

The partition scheme maps the partition function to logical filegroups and physical files.

Thursday, September 19, 2013

SSAS:Beginner


Refer to the source video: SSAS Tutorial for Beginners - Basics of multidimensional cube video

Visual Studio - Project - Analysis Services Multidimensional Data

Star schema ( Fact - Dimension)
Solution Explorer

Data Sources:
  - new connection Database:
Data Source Views:
  - source the Data source : choose the tables
Cubes:
  -New Cube - use existing tables - Fact Table ( Measures ) -Dimensions
Dimensions:
  - the Dimensions are created from the above step
     ( add all the attributes required.)
Mining Structure
Roles


Properties:Deployment
Right click : Process

Deploying pushes the cube metadata to the server.

Click on the Run button and it completely deploys/processes the cube to the server instance.

Control Panel - Services - start the SQL Server Browser service.



Data source: choose the user credentials ( impersonation).
An error occurs if the particular login does not have access to the source database.


Test:Analysis Services

SSRS:Report


Refer to the source video: SSRS How to create a report video tutorial SSRS 2012
---------
Solution Explorer:
Project
Shared Data Sources:
        (In the "Edit" button specify the connection, Server,Database)
Shared Data Set:
Reports:

In the Report: 
1.Data Source: choose the defined "Shared Data Source", which has the connection to the server & database mentioned above.
2.DataSets: choose "Query" and define the query there.
3.Report: create the report by pulling in the "DataSets" fields.
For the report layout: View menu -> Toolbox -> Table.
--------
Report Data
------------------                                                          
Built-in Fields
Parameters
Images
Data Sources
DataSets
  -field1
  -field2

Solution Explorer:
-Shared Data Sources
-Shared Data Sets
-Reports
   -abc.rdl
1. Go to Visual Studio 2010
2.File - New -Project - Report Server Project
3.
Solution Explorer
SSRSProject
Shared Data Sources
Shared Datasets
Reports
4.
Solution Explorer
SSRSProject
Shared Data Sources ( Add New Data Source: Name, Type: Microsoft SQL Server)
Edit ( Connection: Server Name: . , use Windows Authentication, Database: )
5.
Reports:Add:New Item:Report:Name:abc.rdl:Add
6.Data Source:use the shared Data Source

7.DataSets: Add Dataset: Name: "Use a dataset embedded in my report": Data source:
Query Designer: Add table: Employee: choose Fields:
The SELECT statement created shows up as the report fields ( an example follows).
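
An illustrative embedded dataset query - the Employee table and column names here are hypothetical; the Query Designer generates something equivalent:

SELECT EmployeeID, FirstName, LastName, Department, Salary
FROM dbo.Employee;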

8.View: Toolbox: drag a Table; choose fields

9.Formatting:

Preview






Tuesday, September 17, 2013

Tuning SSIS Package

Refer to the source article: Faster Extraction & Loading by SSIS

1.Create indexes on the joining columns before starting extraction ( especially those that have a data type other than integer, like varchar, nvarchar, datetime, etc.) - see the sketch below.
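
A minimal sketch, with a hypothetical staging table and join column - the point is simply to index the join columns before the extract query runs:

-- hypothetical staging table; the varchar join key benefits most from an index
CREATE NONCLUSTERED INDEX IX_StageCustomer_CustomerCode
    ON dbo.StageCustomer (CustomerCode);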

2.Choose data type & size of the columns wisely.

3.
Try to avoid type casting or manipulating data types inside the SSIS package.
Do the typecasting in the source query of the OLE DB Source component ( see the sketch below).
Try to use numerical values in facts and in joining columns.
If you choose a numeric key, less space means more records per page, which means less
I/O for the same amount of data. Also, joining on a numeric key is usually faster than joining
on a varchar / nvarchar key.
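
A sketch of pushing the conversions into the OLE DB Source query instead of doing them in the pipeline - the table and column names are hypothetical:

-- conversions done in the source query, not inside the SSIS package
SELECT CAST(OrderID AS int)           AS OrderID,
       CONVERT(date, OrderDateTxt)    AS OrderDate,
       CAST(Amount AS decimal(18, 2)) AS Amount
FROM   dbo.StageOrders;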

4.SELECT only columns that you need.

5.OLE DB Destination: ( there are 5 types of Data Access Mode) use the Fast Load option.

The fast load option will use a BULK INSERT statement instead of an INSERT statement.

If the fast load option is not selected, then by default INSERT is used.

Keep Identity –
Keep Nulls –
Table Lock – the recommended option ( removes the overhead of lock escalation.)
Check Constraints – preferable to uncheck ( reduces the overhead for the pipeline engine.)
Rows per batch – RowsPerBatch is the number of rows you would want in one buffer.
( it is set automatically based on the RowSize and MaxBufferRows properties.)
Maximum insert commit size – commits rows in batches that are the smaller of either (a) the Maximum insert commit size, or (b) the remaining rows in the buffer that is currently being processed.

6.Try to run all load processes in parallel as long as possible. Run as many load processes as you have available CPUs. The number of parallel processes you can run is directly proportional to the number of logical processors (hyper-threading technology) you have in your system.

--A deadlock may occur when the same table is accessed by more than one process.
--But the Log Extract task in each parallel process uses the same table to log the extract information.

Parallel processes:
To avoid this type of problem, change the value of the IsolationLevel property from Serializable to ReadUncommitted for all Log Extract tasks. This will not take table-level locks and thus will not trigger any lock exception.

7.Network limitations: set the Packet Size property of the connection manager to a larger integer value.
The max value that you can set is 32767.

EXEC sp_configure 'show advanced options', '1';
RECONFIGURE;
EXEC sp_configure;                                      -- check current values
EXEC sp_configure 'network packet size (B)', '32767';
RECONFIGURE WITH OVERRIDE;
RECONFIGURE;
EXEC sp_configure;                                      -- confirm the new packet size

8.Derived Column Transformation:
The Derived Column transformation is usually used for data type conversion or for handling NULL values.

Two approaches:
- Derived Column transformation applied to all the records ( without a Conditional Split) - takes more time.
- Applied only to the records that have NULL values - with a Conditional Split.

9.If your package and SQL Server are running on the same system, then it is better to use the SQL Server destination instead of the OLE DB destination, to increase the performance of loading the data.

10.Avoid the Sort transformation - use an ORDER BY clause in the source SQL statement instead ( see the sketch below).
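
A sketch of sorting at the source, with hypothetical names; note the source output still has to be marked as sorted (IsSorted / SortKeyPosition in the advanced editor) for downstream components such as Merge Join to use it:

-- sort in the database engine, not with a Sort transformation
SELECT CustomerID, OrderDate, Amount
FROM   dbo.StageOrders
ORDER  BY CustomerID, OrderDate;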

11.Truncate vs Delete ( see the sketch below):
TRUNCATE removes all data from the table in one go and writes only a short log.
DELETE logs everything at the record level, decreasing performance.
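
Both statements below empty a hypothetical staging table; TRUNCATE is the cheaper choice when the whole table is reloaded:

TRUNCATE TABLE dbo.StageOrders;   -- minimally logged, also resets the identity seed
-- DELETE FROM dbo.StageOrders;   -- fully logged, row by row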

12.Lookup transformation with Derived Column transformation ( see the sketch below):
Lookup transformation: picks up values when the data gets matched.
Some values may not be present in the target table, such as NULL values ( convert these to predefined values like 0 or -1 so they can be matched against the target table).

Better: perform the Lookup transformation first, take all the records that are successful in the lookup, and apply the Derived Column transformation only to the remaining bad records.
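
The article does this with the SSIS Derived Column / Conditional Split components; as an alternative sketch, the NULL keys can be pre-handled in the T-SQL source query ( hypothetical names, -1 used as the sentinel):

SELECT ISNULL(CustomerCode, '-1') AS CustomerCode,
       OrderDate,
       Amount
FROM   dbo.StageOrders;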

13.A common table expression (CTE) is usually an easier and faster approach than nesting subqueries when dealing with multiple aggregate functions ( see the sketch below).
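
A sketch with a hypothetical Orders table - the CTE defines the per-customer aggregate once and reuses it by name instead of repeating a nested subquery:

WITH CustomerTotals AS
(
    SELECT CustomerID,
           SUM(Amount) AS TotalAmount,
           COUNT(*)    AS OrderCount
    FROM   dbo.Orders
    GROUP BY CustomerID
)
SELECT o.OrderID, o.CustomerID, ct.TotalAmount, ct.OrderCount
FROM   dbo.Orders o
JOIN   CustomerTotals ct ON ct.CustomerID = o.CustomerID;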

--------------------------------------------------------------------------------------------------------------------------------

Source: SQL Server Integration Services SSIS Best Practices


Best Practice #1 - Pulling High Volumes of Data
...
Then we came up with an approach: make the target table a heap by dropping all the indexes on the target table at the beginning, transfer the data to the heap, and on completion of the data transfer, recreate the indexes on the target table. With this approach, the whole process (dropping indexes, transferring data and recreating indexes) took just 3-4 hours, which was what we were expecting.

So the recommendation is to consider dropping your target table indexes, if possible, before inserting data into it, especially if the volume of inserts is very high ( a sketch follows).
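
A minimal sketch of the drop / load / recreate pattern, with hypothetical table and index names - in a package these statements would typically sit in Execute SQL Tasks before and after the Data Flow Task:

-- before the data flow: drop nonclustered indexes so the target behaves like a heap
DROP INDEX IX_FactSales_DateKey ON dbo.FactSales;
DROP INDEX IX_FactSales_CustomerKey ON dbo.FactSales;

-- (the data flow task transfers the rows here)

-- after the data flow: recreate the indexes
CREATE NONCLUSTERED INDEX IX_FactSales_DateKey ON dbo.FactSales (DateKey);
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);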
....
Best Practice #2 - Avoid SELECT *

Beware when you are using the "Table or view" or "Table name or view name from variable" data access modes in the OLEDB source. They behave like SELECT * and pull all the columns; use these access modes only if you need all the columns of the table or view from the source at the destination.

Tip: try to fit as many rows as possible into each buffer; this reduces the number of buffers passing through the dataflow pipeline engine and improves performance.

Best Practice #3 - Effect of OLEDB Destination Settings

Data Access Mode - use the 'fast load' option (it uses a BULK INSERT statement) instead of a simple INSERT statement. If you select the 'fast load' option, there are also a couple of other settings:

Keep Identity –
Keep Nulls –
Table Lock – *
Check Constraints –

Best Practice #4 - Effect of Rows Per Batch and Maximum Insert Commit Size Settings

Rows per batch – The default value for this setting is -1 which specifies all incoming rows will be treated as a single batch.

Maximum insert commit size – The default value for this setting is '2147483647' (largest value for 4 byte integer type) which specifies all incoming rows will be committed once on successful completion. 

You can specify a positive value for this setting to indicate that a commit will be done for that number of records. You might be wondering whether changing the default value for this setting will put overhead on the dataflow engine to commit several times. Yes, that is true, but at the same time it releases the pressure on the transaction log and tempdb, which would otherwise grow tremendously, specifically during high volume data transfers.


Best Practice #5 - SQL Server Destination Adapter

Local SQL Server database:
If the target is a local SQL Server database, performance-wise it is similar to the "Bulk Insert Task". Its advantage over the "Bulk Insert Task" is that it can transform data before uploading it to the destination.
Along with the OLEDB destination adapter options, the SQL Server Destination Adapter has more options:
an "insert trigger" option, which can validate business rules, can be set; you can specify the number of first/last rows in the input to load, and the maximum number of errors allowed before the bulk load operation is cancelled.
The insert column sort order, which is used in the upload process, can also be specified.

SQL Server database on a remote server:
The SQL Server Destination Adapter cannot be used; use the OLEDB destination adapter instead. If the destination may change from local to remote, or from one SQL Server instance to another, use the OLEDB destination adapter to
minimize future changes.

Best Practice #6 - Avoid asynchronous transformations ( such as the Sort transformation)

How SSIS works internally:
1.The SSIS runtime engine executes the package.
2.It executes every task other than the "data flow task" in the defined sequence.
3.The SSIS runtime engine hands the "data flow task" over to the "data flow pipeline engine".
4.The data flow pipeline engine breaks the data flow task into one or more execution trees that run in parallel to achieve high performance.

Synchronous transformation:
1.Gets a record, processes it, and passes it to the next transformation or to the destination.

Refer to the source article; it has many more tips...

Thursday, September 12, 2013

agile and waterfall - Software as a service (SaaS)

1. What is Agile software development?

Agile_software_development

One of the differences between agile and waterfall is that testing of the software is conducted at different points during the software development lifecycle. In the waterfall model, there is a separate testing phase after implementation. In Agile (XP), testing is done concurrently with implementation.
------------------------------------------------------------------------------------------------------------

2. What is Software as a service (SaaS)?

Software as a Service (SaaS) is a model of software deployment where an application is hosted as a service outside of the customer’s site and delivered to customers across the Internet.

saas - in nutshell

saas - deployment model
--
Software_as_a_service

SaaS, sometimes referred to as "on-demand software", is a software delivery model in which software and associated data are centrally hosted on the cloud. SaaS is typically accessed by users using a thin client via a web browser.

SaaS has become a common delivery model for many business applications, including accounting, collaboration, customer relationship management (CRM), management information systems (MIS), enterprise resource planning (ERP), invoicing, human resource management (HRM), content management (CM) and service desk management.

The term "software as a service" (SaaS) is considered to be part of the nomenclature of cloud computing, along with infrastructure as a service (IaaS), platform as a service (PaaS),desktop as a service (DaaS), and backend as a service (BaaS).[8]


Architecture

While an exception rather than the norm, some SaaS solutions do not use multi-tenancy, or use other mechanisms—such as virtualization—to cost-effectively manage a large number of customers in place of multi-tenancy.[12] Whether multi-tenancy is a necessary component for software-as-a-service is a topic of controversy.[13]


Open integration protocols


Since SaaS applications cannot access a company's internal systems (databases or internal services), they predominantly offer integration protocols and application programming interfaces (APIs) that operate over a wide area network. Typically, these are protocols based on HTTP, REST, SOAP and JSON.


References:

Developing a SaaS product? Scrum is NOT for you.

Scrum:
A more methodical way of developing.

------------------------------------------------------------------------------------------------------------

Relational - Dimensional Modelling

2. Relational - Dimensional Modelling


In simple words, OLTP normalized databases are designed from the most optimal "transactional" point of view. Databases are normalized to work optimally for a transactional system. By optimization of a transactional system I mean getting to a database design where all transactional operations - delete, insert, update and select - are balanced to give equal or optimum importance to all of them at any point in time, as they are equally valued in a transactional system.
And that is what a normalized system offers: the minimal updates possible for a data change, the minimal inserts possible for a new entry, a one-place delete for a category deletion (e.g. a new product category), and so on. All this is possible because we branch out and create master tables, but it comes at the cost of slower "select" operations. As I said, normalization is not the most efficient model for all operations - it is "optimal" - and, having said that, we have other methods to enhance data-fetching speed, like indexing.
On the other hand, the dimensional model (mostly used for data warehouse design) is meant to give importance to only one kind of operation: selection of data. In a data warehouse, data updates/inserts happen periodically and are a one-time cost.
So if one tries to tweak a normalized data structure so that selection is the most important operation at any point in time, we end up with a denormalized (I would say partially denormalized) dimensional star structure ( a DDL sketch follows this list).
  • all the foreign keys are in one place, the fact - no dimension-to-dimension join (i.e. master-to-master table join); a snowflake represents the same dimension split out
    • ideally designed facts carry only numbers - measures or foreign keys
    • dimensions are used to carry descriptions and non-aggregatable info
    • redundancy of data is accepted, but in rare cases, if a dimension itself grows too much, a snowflake design is seen as an option - though it is still avoidable
For details please go through detailed books on this topic.
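
A minimal star-schema DDL sketch with hypothetical table and column names, illustrating the points above ( the fact holds keys and measures, the dimension holds descriptions):

-- dimension: descriptive, non-aggregatable attributes
CREATE TABLE dbo.DimProduct
(
    ProductKey  int IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    ProductName nvarchar(100) NOT NULL,
    Category    nvarchar(50)  NOT NULL
);

-- fact: only foreign keys and numeric measures
CREATE TABLE dbo.FactSales
(
    DateKey     int NOT NULL,
    ProductKey  int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    SalesAmount decimal(18, 2) NOT NULL,
    Quantity    int NOT NULL
);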

What is the difference between Data Warehousing and Business Intelligence?

1.What is the difference between Data Warehousing and Business Intelligence?
ans.

In simple terms, data is stored in the data warehouse. From this stored data, management makes decisions by looking at reports or by using a data analysis tool.

In the Microsoft stack, the BI tools are Analysis Services, Reporting Services and Integration Services.

In data warehousing, the focus is more on data modelling, normalization and optimizing the data - deciding on a Star schema (OLAP applications - Fact and Dimension tables) or a Snowflake schema ( RDBMS applications).

Thursday, September 5, 2013

HOLAP, MOLAP and ROLAP

DWH Concepts and Fundamentals

Online Analytical Processing (OLAP)

Multidimensional OLAP (MOLAP) - a multi-dimensional graph having various axes.

Each axis represents a category of business called a dimension (eg location, product, time, employee etc). The data points are the business transactional values called measures.

Relational OLAP (ROLAP) – data stored in relational Fact and Dimension tables.

MOLAP, ROLAP, And HOLAP