Wednesday, June 8, 2011

How to compute the hash code of a file using C#

Hash codes are useful for finding objects in a database and for checking the integrity of a file. This article shows how to compute the RIPEMD-160 checksum of a file using C#.

Code sample

The following code sample shows how to compute the checksum of a file using C#. The ComputeHash method of the RIPEMD160 object creates the hash code and returns it as a byte array. The byte array is then converted into a string of hex digits.

using System;
using System.Text;
using System.IO;
using System.Security.Cryptography;

namespace HashCodeTut
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
                return;

            string hashcode = ComputeHash(args[0]);
            Console.WriteLine(hashcode);
        }

        static string ComputeHash(string filename)
        {
            FileInfo fi = new FileInfo(filename);
            FileStream fs = fi.OpenRead();

            RIPEMD160 ripemd = new RIPEMD160Managed();
            byte[] hashbytes = ripemd.ComputeHash(fs);
            fs.Close();

            StringBuilder sb = new StringBuilder();

            foreach (byte hb in hashbytes)
                sb.AppendFormat("{0:X2}", hb);

            return sb.ToString();
        }
    }
}
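
The same pattern works for the other hash algorithms in System.Security.Cryptography; only the algorithm object changes. As a rough sketch (not part of the original sample), a SHA-256 version of ComputeHash could look like this, assuming the same using directives as above:

        // Sketch: SHA-256 variant of ComputeHash.
        // Only the hash algorithm object differs from the RIPEMD-160 version above.
        static string ComputeHashSha256(string filename)
        {
            using (FileStream fs = File.OpenRead(filename))
            using (SHA256 sha = SHA256.Create())
            {
                byte[] hashbytes = sha.ComputeHash(fs);

                StringBuilder sb = new StringBuilder();
                foreach (byte hb in hashbytes)
                    sb.AppendFormat("{0:X2}", hb);

                return sb.ToString();
            }
        }

SHA-256 produces 32 bytes, so the resulting hex string has 64 characters instead of the 40 returned by RIPEMD-160.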

Hash function benchmark

Find a benchmark of the hashing functions RIPEMD160, SHA1, SHA256, SHA384, SHA512 and MD5 here:
Benchmark of hash functions included in Microsoft .Net Framework 4

DiskHash File Compare Tool

The DiskHash File Compare Tool creates hash codes to detect file system changes.
Download: DiskHash v1.0 File Compare Tool

Thursday, June 2, 2011

Tutorial 4: How to import RSS feeds into MS SQL 2008 (Part 4)

Part 1 | Part 2 | Part 3 | Part 4

The previous parts explained how to create the SSIS package to import
RSS feeds. This fourth and last part of the series covers how to set up the
database tables and stored procedures that are called by the import package.


The headline table

The table 'fan.TutorialHeadline' is the destination of the RSS feed import. See
its definition below.

-------------------------------------
-- final import destination
-------------------------------------

CREATE TABLE fan.TutorialHeadline
(
    id int PRIMARY KEY IDENTITY (0,1),
    publisher varchar(50),
    title varchar(255) NOT NULL,
    content varchar(MAX) NULL,
    import_date datetime DEFAULT GETDATE(),
    linkurl varchar(255) NOT NULL,
    merge_hash binary(16) NULL
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [Ix_merge_hash] ON [fan].[TutorialHeadline]
(
    [merge_hash] ASC
)
GO


The import buffer table

To avoid duplicate entries we do not update the headline table directly. Instead,
we import the RSS feed data into a separate import table and merge only the
missing rows into the headline table. In this tutorial the definitions of the
headline table and the import table are identical. See the definition of the
import table below.

-------------------------------------
-- import buffer table
-------------------------------------

CREATE TABLE fan.TutorialImport
(
    id int PRIMARY KEY IDENTITY (0,1),
    publisher varchar(50),
    title varchar(255) NOT NULL,
    content varchar(MAX) NULL,
    import_date datetime DEFAULT GETDATE(),
    linkurl varchar(255) NOT NULL,
    merge_hash binary(16) NULL
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [Ix_merge_hash] ON [fan].[TutorialImport]
(
    [merge_hash] ASC
)
GO


The hashing function

You may have noticed the indexed column merge_hash in both tables. Since RSS
doesn't provide a unique ID to identify an RSS item, I create a unique
hash code, which is calculated from the feed title and content using MD5. The
function is used by the stored procedure that imports the data. See the
function definition below.

-----------------------------------
--- hash function
-----------------------------------


CREATE FUNCTION fan.TutorialHash
(@title varchar(100), @content varchar(MAX))
RETURNS binary(16)
AS
BEGIN
    -- create merge hash
    DECLARE @hash binary(16);
    DECLARE @tc8000 varchar(8000);

    SET @tc8000 = CONVERT(varchar(8000), CAST(@title AS varchar(MAX)) + ' ' + @content);
    SET @hash = CAST(HASHBYTES('MD5', @tc8000) AS binary(16));

    RETURN @hash;
END
GO
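
For reference, the same merge hash can also be reproduced on the client side, for example inside the SSIS script task. The following C# sketch is purely illustrative and not part of the original package: HASHBYTES operates on the raw varchar bytes, so the result only matches the T-SQL function if the string is encoded with the same code page as the column collation (Windows-1252 is assumed here) and truncated to 8000 characters in the same way.

// Illustrative only: recompute the T-SQL MD5 merge hash in C#.
// Assumes System.Text and System.Security.Cryptography are imported and
// that the varchar columns use a Windows-1252 collation.
private static byte[] ComputeMergeHash(string title, string content)
{
    string combined = title + " " + content;
    if (combined.Length > 8000)                   // mimic the varchar(8000) truncation
        combined = combined.Substring(0, 8000);

    byte[] bytes = Encoding.GetEncoding(1252).GetBytes(combined);

    using (MD5 md5 = MD5.Create())
        return md5.ComputeHash(bytes);            // 16 bytes, same size as binary(16)
}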


The import procedure

The stored procedure 'fan.TutorialAddHeadline' is called by the SSIS package
created in the previous parts of this tutorial. The procedure is executed for
each item in an RSS feed. It computes the unique hash code and inserts a new row
into the import buffer table.

------------------------------------------
-- TutorialAddHeadline procedure
------------------------------------------

CREATE PROCEDURE fan.TutorialAddHeadline
@title varchar(100), @content varchar(MAX), @publisher varchar(50), @linkurl varchar(255)
AS
BEGIN
    -- create merge hash
    DECLARE @hash binary(16);
    SET @hash = fan.TutorialHash(@title, @content);

    -- create headline
    INSERT INTO fan.TutorialImport (publisher, title, content, linkurl, merge_hash)
    VALUES (@publisher, @title, @content, @linkurl, @hash);
END
GO


Merge recently imported and existing data

The stored procedure 'fan.TutorialMerge' copies all rows from the import
buffer table to the destination table. It uses the MERGE command, which is new
in MS SQL 2008. The MERGE command uses the calculated hash code to find missing
rows in the destination table and inserts only new rows. After the merge the
import buffer table is cleared.
This procedure has to be called by the SSIS package after the import. The next
step describes how to add this last task to the SSIS package. See the merge
procedure code listing below.


---------------------------------------
-- merge imported and existing data
---------------------------------------


CREATE PROCEDURE fan.TutorialMerge
AS
BEGIN
    MERGE fan.TutorialHeadline AS h
    USING fan.TutorialImport AS m
    ON h.merge_hash = m.merge_hash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (publisher, title, content, import_date, linkurl, merge_hash)
        VALUES (m.publisher, m.title, m.content, m.import_date, m.linkurl, m.merge_hash);

    DELETE FROM fan.TutorialImport;
END
GO


Adding the merge procedure to the SSIS package

The merge procedure 'fan.TutorialMerge' has to be executed after the import
to update the RSS data. Follow the steps below to add it to the SSIS package.
1. Add a new 'Execute SQL' task to the SSIS package
2. Click on the foreach loop container and wait for the arrow to appear. Then
drag it onto the 'Execute SQL' task
3. Right-click the 'Execute SQL' task and choose Edit
4. Make sure the task connects to your database, then edit the SQLStatement
attribute and enter the following code. Replace 'CelebrityDB' with your
database name.

USE CelebrityDB -- replace db name
GO

EXEC fan.TutorialMerge


5. Click OK to close the editor

The SSIS package is now complete. You can now deploy it on the server. When
you execute the package, it should look like the one in the image below.

Executed SSIS package:


Final Result

After the import the RSS feed data is stored in the table 'fan.TutorialHeadline'.
You can now read the RSS feeds in SQL Server Management Studio.

Result of the RSS import:




Thank you for reading this tutorial!

More tutorials are coming soon. Your feedback helps me a lot to improve this
blog. Please leave a comment!

Wednesday, June 1, 2011

Tutorial 3: How to import RSS feeds into MS SQL 2008 (Part 3)

Part 1 | Part 2 | Part 3 | Part 4

In the previous part of this tutorial you created the script tasks and
containers to download RSS feeds and iterate through every XML file. This part
shows the code listing of the RSS import Script Task created earlier inside the
Foreach Loop Container.

Place the following code listings inside the RSS import Script Task.

Using Directives

Add the following using directives at the top of the C# script.

using System.Data.SqlClient;
using System.Data.SqlTypes;
using System.Xml;


ScriptMain class

Replace the existing Main() method with the following code. The Main() method is
executed for each XML file in the folder specified earlier. If you are familiar
with ADO.NET programming, notice how differently the Connection object is created
and that its Close() method is never called. The SQL import itself is done by
the AddHeadline() method, which is called for every item in the RSS feed XML file.
The method calls a stored procedure named 'fan.TutorialAddHeadline', which is
described in the next part of this tutorial.

Things you might want to change in the code:

1. Server name: Change 'Data Source=sql1' in the ConnectDB() method to
your server name

2. Database name: Change 'Initial Catalog=CelebrityDB' in the ConnectDB() method
to your db name



private string CurrentFilePath;
private SqlConnection con;

public void Main()
{
    // TODO: Add your code here
    CurrentFilePath = (string)Dts.Variables["CurrentFilePath"].Value;
    try
    {
        ConnectDB();
        ImportRSS(CurrentFilePath);
        Dts.TaskResult = (int)ScriptResults.Success;
    }
    catch (Exception)
    {
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
}


private void ConnectDB()
{
    ConnectionManager cm = Dts.Connections.Add("ADO.NET");
    cm.ConnectionString =
        "Data Source=sql1;Initial Catalog=CelebrityDB;Integrated Security=SSPI;";
    con = (SqlConnection)cm.AcquireConnection(null);
}


private void AddHeadline(string pTitle, string pContent, string pPublisher, string pLinkurl)
{
    SqlCommand c = con.CreateCommand();
    c.CommandType = CommandType.StoredProcedure;

    // Enter the name of your stored procedure here
    c.CommandText = "fan.TutorialAddHeadline";

    // Change the parameter names of your stored procedure here
    SqlParameter sqlTitle = new SqlParameter("@title", pTitle);
    SqlParameter sqlContent = new SqlParameter("@content", pContent);
    SqlParameter sqlPublisher = new SqlParameter("@publisher", pPublisher);
    SqlParameter sqlLink = new SqlParameter("@linkurl", pLinkurl);

    sqlTitle.Direction = ParameterDirection.Input;
    sqlContent.Direction = ParameterDirection.Input;
    sqlPublisher.Direction = ParameterDirection.Input;
    sqlLink.Direction = ParameterDirection.Input;

    c.Parameters.Add(sqlTitle);
    c.Parameters.Add(sqlContent);
    c.Parameters.Add(sqlPublisher);
    c.Parameters.Add(sqlLink);

    c.ExecuteNonQuery();
}


private void ImportRSS(string filename)
{
    XmlDocument doc = new XmlDocument();
    doc.Load(filename);

    XmlNamespaceManager xnm = new XmlNamespaceManager(doc.NameTable);
    xnm.AddNamespace("dc", "http://purl.org/dc/elements/1.1/");

    XmlNodeList itemList = doc.GetElementsByTagName("item");

    XmlNode xcurrent, xtitle, xlink, xdescription, xcreator;
    string title, content, publisher, link;

    for (int i = 0; i < itemList.Count; i++)
    {
        xcurrent = itemList[i];
        xtitle = xcurrent.SelectSingleNode("title");
        xlink = xcurrent.SelectSingleNode("link");
        xdescription = xcurrent.SelectSingleNode("description");
        xcreator = xcurrent.SelectSingleNode("dc:creator", xnm);

        title = ""; content = ""; publisher = ""; link = "";

        if (xtitle != null)
            title = xtitle.InnerText;
        if (xdescription != null)
            content = xdescription.InnerText;
        if (xcreator != null)
            publisher = xcreator.InnerText;
        if (xlink != null)
            link = xlink.InnerText;

        AddHeadline(title, content, publisher, link);
    }
}




Next: Tutorial Part 4

In the next part the stored procedures and tables are created and the imported data is merged using the new MERGE SQL Command.

Tutorial 2: How to import RSS feeds into MS SQL 2008 (Part 2)

Part 1 | Part 2 | Part 3 | Part 4

Part 1 explained how to download RSS feeds and save them into a folder. This part covers how to wait for all downloads to complete and iterate through all downloaded files.

Wait until all downloads are completed

In the previous part we created several independent script tasks to download RSS feeds. They are executed concurrently. Before we can import the data we have to wait until all script tasks are finished. The logical object that represents all tasks as a whole is the Sequence Container. It will be used later to define
the Precedence Constraint (the blue arrow in the image).

1. Create a new Sequence Container

2. Put all Script Tasks into the sequence container

3. Make sure the properties FailPackageOnFailure and FailParentOnFailure are set to 'False' because we don't want the package to fail when a single website is temporarily offline.

All script tasks put into a sequence container:


Iterate through all downloaded XML files

The Foreach Loop Container iterates through all files in a specified folder
and passes the filename as a parameter to another Script Task, which imports the
XML data.

1. Create a Foreach Loop Container and edit it using the Container Editor

2. On the Collection page choose the file-enumerator

3. Enter the path of the folder where the downloaded RSS feeds are saved

4. Enter *.xml into the Files field

5. Select fully qualified filenames

6. On the Variable Mappings page create a new variable with index 0 and name
'CurrentFilePath'. It should be located in the namespace 'Package' and should be
of type 'String'

7. Done, click OK to close the editor.


Add the Precedence Constraint

In this step we add the precedence constraint, which tells the foreach loop
container to start iterating through all files once the downloads are completed.
For now the foreach loop is still empty; the XML import task will be added in
the next step.

1. Click on the Sequence Container we created earlier and wait for the green
arrow to appear

2. Drag and drop the green arrow onto the Foreach Loop Container

3. Right-click the green arrow and choose Edit

4. Set its Evaluation operation to Constraint

5. Set its Value to Completion

6. Select Logical AND and click OK to close the editor


Create the Script Task for the XML import

1. Create a Script Task and place it inside the Foreach Loop Container

2. Right-click it and choose Edit

3. Insert 'CurrentFilePath' into the ReadOnlyVariables Property

4. Click on OK to close the editor

The package should now look like the image below.

Package with empty Script Task for the XML import:


Next: Tutorial Part 3

The next part shows the code listing of the RSS Import Task

Tutorial 1: How to import RSS feeds into MS SQL 2008 (Part 1)

Part 1 | Part 2 | Part 3 | Part 4
RSS, or Really Simple Syndication, is a standardized XML format for publishing
blogs and frequently updated websites. MS SQL 2008 is capable of importing XML files out of the box, but the built-in tools raise errors when trying to import RSS feeds. This article offers a brief tutorial on how to download and import RSS feeds from a web page using the SQL Server Business Intelligence Development Studio.



What can you do with RSS feeds?

First of all, there are numerous applications that can download and view RSS
feeds. So what can you do with SQL? The main reason to import an RSS feed into a
database is to do further processing, send notifications or apply filters to one
or more feeds. Media monitoring is one possible use case for importing RSS feeds
into a database. SQL Server allows you to search RSS feeds for brand names,
company names, subjects, persons and any other keyword and to use the
information for report generation.

What do you have to consider when processing RSS feeds?

When you consume RSS feeds you have to consider the terms and conditions of
the RSS publisher and the copyright laws of the publisher's country. In many
cases the publisher may not want you to publish the feed data on a website or
intranet portal. However, in countries like Austria and Germany a company is
allowed to create paper copies of the publisher's data for its own use, so
aggregating the data and showing it in a printable report is a possible
solution.

Tutorial: How to download and import RSS feeds into MS SQL 2008

This tutorial is written for developers and IT professionals who are familiar
with SQL Server Business Intelligence Studio and SQL. It first shows how to
download and parse the RSS feeds using an SSIS package and then how to set up
the database tables and procedures.

Downloading RSS feeds

1. Create a new Integration Services Project using the SQL Server Business
Intelligence Studio.

2. Then add a new script task and edit its source code. I use C#.

3. At the top of the script, add the import of the System.Net namespace if it's
not already there.
using System.Net;

4. Add the method DownloadFeed below the Main() method and replace the URL
and filename. The URL is the RSS feed URL; the filename is the destination
where the feed will be stored. In this tutorial all feeds are saved in the same
folder. The import routine then reads all RSS XML files in that folder.
public bool DownloadFeed()
{
  WebClient web = new WebClient();
  try
  {
    web.DownloadFile(
      @"http://feeds.thehollywoodgossip.com/TheHollywoodGossip?format=xml",
      @"e:\import\TheHollywoodGossip.xml");
    return true;
  }
  catch (WebException)
  {
    return false;
  }
  finally
  {
     web.Dispose();
  }
}

5. Insert the following code into the Main() method
public void Main()
{
  // TODO: Add your code here
  if (DownloadFeed())
    Dts.TaskResult = (int)ScriptResults.Success;
  else
    Dts.TaskResult = (int)ScriptResults.Failure;
}

6. Repeat steps 1 to 5 for each XML feed you want to download and save your
package. The main advantage of creating separate script tasks for each RSS feed
is that they can be executed in parallel. The final result should look like the
image below. (A parameterized variant of DownloadFeed() is sketched at the end
of this part.)


The created script tasks:
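
As a small refinement that is not part of the original package, the URL and path do not have to be hard-coded inside every copy of DownloadFeed(). Each script task could instead call a shared helper that takes the feed URL and target file as parameters. A minimal sketch:

// Sketch only (not part of the original package): a parameterized download helper.
// Each script task still runs in parallel; it just passes its own feed URL and
// target file instead of hard-coding them inside DownloadFeed().
public bool DownloadFeed(string feedUrl, string targetFile)
{
  WebClient web = new WebClient();
  try
  {
    web.DownloadFile(feedUrl, targetFile);
    return true;
  }
  catch (WebException)
  {
    return false;
  }
  finally
  {
    web.Dispose();
  }
}

// Example call from the Main() method of one script task:
// if (DownloadFeed(@"http://feeds.thehollywoodgossip.com/TheHollywoodGossip?format=xml",
//                  @"e:\import\TheHollywoodGossip.xml"))
//   Dts.TaskResult = (int)ScriptResults.Success;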




Next: Tutorial Part 2 

Learn in the next part how to parse the RSS xml files after downloading.