Web Scraping Techniques-Scraping Using HTML Agility Pack II

This article is the second part of our series on scraping website using HTML Agility-Pack. Having tackled the cover and contents page in the previous article, we’re now ready to put the main content. Let’s start off with:

HTML Selectors

Selectors allow you to select HTML node from HTML document.

HTML Selector Methods

SelectNodes() :  Collect a bunch of nodes matching the X-path expression.

SelectSingleNode(String): Collect the first Xml-Node that matching the X-Path expression.

SelectNodes Method

Selects a list of nodes matching the HtmlAgilityPack.HtmlNode.XPath expression.

Parameters: xpath(The XPath expression.)

Returns: An HtmlAgilityPack.HtmlNodeCollection containing   a collection of nodes matching the HtmlAgilityPack. HtmlNode.XPath query, or null if no node matched the X-Path expression.

Examples:

(i) The following example selects the first node matching the X-Path expression using SelectNodes method.

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode.SelectNodes("//td/input").First().Attributes["value"].Value;

(ii) The following example selects all nodes which are matching the XPath expression.

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//td/input");

SelectSingleNode Method

Selects first HtmlNode matching the HtmlAgilityPack.HtmlNode.XPath expression.

Parameters: xpath(The X-Path expression. May not be null.)

Returns: The first Node that matches the X-Path query or a null reference if no matching node was found.

Example:

The following example selects the first node matching the X-Path expression using SelectNodes method.

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode.SelectSingleNode("//td/input").Attributes["value"].Value;

HTML Manipulation

Manipulation allows you to cross HTML node.

HTML Manipulation Properties

InnerHtml: Gets or Sets the HTML within the start and end tags of the object.

InnerText: Gets the text between the start and end tags of the object.

OuterHtml: Gets the object and its content in HTML.

ParentNode: Gets the top most node of this node (for nodes that can have parents).

InnerHtml

public virtual string InnerHtml { get; set; } :-

Gets or Sets the HTML between the start and end tags of the object. InnerHtml is a member of HtmlAgilityPack.HtmlNode.

Example:

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlNodes =htmlDoc.DocumentNode.SelectNodes("//body/h1");

foreach (var node in htmlNodes)

{

Console.WriteLine(node.InnerHtml);

}

InnerText

public virtual string InnerText { get; } :-

Gets the text between the start and end tags of the object. InnerText is a member of HtmlAgilityPack.HtmlNode .

Example:

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");

foreach (var node in htmlNodes)

{

Console.WriteLine(node.InnerText);

}

OuterHtml

public virtual string OuterHtml { get; } :-

Gets the object and its content in HTML. OuterHtml is a component of HtmlAgilityPack.HtmlNode.

Example:

var htmlDoc = new HtmlDocument();

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//body/h1");

foreach (var node in htmlNodes)

{

Console.WriteLine(node.OuterHtml);

}

HTML Manipulation Methods

AppendChild(): Combine the specified node to the end of the children’s list of this node.

AppendChildren(): Merge the identified node to the end of the list of children of this node.

Clone(): Generate the identical of the node.

CloneNode(Boolean): Generate an identical of the node.

CloneNode(String): Generate an identical of the node and modify its name at the same time.

CloneNode(String, Boolean): Generate an identical of the node and modify its name at the same time.

CopyFrom(HtmlNode): Make an identical of the node and the sub-tree under it.

CopyFrom(HtmlNode, Boolean): Make a copy of the node.

CreateNode(): Make a HTML node from a string representing literal HTML.

InsertAfter(): Inserts the enumerated node immediately after the enumerated reference node.

InsertBefore: Inserts the enumerated node immediately before the enumerated reference node.

PrependChild: Adds the enumerated node to the start of the children’s list of this node.

PrependChildren: Merge the enumerated node list to the start of the list of children of this node.

Remove: Discard node from the main collection.

RemoveAll: Discard all the children and/or attributes of the present node.

RemoveAllChildren: Delete all the children of the present node.

RemoveChild(HtmlNode): Remove the enumerated  child node.

RemoveChild(HtmlNode, Boolean): Removes the enumerated  child node.

ReplaceChild(): Replaces the child node oldChild with newChild node.

HTML Traversing

Traversing allow you to traverse through HTML node.

HTML Traversing Properties

ChildNodes: Gets all the children of the node.

FirstChild: Gets the first child of the node.

LastChild: Obtains the final child of the node.

NextSibling: Obtain the node instantly following this element.

ParentNode: Obtain the upper node of this node (for nodes that can have parents).

ChildNodes 

Example:

using system;

using system.xml;

using htmlAgelityPack;

public class Program

{

public static void Main()

{

var html=@"<body>

<h1>This is<b>bold</b>heading</h1>

<p>This is <u>underline</u>paragraph</p>

</body>";

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlBody =  htmlDoc.DocumentNode.SelectSingleNodes("//body");

HtmlNodeCollection childNodes = htmlBody.ChildNodes;

foreach (var node in childNodes )

{

if(node.NodeType == HtmlNodeType.Element)

{

Console.WriteLine(node.OuterHtml);

}

}

}

}
 

Output:

<h1>This is<b>bold</b>heading</h1>

<p>This is <u>underline</u>paragraph</p>

FirstChild

Example:

using system;

using system.xml;

using htmlAgelityPack;

public class Program

{

public static void Main()

{

var html=@"<body>

<h1>This is<b>bold</b>heading</h1>

<p>This is <u>underline</u>paragraph</p>

</body>";

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlBody =  htmlDoc.DocumentNode.SelectSingleNodes("//body");

HtmlNodeCollection firstChild= htmlBody.FirstChild;

Console.WriteLine(firstChild.OuterHtml);

}

}

Output:

<h1>This is<b>bold</b>heading</h1>

LastChild

Example:

using system;

using system.xml;

using htmlAgelityPack;

public class Program

{

public static void Main()

{

var html=@"<body>

<h1>This is<b>bold</b>heading</h1>

<p>This is <u>underline</u>paragraph</p>

</body>";

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var htmlBody =  htmlDoc.DocumentNode.SelectSingleNodes("//body");

HtmlNodeCollection lastChild= htmlBody.LastChild;

Console.WriteLine(lastChild.OuterHtml);

}

}

Output:

<p>This is <u>underline</u>paragraph</p>

NextSibling

Example:

using system;

using system;

using system.xml;

using htmlAgelityPack;

public class Program

{

public static void Main()

{

var html=@"<body>

<h1>This is<b>bold</b>heading</h1>

<p>This is <u>underline</u>paragraph</p>

<h2>This is<i>italic</i>heading</h2>

<h2>This is new heading</h2>

</body>";

var htmlDoc = new HtmlDocument();

htmlDoc.LoadHtml(html);

var node =  htmlDoc.DocumentNode.SelectSingleNodes("//body/h1");

HtmlNode sibling = node.NextSibling;

while(sibling != null)

{

if(sibling.NodeType == HtmlNodeType.Element)

{

Console.WriteLine(sibling.OuterHtml);

sibling = sibling.NextSibling;

}

}

}

}

Output:

<p>This is <u>underline</u>paragraph</p>

<h2>This is<i>italic</i>heading</h2>

<h2>This is new heading</h2>

HTML Traversing Methods

Ancestors(): Gets all the ancestor of the node.

Ancestors(String): Gets ancestors with matching the name.

AncestorsAndSelf(): Gets all anscestor nodes and the current node.

AncestorsAndSelf(String): Gets all ancestor nodes and the current node with matching the name.

DescendantNodes: Obtains all Descendant nodes for this node and each of child nodes.

Descendants(): Obtain all Descendant nodes in enumerated list.

Descendants(String): Get all descendant nodes with matching the name.

DescendantsAndSelf(): Returns a group of all descendant nodes of this element, in document order.

DescendantsAndSelf(String): Gets all descendant nodes including this node.

Element: Gets first generation child node matching name.

Elements: Gets matching first generation child nodes matching the name.

HTML Writer

Save HtmlDocument & Write HtmlNode.

HTML Writer Methods (HtmlDocument)

Save(Stream): Saves the HTML document to the specified stream.

Save(StreamWriter): Saves the HTML document to the specified StreamWriter.

Save(TextWriter): Reserves the HTML document to the enumerated TextWriter.

Save(String): Reserves the mixed document to the enumerated file.

Save(XmlWriter): Reserves  the HTML document to the enumerated  XmlWriter.

Save(Stream, Encoding): Watch over the HTML document to the enumerated stream.

Save(String, Encoding): Watch over the mixed document to the enumerated file.

HTML Writer Methods (HtmlNode)

WriteContentTo(): Saves all the children of the node to a string.

WriteContentTo(TextWriter): Reserves all the children of the node to the enumerated TextWriter.

WriteTo(): Reserves the current node to a string.

WriteTo(TextWriter): Reserves the current node to the enumerated TextWriter.

WriteTo(XmlWriter): Reserves the current node to the enumerated XmlWriter.

HTML Utilities

HTML Utilities Methods (HtmlDocument)

DetectEncoding(Stream): Find out the encoding of an HTML stream.

DetectEncoding(TextReader): Find out the encoding of an HTML text provided on a TextReader.

DetectEncoding(String): Find out the encoding of an HTML file.

DetectEncodingAndLoad(String): Find out the encoding of an HTML document from a file first, and then loads the file.

DetectEncodingAndLoad(String, Boolean): Find out the encoding of an HTML document from a file first, and then loads the file.

HTML Attributes

HTML Attributes Methods

Add(HtmlAttribute): Adds supplied item to collection.

Add(String, String): Adds a new attribute to the collection with the given values.

Append(String): Creates and inserts a new attribute as the last attribute in the collection.

Append(HtmlAttribute): Inserts the specified attribute as the last attribute in the collection.

Append(String, string): Creates and inserts a new attribute as the last attribute in the collection.

Remove(): Removes all attributes from the collection.

Remove(String): Removes an attribute from the list, using its name. If there are more than one attributes with this name, they will all be removed.

Remove(HtmlAttribute): Removes a given attribute from the list.

RemoveAll(): Remove all attributes in the list.

RemoveAt(): Removes the attribute at the specified index.

SetAttributeValue(): Sets the value of an attribute, adds an attribute, or removes an attribute. If the attribute is not found, it will be created automatically.

I have covered a lot of ground in very little code which I hope this post shows us how to effectively parse HTML documents in C# and further impresses on you the power of this library!

Latest posts by Rahul Huria (see all)

1 thought on “Web Scraping Techniques-Scraping Using HTML Agility Pack II”

  1. Hello Kumar, excellent blog, and your html agility pack articles will be very useful on my next project. I have a question… How can I access a website that required login, and continue navigating and scraping a site? Thanks!

    Reply

Leave a Comment