Understanding HTML Basics: Structure, Tags, and Elements
HTML (Hypertext Markup Language) serves as the foundation for web development and is essential for web scraping projects. This article explores the fundamentals of HTML, breaking down its core components and structure to help you better understand how web pages are constructed.
The Origins of HTML
The World Wide Web began as a way to share documents over the internet, which already offered services such as email and news. In 1990, English computer scientist Tim Berners-Lee, working at CERN with Belgian engineer Robert Cailliau on the World Wide Web proposal, created HTML, changing how information would be shared online.
HTML Tags Explained
HTML tags are the building blocks of web pages. Each tag is a name enclosed in angle brackets, such as <p>. Tags typically come in pairs (an opening tag and a closing tag, which adds a forward slash, as in </p>), though some stand alone as single tags. Tag names are not case-sensitive, meaning <head>, <HEAD>, and <Head> are all equivalent.
Tags tell the browser how to structure and display content and come in two main types:
- Empty (single) tags, such as <br> and <img>, which have no content and no closing tag
- Container tags, such as <p>...</p>, which have opening and closing components
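The two points above (case-insensitive names, and empty versus container tags) can be checked with a short Python sketch. It assumes the standard-library html.parser module, which is only one of several possible parsers; the sample markup and the names TagCollector and collect_tags are made up for illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the names of opening and closing tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already converted to lower case.
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


def collect_tags(markup):
    """Return (opened, closed) tag-name lists for a piece of markup."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.opened, parser.closed


# Hypothetical sample: mixed-case container tags plus the empty <br> tag.
sample = "<HEAD></Head><P>Line one<br>Line two</P>"
opened, closed = collect_tags(sample)
print(opened)  # ['head', 'p', 'br']
print(closed)  # ['head', 'p']
# In this sample, the tag that was opened but never closed is the empty one.
print([tag for tag in opened if tag not in closed])  # ['br']
```

Because the parser normalizes names to lower case, a scraper built this way does not need to handle <HEAD> and <head> separately.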
HTML Attributes
HTML attributes provide additional information to elements and modify their behavior or appearance. They appear within the opening tag and consist of an attribute name and an attribute value.
For example, in an image tag: <img src="image.jpg">
- img is the tag
- src is the attribute name
- "image.jpg" is the attribute value
Attributes allow for customization, such as linking to resources (src, href) or setting inline styles like text color and font size through the style attribute.
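For scraping, attribute names and values are usually what you want to read out of a tag. The following is a minimal sketch, again assuming Python's standard-library html.parser; AttributeReader and read_attributes are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class AttributeReader(HTMLParser):
    """Collects each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, {attribute name: attribute value})

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples, quotes removed.
        self.found.append((tag, dict(attrs)))


def read_attributes(markup):
    """Return the tags in markup paired with their attribute dictionaries."""
    parser = AttributeReader()
    parser.feed(markup)
    parser.close()
    return parser.found


for tag, attributes in read_attributes('<img src="image.jpg">'):
    print(tag, attributes)  # img {'src': 'image.jpg'}
```

The dictionary form makes lookups such as attributes.get("src") straightforward, at the cost of keeping only the last value if a tag repeats an attribute name.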
HTML Elements
An HTML element includes everything from the opening tag to the closing tag, including the content between them. The complete structure consists of:
- Opening tag (with optional attributes)
- Content
- Closing tag
For example: <a href="contact.html">Contact Us</a>
In this element, <a href="contact.html"> is the opening tag with an attribute, “Contact Us” is the content, and </a> is the closing tag.
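The three parts of an element map directly onto the events a parser reports. As a rough sketch (assuming Python's standard-library html.parser; the class and function names are invented for this example), the link element above can be split like this:

```python
from html.parser import HTMLParser


class ElementSplitter(HTMLParser):
    """Records opening tags, content, and closing tags in document order."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(("opening tag", tag, dict(attrs)))

    def handle_data(self, data):
        self.parts.append(("content", data))

    def handle_endtag(self, tag):
        self.parts.append(("closing tag", tag))


def split_element(markup):
    """Return the parts of the markup in the order the parser saw them."""
    parser = ElementSplitter()
    parser.feed(markup)
    parser.close()
    return parser.parts


for part in split_element('<a href="contact.html">Contact Us</a>'):
    print(part)
# ('opening tag', 'a', {'href': 'contact.html'})
# ('content', 'Contact Us')
# ('closing tag', 'a')
```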
The Structure of HTML Documents
HTML documents follow a clear hierarchical structure, with each element nested inside its parent:
- DOCTYPE declaration, such as <!DOCTYPE html> (indicates the document type)
- <html> (the root element, which contains everything below)
  - <head> (contains metadata, title, etc.)
    - <title> (defines the page title shown in browser tabs)
  - <body> (contains the visible content of the page)
Within the body, you can include various elements such as paragraphs, headings, lists, images, and tables.
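To see this hierarchy in a real page, it can help to print the tags as an indented outline. The sketch below assumes Python's standard-library html.parser and well-formed markup in which every container tag is closed; SAMPLE_PAGE, OutlineBuilder, and outline are hypothetical names, and VOID_TAGS lists only a few common empty tags rather than the complete set.

```python
from html.parser import HTMLParser

# Empty tags never get a closing tag, so they must not increase the depth.
# Only a few common ones are listed; this is not the complete set.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

# Hypothetical minimal page used for illustration.
SAMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Sample page</title>
</head>
<body>
<h1>Heading</h1>
<p>Paragraph</p>
</body>
</html>"""


class OutlineBuilder(HTMLParser):
    """Builds an indented outline of the tags in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_decl(self, decl):
        # Called for the DOCTYPE declaration; decl is e.g. "DOCTYPE html".
        self.lines.append("<!" + decl + ">")

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth = max(self.depth - 1, 0)


def outline(markup):
    """Return the outline lines for a document."""
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.lines


print("\n".join(outline(SAMPLE_PAGE)))
```

Real-world pages often omit closing tags, so a depth counter this simple can drift; a dedicated HTML parsing library is usually a better fit for messy input.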
Common HTML Elements
Text Formatting
- <p> – Paragraph
- <h1> to <h6> – Headings (h1 is the top level and renders largest by default; h6 is the lowest level and renders smallest)
- <b> – Bold text
- <i> – Italic text
- <u> – Underlined text
- <br> – Line break
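Headings are a common scraping target because they summarize a page's sections. Here is a small sketch that collects each heading with its level, assuming Python's standard-library html.parser and headings that are not nested inside one another; HeadingCollector and find_headings are illustrative names.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collects (level, text) pairs for every heading in a document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])  # "h2" -> 2
            self._text = []

    def handle_data(self, data):
        # Text inside inline tags such as <b> or <i> is collected as well.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def find_headings(markup):
    """Return a list of (level, text) for the headings in markup."""
    collector = HeadingCollector()
    collector.feed(markup)
    collector.close()
    return collector.headings


# Hypothetical sample markup.
sample = "<h1>Guide</h1><p>Intro with <b>bold</b> text.</p><h2>Setup <i>notes</i></h2>"
print(find_headings(sample))  # [(1, 'Guide'), (2, 'Setup notes')]
```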
Lists
- <ol> – Ordered list (numbered)
- <ul> – Unordered list (bulleted)
- <li> – List item
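List items can be pulled out in much the same way. The sketch below groups items under the list they belong to; it assumes Python's standard-library html.parser and flat lists (nested lists are not handled), and the names ListItemCollector and find_lists are made up for this example.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Groups the text of <li> items under their <ol> or <ul> list."""

    def __init__(self):
        super().__init__()
        self.lists = []  # each entry: (list tag, [item text, ...])
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.lists.append((tag, []))
        elif tag == "li" and self.lists:
            self._in_item = True
            self.lists[-1][1].append("")

    def handle_data(self, data):
        if self._in_item:
            # Append the text to the item currently being read.
            self.lists[-1][1][-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False


def find_lists(markup):
    """Return [(list tag, [item, ...]), ...] with item text stripped."""
    collector = ListItemCollector()
    collector.feed(markup)
    collector.close()
    return [(tag, [item.strip() for item in items])
            for tag, items in collector.lists]


# Hypothetical sample markup: one ordered and one unordered list.
sample = ("<ol><li>Fetch the page</li><li>Parse the HTML</li></ol>"
          "<ul><li>Tags</li><li>Attributes</li></ul>")
for tag, items in find_lists(sample):
    print(tag, items)
# ol ['Fetch the page', 'Parse the HTML']
# ul ['Tags', 'Attributes']
```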
Tables
HTML tables are structured with these elements:
- <table> – Defines the table
- <tr> – Table row
- <td> – Table data (cell)
Tables allow for organizing content in rows and columns, making them useful for displaying structured data.
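Because tables already arrange data in rows and cells, they convert naturally into a list of rows. A minimal sketch, assuming Python's standard-library html.parser and a single simple table built only from the three tags above (TableReader, read_table, and the sample markup are hypothetical):

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Reads <tr>/<td> markup into a list of rows, each a list of cells."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_data(self, data):
        # Whitespace between tags is ignored because it falls outside a cell.
        if self._in_cell:
            self.rows[-1][-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


def read_table(markup):
    """Return the table contents as a list of rows of stripped cell text."""
    reader = TableReader()
    reader.feed(markup)
    reader.close()
    return [[cell.strip() for cell in row] for row in reader.rows]


# Hypothetical sample table.
SAMPLE_TABLE = """<table>
  <tr><td>Tag</td><td>Meaning</td></tr>
  <tr><td>tr</td><td>Table row</td></tr>
  <tr><td>td</td><td>Table data</td></tr>
</table>"""

for row in read_table(SAMPLE_TABLE):
    print(row)
# ['Tag', 'Meaning']
# ['tr', 'Table row']
# ['td', 'Table data']
```

From here the rows can be written to a CSV file or loaded into whatever data structure the scraping project uses.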
Conclusion
Understanding HTML basics is crucial for anyone involved in web development or web scraping. By grasping the concepts of tags, attributes, elements, and document structure, you can better navigate and extract information from websites, making your web scraping projects more efficient and effective.