Web Crawler

Java-based Search Engine Project • Mar 2024 - May 2024

Project Overview

COMP4321-Crawler is a web crawler and search engine project developed for the COMP4321 Information Retrieval course. It focuses on building a functional search engine with a spider for web crawling, an indexer for keyword extraction, and a retrieval function that ranks results based on relevance.
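As a rough illustration of how an indexer and a retrieval function can share one data structure, the sketch below builds a small in-memory inverted index. The class name `InvertedIndexSketch`, its methods, and the use of a `HashMap` in place of a persistent store are all assumptions made for this example, not the project's actual code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory inverted index (the real project persists its data).
final class InvertedIndexSketch {
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Records every keyword of one document under that document's id. */
    void addDocument(int docId, List<String> keywords) {
        for (String keyword : keywords) {
            postings.computeIfAbsent(keyword, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    /** Returns the posting list for a term, or an empty map if the term is unknown. */
    Map<Integer, Integer> postingsFor(String term) {
        return postings.getOrDefault(term, Map.of());
    }
}
```

Looking up a query term then costs one map access, regardless of how many documents were indexed.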

Key Features

  • Spider Function - Recursively fetches pages from a given website using a breadth-first search (BFS) algorithm
  • Indexer - Extracts keywords from pages and inserts them into an inverted file structure
  • Retrieval Function - Compares query terms against the inverted file and returns top-ranked documents
  • Phrase Query Support - Matches multi-word phrases in a query, in addition to individual keywords
  • Web Interface - User-friendly interface for query input and result display
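
The spider's breadth-first traversal can be sketched as follows. To keep the example self-contained, a `Map` of URL to outgoing links stands in for fetching a page and extracting its links; that map, the class name `BfsSpiderSketch`, and the `maxPages` limit are assumptions for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch of a BFS crawl over a pre-built link graph.
final class BfsSpiderSketch {
    /**
     * Visits pages breadth-first from startUrl and stops after maxPages pages.
     * linkGraph replaces real HTTP fetching and link extraction.
     */
    static List<String> crawl(String startUrl, Map<String, List<String>> linkGraph, int maxPages) {
        List<String> visitOrder = new ArrayList<>();
        Set<String> seen = new HashSet<>();       // URLs already queued, to avoid revisits
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.remove();
            visitOrder.add(url);
            for (String child : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(child)) {            // add() returns false for known URLs
                    frontier.add(child);
                }
            }
        }
        return visitOrder;
    }
}
```

Because the frontier is a FIFO queue, every page one link away from the start is visited before any page two links away.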

System Architecture

Backend Components

  • Crawler with Breadth-First Search (BFS) implementation
  • Text processing pipeline (stop word removal, stemming)
  • N-gram extraction for enhanced search capabilities
  • JDBM database for efficient data storage and retrieval
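
A minimal sketch of such a text processing pipeline is shown below. It is not the project's implementation: the stop word list is a tiny sample, `stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, and the n-grams are assumed to be word-level. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: stop word removal, naive stemming, and word n-grams.
final class TextPipelineSketch {
    // Tiny illustrative stop word list; a real list would be much longer.
    static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "for", "is", "in", "to");

    /** Lowercases, splits on non-alphanumerics, drops stop words, then stems each token. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    /** Naive suffix stripping; a placeholder for a real stemming algorithm. */
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Joins each run of n consecutive tokens into one space-separated n-gram. */
    static List<String> ngrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }
}
```

For example, `tokenize("The crawler is indexing pages for the search engine")` yields `crawler`, `index`, `page`, `search`, `engine`, and `ngrams(..., 2)` on that list yields four bigrams starting with `crawler index`.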

Frontend Interface

  • Clean, intuitive search interface
  • Vector space model with cosine similarity ranking
  • Advanced search with AND/OR operators
  • Comprehensive result display with relevance scoring
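
Cosine similarity ranking in a vector space model can be sketched as below. The term weighting used by the project is not stated here, so this example assumes raw term frequencies; the class `CosineRankerSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rank documents by cosine similarity to a query.
final class CosineRankerSketch {
    /** Builds a raw term-frequency vector (an assumed weighting scheme). */
    static Map<String, Double> termFrequencies(List<String> tokens) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : tokens) {
            vector.merge(token, 1.0, Double::sum);
        }
        return vector;
    }

    /** cos(q, d) = (q . d) / (|q| * |d|); returns 0 when either vector is empty. */
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(q) * norm(d);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sumSquares = 0.0;
        for (double w : v.values()) {
            sumSquares += w * w;
        }
        return Math.sqrt(sumSquares);
    }

    /** Returns document ids sorted by descending similarity to the query. */
    static List<String> rank(List<String> query, Map<String, List<String>> docs) {
        Map<String, Double> q = termFrequencies(query);
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            scores.put(doc.getKey(), cosine(q, termFrequencies(doc.getValue())));
        }
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> scores.get(id)).reversed());
        return ids;
    }
}
```

Dividing by the vector lengths keeps long documents from outranking short ones merely because they repeat more words.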

Technology Stack

Programming Language: Java (OpenJDK 21.0.2)
Web Server: Apache Tomcat
Database: JDBM
Algorithms: BFS, Vector Space Model, Cosine Similarity

System Screenshots

Search Results Interface

[Screenshot: web crawler search results]

Advanced Search with Boolean Operators

[Screenshot: advanced search with AND/OR operators]

Keyword Analysis Table

[Screenshot: keyword frequency table]

Project Links & Information

Repository

The source code and additional information are available in the COMP4321-Crawler GitHub repository.

License

This project is open-source and available under the MIT License.
