Fine Tune Tiny Llama

FineTunningTinyLlamaNLPSeleniumData PreprocessingWeb ScrapingPythonData CollectionDataPreparationStyleMimicryLlamaFactory

Monday, May 6, 2024

This project aims to fine-tune the Tiny Llama model using the Llama Factory to mimic my professor's writing style. The process involves several phases, including data collection, preprocessing, preparation, model fine-tuning, and evaluation. The final goal is to create a model that can generate text in the style of my professor's academic writings.

Phases of the Project

Phase 1: Data Collection

The first step in this project was to collect data by scraping my professor's Google Scholar page. The objective was to gather a comprehensive set of research articles published by the professor.

  • Tool Used: Selenium
  • Details: Selenium was used to automate the process of accessing the Google Scholar page and downloading the available PDFs of the research articles.

Phase 2: Data Preprocessing

After collecting the PDFs, the next step was to preprocess these documents to ensure they were in a usable format for training the model.

  • Purpose: Normalize the content while preserving the writing style.
  • Tools Used: pyMuPDF
  • Steps:
  • Remove page headers, footers, images, and tables along with their captions.
  • Convert the remaining content into paragraph format, as individual words and phrases are insufficient for capturing writing style.

Phase 3: Data Preparation

The preprocessed data needed to be formatted according to the requirements of the Llama Factory model training process.

  • Initial Tools Tried: spaCy, TF-IDF, BERT
  • Tool That Worked: OpenAI API
  • Process:
  • Use the OpenAI API to generate the required data format.
  • Ensure that the data is structured correctly for input into the Llama Factory model.

Phase 4: Model Fine-Tuning

With the data prepared, the next phase involved fine-tuning the Tiny Llama model.

  • Environment: Google Colab
  • Tools Used: BytePair, Llama Factory
  • Steps:
  • Set up the Google Colab notebook and import necessary libraries.
  • Load the Llama Factory UI and integrate the dataset.
  • Define the prompt format and other configurations required by Llama Factory.
  • Run the fine-tuning process to train the Tiny Llama model on the professor's writing style.

Phase 5: Model Evaluation

The final phase focused on evaluating the performance of the fine-tuned model to ensure it accurately mimics the professor's writing style.

  • Process:
  • Generate sample texts using the fine-tuned model.
  • Compare the generated texts with the original writings to assess similarity in style and content.
  • Make any necessary adjustments and re-train if needed.

Getting Started

Prerequisites

  • Python 3.x
  • Selenium
  • pyMuPDF
  • OpenAI API

Installation

  • Clone the repository
https://github.com/Muneeb1030/FineTune-Tiny-Llama.git
  • Install the necessary Python packages
pip install selenium pymupdf openai

Usage

  1. Use Selenium to scrape the Google Scholar page and download the research articles.Example script:
  2. Preprocess the downloaded PDFs to remove unwanted elements.
  3. Format the preprocessed data using the OpenAI API.
  4. Use Google Colab to fine-tune the Tiny Llama model.
  5. Evaluate the model's performance by generating sample texts and comparing them with the original writings.