Fine Tune Tiny Llama
Monday, May 6, 2024
This project fine-tunes the Tiny Llama model with Llama Factory to mimic my professor's writing style. The work is split into five phases: data collection, data preprocessing, data preparation, model fine-tuning, and evaluation. The end goal is a model that generates text in the style of the professor's academic writing.
Phases of the Project
Phase 1: Data Collection
The first step in this project was to collect data by scraping my professor's Google Scholar page. The objective was to gather a comprehensive set of research articles published by the professor.
- Tool Used: Selenium
- Details: Selenium was used to automate the process of accessing the Google Scholar page and downloading the available PDFs of the research articles.
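A minimal sketch of this scraping step is below. The profile URL is a placeholder and the CSS selectors are assumptions; Google Scholar's markup changes over time and only some entries expose a direct PDF link, so verify the selectors against the live page.

```python
# Sketch: crawl a Google Scholar profile and save whatever direct PDF links exist.
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

PROFILE_URL = "https://scholar.google.com/citations?user=PROFILE_ID"  # placeholder
OUT_DIR = "pdfs"
os.makedirs(OUT_DIR, exist_ok=True)

driver = webdriver.Chrome()
driver.get(PROFILE_URL)

# Collect links to the individual article pages listed on the profile (assumed selector).
article_links = [a.get_attribute("href")
                 for a in driver.find_elements(By.CSS_SELECTOR, "a.gsc_a_at")]

for link in article_links:
    driver.get(link)
    # Only some article pages expose a direct PDF link (assumed selector).
    pdf_anchors = driver.find_elements(By.CSS_SELECTOR, "div.gsc_oci_title_ggi a")
    if not pdf_anchors:
        continue
    pdf_url = pdf_anchors[0].get_attribute("href")
    if pdf_url and pdf_url.lower().endswith(".pdf"):
        name = os.path.join(OUT_DIR, pdf_url.split("/")[-1])
        with open(name, "wb") as f:
            f.write(requests.get(pdf_url, timeout=30).content)

driver.quit()
```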
Phase 2: Data Preprocessing
After collecting the PDFs, the next step was to preprocess these documents to ensure they were in a usable format for training the model.
- Purpose: Normalize the content while preserving the writing style.
- Tools Used: pyMuPDF
- Steps:
- Remove page headers, footers, images, and tables along with their captions.
- Convert the remaining content into paragraph format, as individual words and phrases are insufficient for capturing writing style.
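A minimal pyMuPDF sketch of this cleanup is below. It assumes headers and footers sit in fixed top and bottom bands and keeps only paragraph-length text blocks; the 60-point margin and 20-word threshold are placeholder heuristics, and table/caption removal would need extra per-layout rules.

```python
# pyMuPDF sketch: keep body text, skip header/footer bands and image blocks.
import fitz  # pyMuPDF

def extract_paragraphs(pdf_path, margin=60):
    doc = fitz.open(pdf_path)
    paragraphs = []
    for page in doc:
        height = page.rect.height
        # "blocks" yields (x0, y0, x1, y1, text, block_no, block_type); type 1 = image.
        for x0, y0, x1, y1, text, _, block_type in page.get_text("blocks"):
            if block_type == 1:
                continue                      # drop image blocks
            if y1 < margin or y0 > height - margin:
                continue                      # drop header/footer bands (assumed 60 pt)
            text = " ".join(text.split())     # collapse line breaks into one paragraph
            if len(text.split()) > 20:        # keep only paragraph-length blocks
                paragraphs.append(text)
    doc.close()
    return paragraphs
```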
Phase 3: Data Preparation
The preprocessed data needed to be formatted according to the requirements of the Llama Factory model training process.
- Initial Tools Tried: spaCy, TF-IDF, BERT
- Tool That Worked: OpenAI API
- Process:
- Use the OpenAI API to generate the required data format.
- Ensure that the data is structured correctly for input into the Llama Factory model.
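A sketch of this formatting step is below, using the current `openai` Python client to turn each cleaned paragraph into an instruction/input/output record (the Alpaca-style layout Llama Factory accepts). The model name, system prompt, and output file name are assumptions.

```python
# OpenAI API sketch: turn each cleaned paragraph into an instruction-tuning record.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_record(paragraph, model="gpt-3.5-turbo"):
    # Ask the API for a short prompt that the paragraph could answer, so the
    # paragraph itself becomes the target "output" in the professor's style.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You write concise academic writing prompts."},
            {"role": "user",
             "content": f"Write a one-sentence prompt that this paragraph answers:\n\n{paragraph}"},
        ],
    )
    instruction = response.choices[0].message.content.strip()
    return {"instruction": instruction, "input": "", "output": paragraph}

records = [make_record(p) for p in paragraphs]  # `paragraphs` from the preprocessing step
with open("professor_style.json", "w") as f:
    json.dump(records, f, indent=2)
```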
Phase 4: Model Fine-Tuning
With the data prepared, the next phase involved fine-tuning the Tiny Llama model.
- Environment: Google Colab
- Tools Used: BytePair, Llama Factory
- Steps:
- Set up the Google Colab notebook and import necessary libraries.
- Load the Llama Factory UI and integrate the dataset.
- Define the prompt format and other configurations required by Llama Factory.
- Run the fine-tuning process to train the Tiny Llama model on the professor's writing style.
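Before the dataset can be selected in the Llama Factory UI it has to be registered in the repository's `data/dataset_info.json`. The sketch below assumes the Alpaca-style records produced in Phase 3 and uses placeholder names for the dataset and file paths.

```python
# Sketch: register the custom dataset so it appears in the Llama Factory web UI.
import json

entry = {
    "professor_style": {                       # dataset name shown in the UI (assumed)
        "file_name": "professor_style.json",   # file produced in the preparation step
        "columns": {
            "prompt": "instruction",
            "query": "input",
            "response": "output",
        },
    }
}

path = "LLaMA-Factory/data/dataset_info.json"  # path inside the cloned repo (assumed)
with open(path) as f:
    info = json.load(f)
info.update(entry)
with open(path, "w") as f:
    json.dump(info, f, indent=2)
```

Once registered, the dataset and the Tiny Llama base model can be selected in the UI and the fine-tuning run launched from there.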
Phase 5: Model Evaluation
The final phase focused on evaluating the performance of the fine-tuned model to ensure it accurately mimics the professor's writing style.
- Process:
- Generate sample texts using the fine-tuned model.
- Compare the generated texts with the original writings to assess similarity in style and content.
- Make any necessary adjustments and re-train if needed.
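As a sketch of the evaluation loop, the snippet below loads the fine-tuned model with Hugging Face Transformers and generates a sample for manual comparison. The model directory, prompt, and generation settings are placeholders, and it assumes the fine-tuned weights were exported from Llama Factory as a standalone model.

```python
# Sketch: generate a sample in the fine-tuned style for side-by-side comparison.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "tinyllama-professor-style"  # exported fine-tuned model directory (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Summarize the limitations of transformer-based language models."  # placeholder
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Generated samples can then be read side by side with paragraphs from the original papers to judge stylistic similarity before deciding whether to adjust and re-train.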
Getting Started
Prerequisites
- Python 3.x
- Selenium
- pyMuPDF
- OpenAI API
Installation
- Clone the repository
git clone https://github.com/Muneeb1030/FineTune-Tiny-Llama.git
- Install the necessary Python packages
pip install selenium pymupdf openai
Usage
- Use Selenium to scrape the Google Scholar page and download the research articles (see the example script under Phase 1).
- Preprocess the downloaded PDFs to remove unwanted elements.
- Format the preprocessed data using the OpenAI API.
- Use Google Colab to fine-tune the Tiny Llama model.
- Evaluate the model's performance by generating sample texts and comparing them with the original writings.