How I built a chatbot on GitHub repositories using an AI web scraper and an LLM | By Victor Yakubu | February 2025
5 mins read



Build an AI-powered chatbot to analyze GitHub repositories using scraped data and an LLM

Stackademic

I built an AI-powered chatbot that can answer questions about GitHub repositories by extracting key information from repository data. I used Bright Data's AI web scraper to collect structured data, then built the chatbot with the Ollama Phi3 model to analyze and interact with that data.

In this article, I will walk you through:

✅ How I obtained GitHub repository data using Bright Data's AI web scraper.
✅ Building a chatbot with the Ollama Phi3 model.
✅ Implementing a Streamlit-based GitHub Insights tool for real-time interactions.
✅ Lessons learned and the impact of using AI for repository analysis.

To build the chatbot, I needed a high-quality dataset containing key repository details. Instead of scraping GitHub manually, I used Bright Data's AI web scraper, which provided a structured and automated way to collect repository data.

They offer two ways to scrape data from any website with their web scrapers: a Scraper API and a no-code scraper that anyone can use.

  1. Sign up on Bright Data and click Web Scrapers in the left panel.

If you are a new user signing up, you get $5 in free credit to try their services for 7 days.

2. Type "GitHub" in the search bar and click on the first result.

3. A list of GitHub scrapers will appear. Select "GitHub repository – Collect by URL" for this use case.

4. Select the no-code scraper.

5. Click "Add an input" to add the GitHub repository links you need, then click "Start collecting".

6. Once the status field shows "Ready", click "Download" and choose CSV as the format.
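Before wiring the export into the app, it helps to sanity-check the CSV. Here is a minimal sketch using pandas (installed in Step 1 below); it assumes the downloaded file is saved as githubdata.csv in your project folder:

import pandas as pd

# Load the CSV exported from Bright Data (the file name here is an assumption)
df = pd.read_csv("githubdata.csv")

# Confirm the fields you will need later (e.g. url, code_language, num_stared) are present
print(df.columns.tolist())
print(df.head())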

This project uses Python for data processing and Streamlit for a simple user interface.

Prerequisites

  • Any code editor of your choice.
  • Python installed (version 3.8+ recommended).

Step 1: Project setup

  1. Create the project folder:
mkdir github-insights-tool
cd github-insights-tool

2. Configure a virtual environment:

python -m venv venv

3. Activate the environment:

venv\Scripts\activate       # Windows
source venv/bin/activate    # macOS/Linux

4. Install dependencies:

pip install pandas streamlit langchain-ollama

  • Streamlit – to build the user interface
  • Pandas – to handle dataset operations
  • LangChain (Ollama) – for AI-powered repository analysis

Project structure:

github-insights-tool/
│── githubdata.csv   # your dataset from Bright Data
│── app.py

Step 2: Installing and running the chatbot (Ollama Phi3 model) locally

This AI-powered tool generates insights about GitHub repositories by analyzing their strengths, weaknesses, and usability. It also surfaces key repository details without requiring you to navigate through several sections on GitHub.

Why Ollama?

  • Free and easy to set up
  • Runs locally without an internet dependency
  • Provides fast and customizable responses

Installing Ollama

Ollama provides a simple CLI tool for running large language models (LLMs) locally. Install it according to your operating system:

# Windows (download the installer, then run it)
curl -LO <Ollama Windows installer URL> && start ollama.exe

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

Download the Phi3 model:

ollama pull phi3

Run the Olllama model:

ollama run phi3

💡 Note: Always make sure the Ollama model is running locally before you run your code. Otherwise, the AI model will not be accessible.
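If you want to confirm the model is reachable from Python before building the app, here is a minimal check (it assumes the langchain-ollama package from Step 1 is installed and that ollama run phi3 is active):

from langchain_ollama import OllamaLLM

# Send a trivial prompt to the locally running Phi3 model
llm = OllamaLLM(model="phi3")
print(llm.invoke("Reply with a short greeting."))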

Step 3: Implementing the GitHub Insights tool

The tool consists of the following features:

Initializing Ollama

import streamlit as st
import pandas as pd
from langchain_ollama import OllamaLLM

# Initialize Ollama with the chosen model
llm = OllamaLLM(model="phi3")

Loading the GitHub dataset

@st.cache_data
def load_github_data():
    df = pd.read_csv("githubdata.csv")
    df.columns = df.columns.str.strip().str.lower()  # Normalize column names to lowercase
    return df
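Because the rest of the tool refers to specific column names (for example url, code_language, and num_stared), it is worth confirming what the normalized columns look like. A quick check you can run while developing, assuming the Bright Data export is saved as githubdata.csv:

# Print the normalized column names so they can be matched against the fields used below
df = load_github_data()
print(df.columns.tolist())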

Analyzing the selected repository using AI

def analyze_repository(repo_data, llm):
    prompt = f"""
    Analyze the following GitHub repository data and provide insights:
    {repo_data.to_dict()}
    Focus on:
    1. Code quality and maintainability
    2. Popularity and engagement
    3. Potential use cases
    4. Key strengths and weaknesses
    """
    try:
        return llm.invoke(prompt)
    except Exception as e:
        return f"Error generating analysis: {e}"

This function generates insights based on code quality, engagement, and potential use cases.
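As a quick standalone test outside the Streamlit UI, you could run the analysis on a single row of the dataset. This sketch reuses the llm instance and load_github_data function defined above; the choice of the first row is arbitrary:

# Analyze one repository row from the dataset as a smoke test
github_df = load_github_data()
sample_repo = github_df.iloc[0]
print(analyze_repository(sample_repo, llm))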

Interacting with the AI-generated analysis

def interact_with_analysis(analysis, query, llm):
    prompt = f"""
    Based on the following analysis:
    {analysis}
    Answer the user's query: {query}
    """
    try:
        return llm.invoke(prompt)
    except Exception as e:
        return f"Error processing query: {e}"

This allows users to ask follow-up questions about the AI-generated analysis of the repository.
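For example, a follow-up question can be chained onto a previously generated analysis like this (the question text is just an illustration; sample_repo is the row from the previous snippet):

# Ask a follow-up question about the generated analysis
analysis = analyze_repository(sample_repo, llm)
print(interact_with_analysis(analysis, "What are this repository's main weaknesses?", llm))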

Step 4: Defining the Streamlit application

Core features

  • Allows users to enter a GitHub repository URL (it must be one of the URLs present in the CSV file, so the tool can provide responses tailored to that specific repository).
  • Initiates the AI chatbot interaction based on the generated analysis.

def main():
    # Add the GitHub logo next to the title (set the logo image URL in src)
    st.markdown("""<h1 style='display: flex; align-items: center;'>
        <img src='' width='40' style='margin-right:10px;'>
        GitHub Repository Insights Tool
        </h1>""", unsafe_allow_html=True)

    github_df = load_github_data()

    # User input field for entering a GitHub repository URL
    repo_url = st.text_input("Enter GitHub Repository URL")
    analysis_result = ""

    if repo_url:
        # Filter the dataset based on the entered URL
        repo_data = github_df[github_df["url"] == repo_url]
        if not repo_data.empty:
            repo_data = repo_data.iloc[0]

            # Display repository details
            st.subheader("Repository Details")
            st.write(f"Language: {repo_data['code_language']}")
            st.write(f"Stars: {repo_data['num_stared']}")
            st.write(f"Forks: {repo_data['num_fork']}")
            st.write(f"Pull Requests: {repo_data['num_pull_requests']}")
            st.write(f"Last Feature: {repo_data['last_feature']}")
            st.write(f"Latest Update: {repo_data['latest_update']}")

            # Display repository owner details
            st.subheader("Owner Details")
            st.write(f"Owner: {repo_data['user_name']}")
            st.write(f"URL: {repo_data['url']}")

            # AI-powered analysis of the repository
            st.subheader("AI Analysis")
            if st.button("Generate Analysis"):
                with st.spinner("Analyzing repository..."):
                    analysis_result = analyze_repository(repo_data, llm)
                    st.session_state["analysis"] = analysis_result  # Store analysis in session state
                st.write(analysis_result)
        else:
            st.warning("Repository not found in the dataset. Please enter a valid URL.")

    # AI chatbot interaction based on the generated analysis
    if "analysis" in st.session_state:
        st.subheader("Chat with AI about this Repository")
        user_query = st.text_input("Ask a question about the repository analysis")
        if user_query:
            with st.spinner("Processing query..."):
                response = interact_with_analysis(st.session_state["analysis"], user_query, llm)
                st.write(response)

# Run the Streamlit application
if __name__ == "__main__":
    main()

Running the application:

In your terminal, run this command:

streamlit run app.py

Step 5: Using the GitHub Insights tool

  1. Paste the repository URL and view the repository details.

2. Click "Generate Analysis" to produce a repository report.

3. Interact with the chatbot for further insights.

Building a chatbot on GitHub repositories using Bright Data's AI web scraper and Ollama Phi3 proved to be very effective for automating repository insights. This approach saves time, improves accuracy, and provides AI-powered responses grounded in real repository data.

For developers looking for clean, structured GitHub datasets, Bright Data offers reliable, ready-to-use datasets and API integrations to streamline data extraction and analysis.

🚀 Try it and let me know your thoughts!


