Skip to content

A tool to convert PDF slides into markdown format with AI-powered content analysis

License

Notifications You must be signed in to change notification settings

evereven-tech/ragazza

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Ragazza

The name "Ragazza" was chosen as it's memorable, simple, and cleverly contains "RAG" (Retrieval-Augmented Generation) within its spelling.

A tool to convert PDF slides into markdown format with AI-powered content analysis, suitable for loading into LLM Models.

Features

  • Extracts text content from PDF slides
  • Generates visual descriptions using Claude Models (or any other AWS Bedrock availables)
  • Provides educational purpose analysis for each slide
  • Supports error handling and retry mechanisms
  • Progress tracking with tqdm
  • Comprehensive logging

Installation

Using pip

pip install ragazza

From source

  1. Clone the repository:

    git clone https://github.com/evereven-tech/ragazza.git
    cd ragazza
  2. Use make commands for development:

    make help         # Show all available commands
    make install      # Install package in production mode
    make install-dev  # Install in development mode with dev dependencies
    make build        # Build package distribution
    make lint         # Check style with flake8
    make test         # Run tests
    make clean        # Clean up build artifacts

System Dependencies

You'll need to install Poppler for PDF processing:

# Ubuntu/Debian
sudo apt-get install poppler-utils

# macOS
brew install poppler

# Windows: Download and add Poppler to PATH

AWS Configuration

This tool requires access to AWS Bedrock to use Claude AI models for content analysis:

  1. Ensure you have an AWS account with Bedrock access enabled
  2. Request model access for Claude models in the AWS Bedrock console
  3. Configure your AWS credentials:
    pip install awscli
    aws configure
  4. Enter your AWS Access Key ID, Secret Access Key, and set default region to 'us-east-1'

Important notes:

  • Using AWS Bedrock incurs costs based on token usage
  • AWS Bedrock may not be available in all regions
  • Your AWS user/role needs permissions for 'bedrock:InvokeModel'
  • If you don't have AWS Bedrock access, this tool cannot function properly

Usage

Basic usage:

ragazza input.pdf output.md

Advanced options:

ragazza --model "anthropic.claude-3-5-sonnet-20241022-v2:0" --max-tokens 1000 input.pdf output.md

Output

The script generates:

  • A markdown file with structured content for each slide
  • Temporary images in ./tmp directory (automatically cleaned up)
  • A log file (ragazza.log) with processing details

About

A tool to convert PDF slides into markdown format with AI-powered content analysis

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

No packages published