This repository contains Jupyter Notebooks related to fraud detection, data streaming, and real-time data visualization. These notebooks cover various aspects of processing, analyzing, and modeling data to address fraudulent transactions in eCommerce and other contexts.
## Notebooks

- **Analysing Fraudulent Transaction Data.ipynb**
  - Purpose: Exploratory data analysis (EDA) of fraudulent transaction datasets.
  - Key Components:
    - Analyzing patterns in fraudulent transactions.
    - Visualizing data distributions and key features.
  - Libraries used: `pandas`, `matplotlib`, `seaborn`.
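As an illustration of the kind of EDA this notebook performs, here is a minimal sketch; the column names (`amount`, `category`, `is_fraud`) and the toy data are assumptions, not the actual dataset schema:

```python
import pandas as pd

# Toy stand-in for the real transaction dataset (schema is assumed).
df = pd.DataFrame({
    "amount":   [20.0, 950.0, 15.5, 1200.0, 40.0, 880.0],
    "category": ["food", "electronics", "food", "electronics", "food", "electronics"],
    "is_fraud": [0, 1, 0, 1, 0, 0],
})

# Fraud rate and average amount per category -- a typical first EDA cut.
summary = df.groupby("category").agg(
    fraud_rate=("is_fraud", "mean"),
    avg_amount=("amount", "mean"),
    n=("is_fraud", "size"),
)
print(summary)
```

In the notebook itself, `matplotlib` and `seaborn` would visualize these distributions (for example, a histogram of `amount` split by the fraud label).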
- **Building Models for eCommerce Fraud Detection.ipynb**
  - Purpose: Building and evaluating machine learning models for fraud detection.
  - Key Components:
    - Preprocessing data for model training.
    - Training and evaluating models such as Logistic Regression and Random Forest.
  - Libraries used: `scikit-learn`, `numpy`, `pandas`.
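A minimal sketch of the modeling workflow, using synthetic data rather than the real dataset (the two features and their distributions are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features: fraudulent transactions here have larger amounts
# and occur at night -- an assumption made only to produce separable data.
rng = np.random.default_rng(42)
n = 500
amounts = np.concatenate([rng.normal(50, 15, n), rng.normal(400, 100, n)])
hours = np.concatenate([rng.integers(8, 22, n), rng.integers(0, 6, n)])
X = np.column_stack([amounts, hours])
y = np.concatenate([np.zeros(n), np.ones(n)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing (scaling) plus Logistic Regression; a RandomForestClassifier
# could be swapped into the same pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

On real fraud data, which is heavily imbalanced, accuracy alone is a poor metric; the notebook's evaluation would typically also look at precision/recall on the fraud class.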
- **Producing the Data.ipynb**
  - Purpose: Simulating and producing data streams for analysis.
  - Key Components:
    - Generating mock data for fraud scenarios.
    - Producing data using streaming technologies.
  - Libraries used: `faker`, `pandas`.
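The data-production step can be sketched with the standard library alone; the notebook uses `faker` for realistic names and emails, but this stand-in shows the shape of the records (all field names and the fraud-rate logic are illustrative assumptions):

```python
import json
import random
import uuid
from datetime import datetime, timezone

def make_transaction(fraud_rate=0.05, rng=random):
    """Generate one mock transaction; roughly `fraud_rate` of them are fraudulent."""
    is_fraud = rng.random() < fraud_rate
    return {
        "transaction_id": str(uuid.uuid4()),
        # Assumed pattern: fraudulent transactions skew toward larger amounts.
        "amount": round(rng.uniform(500, 2000) if is_fraud else rng.uniform(1, 200), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "is_fraud": is_fraud,
    }

# A producer would json.dumps each record and send it to a streaming topic.
batch = [make_transaction() for _ in range(5)]
print(json.dumps(batch[0]))
```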
- **Consuming Data Using Kafka and Visualise.ipynb**
  - Purpose: Consuming data streams and visualizing results.
  - Key Components:
    - Setting up Kafka consumers to read data streams.
    - Visualizing the processed data for insights.
  - Libraries used: `kafka-python`, `matplotlib`.
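The consumer side can be sketched as follows. The topic name `transactions` and the broker address are assumptions; `consume()` is only defined here, not run, because it needs `kafka-python` and a live broker:

```python
import json

def decode_message(raw: bytes) -> dict:
    """Deserialize one Kafka message value (assumed to be JSON-encoded)."""
    return json.loads(raw.decode("utf-8"))

def consume(topic="transactions", bootstrap_servers="localhost:9092", limit=10):
    """Read up to `limit` records from the topic and return them as dicts."""
    from kafka import KafkaConsumer  # deferred import: needs kafka-python + a broker

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",
        value_deserializer=decode_message,
    )
    records = []
    for message in consumer:
        records.append(message.value)
        if len(records) >= limit:
            break
    consumer.close()
    return records

# The notebook would then plot the collected records with matplotlib,
# e.g. transaction amounts over time.
```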
- **Streaming Application Using Spark Structured Streaming.ipynb**
  - Purpose: Building a streaming application for real-time data processing.
  - Key Components:
    - Setting up Spark Structured Streaming.
    - Processing streaming data in real time.
  - Libraries used: `pyspark`.
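A minimal shape for such an application, sketched under the assumption that records arrive on a Kafka topic named `transactions` with a JSON payload; the function only builds the streaming query and never starts it (running it requires `pyspark` plus the Spark–Kafka connector package):

```python
def build_query(bootstrap_servers="localhost:9092", topic="transactions"):
    """Build (but do not start) a streaming query that flags high-value transactions."""
    from pyspark.sql import SparkSession  # deferred import: needs pyspark installed
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Assumed message schema -- adjust to the actual producer's payload.
    schema = StructType([
        StructField("transaction_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    spark = SparkSession.builder.appName("fraud-stream").getOrCreate()
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", bootstrap_servers)
           .option("subscribe", topic)
           .load())

    parsed = (raw
              .select(from_json(col("value").cast("string"), schema).alias("t"))
              .select("t.*"))
    # Illustrative rule: flag anything above an assumed 500 threshold.
    return parsed.filter(col("amount") > 500)

# A caller would then start it, e.g.:
# build_query().writeStream.format("console").start().awaitTermination()
```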
## Requirements

- Python 3.x
- Jupyter Notebook or Google Colab
- Required Python libraries: `pandas`, `numpy`, `matplotlib`, `seaborn`, `scikit-learn`, `faker`, `kafka-python`, `pyspark`
## Getting Started

1. Clone the repository: `git clone <repository-url>`
2. Navigate to the project directory: `cd <repository-folder>`
3. Install the required libraries: `pip install -r requirements.txt`
4. Open the notebooks in Jupyter or any compatible environment (e.g., Google Colab).
5. Follow the instructions within each notebook to execute the cells in sequence.
## Datasets

The datasets used in this project are too large to include in the repository. Please email me at [your-email@example.com] to request access to the datasets.
## License

This project is licensed under the GNU General Public License. See the LICENSE file for details.
## Acknowledgments

- Python documentation
- Open-source libraries used in the project
- Kafka and Spark community resources