# PySpark Sales Data Analysis

A simple PySpark application that analyzes sample sales data, combining distributed processing with basic statistical analysis.

## What it does

This app processes sample sales data to:
- Calculate revenue by product category
- Find top-selling products
- Perform statistical analysis (mean, median, correlations)
- Detect price outliers
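
The steps above can be illustrated without a Spark cluster. The sketch below is plain Python using only the standard library; it is not the code in `sales_data.py`. The sample rows, the field layout, the choice of quantity as the "top-selling" measure, and the 1.5 × IQR outlier rule are all assumptions made for illustration, since this README does not specify them. Correlations are left out.

```python
from collections import defaultdict
from statistics import mean, median, quantiles

# Hypothetical sample rows: (product, category, unit_price, quantity).
sales = [
    ("Mouse", "Electronics", 25.0, 10),
    ("Keyboard", "Electronics", 45.0, 6),
    ("Headphones", "Electronics", 60.0, 4),
    ("Laptop", "Electronics", 1200.0, 2),
    ("Notebook", "Stationery", 5.0, 40),
    ("Pen", "Stationery", 2.0, 50),
    ("Stapler", "Stationery", 12.0, 8),
    ("Desk Lamp", "Furniture", 35.0, 5),
]

# Revenue by category: sum of unit_price * quantity per category.
revenue_by_category = defaultdict(float)
for product, category, price, qty in sales:
    revenue_by_category[category] += price * qty

# Top-selling products, ranked here by quantity sold (an assumption).
top_products = [row[0] for row in sorted(sales, key=lambda r: r[3], reverse=True)[:3]]

# Basic statistics on unit price.
prices = [row[2] for row in sales]
mean_price = mean(prices)
median_price = median(prices)

# Price outliers via the 1.5 * IQR rule (an assumed method).
q1, _, q3 = quantiles(prices, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [row[0] for row in sales if row[2] < lower or row[2] > upper]

print(dict(revenue_by_category))
print(top_products)
print(mean_price, median_price)
print(outliers)
```

In a PySpark version, the same logic would typically be expressed with DataFrame operations such as `groupBy(...).agg(...)` for the per-category totals and `approxQuantile` for the quartiles.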

## Quickstart

1. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

2. Run the analysis:
   ```bash
   python sales_data.py
   ```

## Set up virtual environment

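Build the virtual environment with Docker. The build installs the packages listed in `requirements.txt` and exports the result to the current directory (`dest=.`). Use the command that matches your CPU architecture: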
```bash
# arm64
docker buildx build --platform linux/arm64 --target export --output type=local,dest=. .

# amd64
docker buildx build --platform linux/amd64 --target export --output type=local,dest=. .
```

## Files

- `sales_data.py` - Main application
- `requirements.txt` - Python dependencies
- `Dockerfile` - Virtual environment setup
