Projects with this topic
DataRider block for ETL streaming with Spark and Scala
Commercial dashboard built from scratch, with relational modeling in DBeaver (SQL), a transformation pipeline in Power Query, and interactive visualizations in Power BI. A portfolio study project.
Configuration and data workflows for an Apache Airflow instance serving the DDRplatform
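The repository itself isn't shown here, but a minimal sketch of the kind of DAG such an instance would run might look like the following; the DAG id, schedule, and task callables are illustrative assumptions, not taken from the project.

```python
# Minimal illustrative Airflow DAG; names and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder extract step; the real project defines its own workflows.
    print("extracting")


def load():
    print("loading")


with DAG(
    dag_id="ddr_example_etl",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task
```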
Python async video metadata processing pipeline with multi-source ingestion and ETL transforms.
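As a sketch of the pattern rather than the project's actual code, a multi-source async pipeline in Python typically fans sources into an asyncio.Queue and drains it with worker tasks; all names below are hypothetical.

```python
# Illustrative asyncio fan-in pipeline; sources and transform are made up.
import asyncio


async def source(name: str, queue: asyncio.Queue) -> None:
    # Stand-in for one ingestion source (e.g. an API or a filesystem walker).
    for i in range(3):
        await queue.put({"source": name, "id": i})
    await queue.put(None)  # per-source end-of-stream marker


async def worker(queue: asyncio.Queue, n_sources: int) -> None:
    done = 0
    while done < n_sources:
        item = await queue.get()
        if item is None:
            done += 1
            continue
        # Stand-in ETL transform: normalize the record into one schema.
        print({"video_id": f"{item['source']}-{item['id']}"})


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    sources = [source("youtube", queue), source("vimeo", queue)]
    await asyncio.gather(*sources, worker(queue, n_sources=len(sources)))


asyncio.run(main())
```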
End-to-end AWS data lake pipeline for fleet telemetry data using S3, Spark, and Athena. Includes partitioned Parquet ETL, vehicle safety analytics, and SQL queries for overspeed and harsh braking detection.
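A hedged sketch of the partitioned-Parquet-plus-SQL pattern the description names; the bucket paths, column names, and the 120 km/h threshold are assumptions, not the project's values.

```python
# Illustrative PySpark job: partitioned Parquet ETL plus an overspeed query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fleet-telemetry").getOrCreate()

telemetry = spark.read.json("s3://example-bucket/raw/telemetry/")  # hypothetical path
(telemetry
    .write
    .mode("overwrite")
    .partitionBy("event_date")  # partition column is an assumption
    .parquet("s3://example-bucket/curated/telemetry/"))

telemetry.createOrReplaceTempView("telemetry")
overspeed = spark.sql("""
    SELECT vehicle_id, COUNT(*) AS overspeed_events
    FROM telemetry
    WHERE speed_kmh > 120  -- assumed threshold
    GROUP BY vehicle_id
    ORDER BY overspeed_events DESC
""")
overspeed.show()
```

The same query shape would run unchanged in Athena once the curated Parquet is registered as an external table.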
Plain text boilerplate removal using character n-gram frequency across a corpus. Builds a template model from a sample, filters files in a single linear pass, and validates automatically. Includes an obfuscated mode where the model is a set of integers and output filenames are hashed: the operator never reads the content. AWK handles character processing, Bash handles orchestration, and a Lisp layer is planned for positional classification.
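The project itself is AWK/Bash, but the core idea is easy to sketch in Python as a rough analogue: the "model" is the set of (hashed) frequent character n-grams from a boilerplate sample, and a line is filtered when enough of its n-grams hit that set. The n=3, the frequency cutoff, and the overlap threshold below are arbitrary illustrative choices.

```python
# Rough Python analogue of the char n-gram boilerplate model; parameters are
# arbitrary, and the real project implements this in AWK with Bash glue.
from collections import Counter


def ngrams(text: str, n: int = 3):
    return (text[i:i + n] for i in range(len(text) - n + 1))


def build_model(boilerplate_sample: str, n: int = 3) -> set:
    # The model is just a set of integers (hashes of frequent n-grams),
    # mirroring the obfuscated mode described above.
    counts = Counter(ngrams(boilerplate_sample, n))
    return {hash(g) for g, c in counts.items() if c >= 2}


def looks_like_boilerplate(line: str, model: set, n: int = 3,
                           cutoff: float = 0.6) -> bool:
    grams = [hash(g) for g in ngrams(line, n)]
    if not grams:
        return False
    overlap = sum(1 for g in grams if g in model)
    return overlap / len(grams) >= cutoff


model = build_model("Unsubscribe here. Sent from my mailer. Unsubscribe here.")
for line in ["Unsubscribe here.", "The quarterly numbers improved."]:
    print(line, "->", looks_like_boilerplate(line, model))
```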
This project/library contains common elements related to ETL processes...
This is a study project built using FastAPI to practice microservice architecture, data normalization techniques, and clean API design.
The service receives raw payloads from different simulated sources and transforms them into a standardized and validated structure.
It centralizes normalization logic and demonstrates how to build a scalable, maintainable, and test-friendly data processing layer.
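As a minimal sketch of that normalization layer (the route, field names, and simulated source formats are invented for illustration, not the study project's actual API):

```python
# Minimal illustrative FastAPI normalization service; the route and field
# names are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class NormalizedRecord(BaseModel):
    source: str
    user_id: str
    email: str


@app.post("/normalize", response_model=NormalizedRecord)
def normalize(payload: dict) -> NormalizedRecord:
    # Each simulated source uses its own keys; map them onto one schema.
    if "uid" in payload:  # hypothetical "source A" style payload
        return NormalizedRecord(source="a", user_id=str(payload["uid"]),
                                email=payload["mail"].strip().lower())
    return NormalizedRecord(source="b", user_id=str(payload["id"]),
                            email=payload["email"].strip().lower())
```

Keeping the per-source mapping behind one validated response model is what makes this layer easy to test: each source format becomes a single branch with a known output shape.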
End-to-end solution for data migration and analysis using Python, FastAPI, Kafka, and PostgreSQL. Implements an asynchronous data pipeline and a RESTful analytics API, fully containerized with Docker Compose for easy, reproducible deployment.
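A hedged sketch of the asynchronous consumer side of such a pipeline, using aiokafka; the topic, broker address, and record handling are assumptions.

```python
# Illustrative async Kafka consumer; topic, broker, and handling are assumed.
import asyncio
import json

from aiokafka import AIOKafkaConsumer


async def consume() -> None:
    consumer = AIOKafkaConsumer(
        "migrations",                        # hypothetical topic
        bootstrap_servers="localhost:9092",  # hypothetical broker
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    await consumer.start()
    try:
        async for msg in consumer:
            # In a real pipeline this would upsert into PostgreSQL.
            print(msg.topic, msg.value)
    finally:
        await consumer.stop()


asyncio.run(consume())
```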
Unified project demonstrating both batch analytics and real-time streaming pipelines with Apache Spark:
Batch (PySpark/Jupyter): Processed S&P 500 stock data, applied transformations, and ran distributed computations.
Streaming (Spark + Kafka): Built a streaming pipeline that consumes Kafka topics, processes messages in real time, and visualizes outputs; a sketch follows below.
Deployed using Docker and Jupyter for reproducibility.
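As a sketch of the streaming half, a minimal Structured Streaming job reading from Kafka looks roughly like this; the broker address, topic name, and console sink are assumptions.

```python
# Illustrative Spark Structured Streaming job reading from Kafka; the broker,
# topic, and console sink are assumptions for the sketch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
    .option("subscribe", "stocks")                        # hypothetical topic
    .load())

# Kafka rows carry binary key/value columns; decode the value for processing.
messages = stream.selectExpr("CAST(value AS STRING) AS message")

query = (messages.writeStream
    .format("console")
    .outputMode("append")
    .start())
query.awaitTermination()
```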
Analyzed decades of historical weather station data (1920–1940) using Hadoop MapReduce. Filtered operable stations, computed descriptive statistics (min, max, mean, median), and produced reports/graphs. Designed modular MRJobs to chain tasks together for scalable processing.
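A minimal mrjob sketch of one such statistic; the CSV layout (station id, date, temperature) and the class name are assumptions, and the real project chains several jobs like this together.

```python
# Illustrative MRJob computing per-station min/max temperature; the input
# format is an assumption.
from mrjob.job import MRJob


class MRStationMinMax(MRJob):

    def mapper(self, _, line):
        station_id, _date, temp = line.split(",")
        yield station_id, float(temp)

    def reducer(self, station_id, temps):
        temps = list(temps)
        yield station_id, {"min": min(temps), "max": max(temps)}


if __name__ == "__main__":
    MRStationMinMax.run()
```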
Advanced data synchronization framework.
Reporting for MIT Club of Northern California
This project contains two tasks, written in Python, that implement the execution of task chains (DAGs) in an Airflow environment.
Crawl Home Depot and extract its schema.org/Product data.
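A hedged sketch of the extraction step: schema.org Product data is usually embedded in pages as JSON-LD, so once a page has been fetched it can be pulled out like this. The URL is a placeholder, the page structure is an assumption, and real crawling must respect the site's terms and robots.txt.

```python
# Illustrative JSON-LD extraction for schema.org/Product; the URL is a
# placeholder and the page structure is an assumption.
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com/some-product", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

products = []
for script in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(script.string or "")
    except json.JSONDecodeError:
        continue
    # JSON-LD blocks may hold a single object or a list of objects.
    for item in data if isinstance(data, list) else [data]:
        if isinstance(item, dict) and item.get("@type") == "Product":
            products.append({"name": item.get("name"),
                             "offers": item.get("offers")})

print(products)
```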