The world of (Big) Data is full of tools for working with data. We can pick from open source, commercial, and cloud native solutions. Unfortunately, choosing the right tools is not easy. To make this task easier for myself, I started writing down the most popular tools used in the data world. I also divided them into groups by working environment and by purpose.
Below is my list of tools. I hope it will be useful not only to me :)
If you prefer a spreadsheet format, the list is also available here.
If you think a tool should be added to this list, please let me know by email, in a comment under this post, or in a comment in the spreadsheet.
The table is divided into 6 columns:
- Purpose / Business goal - the primary purpose of the group of tools
- Self-managed (on premise or public cloud) - tools that we manage ourselves, regardless of whether we install them on premise or in a public cloud
- Amazon Web Services - solutions native to AWS
- Google Cloud Platform - solutions native to GCP
- Microsoft Azure - solutions native to Azure
- Third Party Cloud offering - cloud native solutions from third-party vendors, available in the PaaS or SaaS model (most often delivered on AWS and/or GCP and/or Azure resources)
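For readers who would rather query the list than scan it, the six-column layout described above could be modelled roughly as follows. This is only a minimal sketch: the `Row` class, the environment keys, and the `find_tools` helper are hypothetical names invented for this illustration, and the two sample rows copy just a few entries from the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keys for the five environment columns
# (the sixth column, the purpose, is stored separately).
ENVIRONMENTS = ("self_managed", "aws", "gcp", "azure", "third_party")


@dataclass
class Row:
    """One table row: a purpose plus the tools per environment."""
    purpose: str
    tools: Dict[str, List[str]] = field(default_factory=dict)

    def tools_for(self, environment: str) -> List[str]:
        if environment not in ENVIRONMENTS:
            raise ValueError(f"unknown environment: {environment}")
        # An empty cell in the table becomes an empty list here.
        return self.tools.get(environment, [])


# A small excerpt of the table, not the full list.
CATALOG = [
    Row("Data Warehouse", {
        "self_managed": ["Teradata", "Vertica"],
        "aws": ["Amazon Redshift"],
        "gcp": ["Google BigQuery"],
        "azure": ["Azure Synapse"],
        "third_party": ["Snowflake"],
    }),
    Row("SQL Distributed Databases", {
        "self_managed": ["CockroachDB"],
        "aws": ["Amazon Aurora global databases"],
        "gcp": ["Google Cloud Spanner"],
        "azure": ["Azure Cosmos DB (SQL API)"],
    }),
]


def find_tools(catalog: List[Row], purpose: str, environment: str) -> List[str]:
    """Return the tools listed for a purpose in a given environment."""
    for row in catalog:
        if row.purpose == purpose:
            return row.tools_for(environment)
    return []


print(find_tools(CATALOG, "Data Warehouse", "gcp"))
```

Keeping an empty cell as an empty list (rather than a missing row) mirrors the table, where some purposes simply have no entry for a given environment.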
Purpose / Business goal | Self-managed (on premise or public cloud) | Amazon Web Services | Google Cloud Platform | Microsoft Azure | Third Party Cloud offering |
---|---|---|---|---|---|
Data Lake (storing petabytes of data as files or objects) | - Hadoop Distributed File System (HDFS), - Apache Ozone (S3 compatible, based on HDFS), - MinIO (S3 compatible), - Alluxio (formerly known as Tachyon, virtual distributed storage system, HDFS, S3, GCS, ABS and other) | - Amazon Simple Storage Service (S3) (HDFS compatible) | - Google Cloud Storage (HDFS compatible) | - Azure Data Lake (gen 2) (HDFS compatible) | |
Data Catalog, Data Governance, Data lineage etc. | - Hive Metastore, - Apache Atlas, - Amundsen, - Marquez, - Datahub, - OpenMetadata, - EventCatalog | - AWS Glue Data Catalog | - Google Data Catalog | - Azure Data Catalog | |
Data LakeHouse | - HDFS + Apache Iceberg (made by Netflix), - HDFS + Apache Hudi (made by Uber), - HDFS + Databricks Delta Lake | (same as on premise, but on S3) | (same as on premise, but on GCS) | (same as on premise, but on ADL) | - Databricks SQL Analytics (preview) |
Data Warehouse | - IBM DB2 Warehouse, - Teradata, - Oracle Autonomous Database / Exadata, - Vertica | - Amazon Redshift | - Google BigQuery | - Azure Synapse (formerly Azure SQL Data Warehouse) | - Snowflake (AWS/GCP/Azure) |
Data Lake + Data Warehouse integration | - Apache Hive/Apache Spark/etc, - Oracle Big Data SQL, - IBM Db2 Big SQL | - Redshift Spectrum | - Google BigQuery (external tables) | - Azure Synapse (Spark SQL) | |
Big Data Platforms (Hadoop and Spark) | - Hortonworks Data Platform (HDP) [legacy], - Cloudera Distribution for Hadoop (CDH) [legacy], - Cloudera Data Platform, - HPE Ezmeral (previously MapR), - Apache Bigtop | - Amazon EMR | - Google Dataproc | - Azure HDInsight (based on Hortonworks Data Platform) [legacy, killed by Cloudera] | - Databricks Unified Data Analytics Platform (AWS/GCP/Azure), - Cloudera Data Platform Cloud (AWS/Azure) |
SQL on DataLake | - Apache Hive, - Apache Spark SQL, - Presto (PrestoDB, Facebook), - Trino (PrestoSQL), - Apache Drill, - Cloudera Impala, - Apache Pig, - Dremio | - Amazon Athena (Serverless, based on Presto) | - Google BigQuery (external tables) | - Azure Synapse (Spark SQL) | - Databricks SQL Analytics, - Ahana Cloud (Managed Presto on AWS) |
SQL relational databases | - PostgreSQL, - MySQL, - Oracle Database, - MS SQL Server | - Amazon Relational Database Service (RDS) + Amazon Aurora | - Google Cloud SQL | - Azure SQL Database, - Azure Database | |
SQL Distributed Databases | - CockroachDB | - Amazon Aurora global databases | - Google Cloud Spanner | - Azure Cosmos DB (SQL API) | |
NoSQL Databases | - Apache HBase (Hadoop Database), - Apache Cassandra / Scylla, - Accumulo | - Amazon DynamoDB, - Amazon Keyspaces (Cassandra as Service) | - Google Bigtable (HBase API) | - Azure CosmosDB (Cassandra API), - Azure Storage Tables | - DataStax Astra (AWS/GCP/Azure) |
NoSQL Document Database | - MongoDB | - Amazon DocumentDB | - Google Cloud Datastore | - Azure DocumentDB, - Azure Cosmos DB API for MongoDB | - MongoDB Cloud (AWS/GCP/Azure) |
NoSQL Graph Database | - Neo4j | - Amazon Neptune | | | |
NoSQL Cache | - Redis, - Memcached | - Amazon ElastiCache (Redis or Memcached) | - Google Cloud Memorystore | - Azure Redis Cache | |
NoSQL search engine | - Elastic Stack (Elasticsearch, Kibana, Logstash, etc), - Apache Solr (available in big data distribution) | - Amazon CloudSearch, - Amazon Elasticsearch Service (ES) | | - Azure Search | - Elastic Cloud (AWS/GCP/Azure) |
Broker | - RabbitMQ, - ActiveMQ, - ZeroMQ | - Amazon MQ, - Amazon Simple Queue Service (SQS), - Amazon Simple Notification Service (SNS) | - Google Cloud Pub/Sub | - Azure Service Bus + Azure Queue Storage, - Azure Notification Hubs | |
Streaming - Broker | - Apache Kafka, - Apache Pulsar | - Amazon Kinesis Data Streams | - Google Pub/Sub | - Azure Event Hubs | |
Stream processing | - Apache Spark [Structured] Streaming, - Apache Flink, - Apache Beam, - Kafka Streams, - Apache Storm, - Apache Heron (made by Twitter) | - Amazon Kinesis Data Analytics (based on Apache Flink) | - Google Dataflow (Apache Beam) | - Azure Stream Analytics | |
Streaming platform | - Confluent Platform (based on Apache Kafka), - Hortonworks/Cloudera DataFlow (based on Apache Kafka and NiFi), - Ververica Platform (based on Apache Flink) | - Amazon Managed Streaming for Apache Kafka (Amazon MSK), - Amazon Kinesis (multiple tools inside) | - Google Dataproc (Big Data platform with Apache Kafka) | - Azure HDInsight (based on Hortonworks DataFlow) | - Confluent Cloud (AWS/GCP/Azure) |
Real time data transformation | - Kafka Connect, - Apache NiFi + MiNiFi, - Apache Flume | - Amazon Kinesis Data Firehose | - Google Dataflow (Apache Beam) | - Azure HDInsight (NiFi), - Azure Data Factory | |
Batch data transformation (ETL, ELT) | - Apache Beam, - Apache Spark, - dbt, - Airbyte, - Meltano, - Apache Hop, - Twister2, - Apache Samza (made by LinkedIn) | - AWS Glue (serverless Apache Spark), - AWS Data Pipeline | - Google Dataflow (Apache Beam), - Cloud DataPrep (created by Trifacta) | - Azure Data Factory (with support for Databricks/Spark or whole HDInsight platform) | |
Integration between Relational Database and Data Lake | - Apache Sqoop, - Apache Spark | - AWS Database Migration Service (AWS DMS) | | - Azure Database Migration Service | |
Change Data Capture | - Debezium + Kafka | - AWS Database Migration Service (AWS DMS), - Debezium + Kinesis | - Debezium + Pub/Sub | | |
Task Orchestration | - Apache Airflow, - Apache Oozie (big data distro), - Luigi (Spotify), - Azkaban | - AWS Step Functions, - AWS Data Pipeline, - Amazon Managed Workflows for Apache Airflow (MWAA), - Amazon Simple Workflow Service | - Google Cloud Composer (Apache Airflow) | - Azure Data Factory | - Astronomer (Managed Airflow) |
Machine Learning and/or Data Science Platform | - Anaconda, - Dataiku, - H2O.ai | - Amazon SageMaker | - Google Cloud AutoML, - Google Cloud Machine Learning Engine, - Google Cloud Datalab | - Azure Machine Learning, - Azure Machine Learning Studio | - Databricks Unified Analytics Platform, - Alteryx |
Data Science Notebooks | - Jupyter Family, - BeakerX, - Apache Zeppelin, - Polynote (made by Netflix) | - Amazon EMR Notebooks, - Amazon SageMaker Notebooks | - Google Colaboratory (Colab) | - Azure Notebooks (killed by MS) | |
Data visualization | - Kibana (Elastic Stack), - Apache Superset, - Redash (by Databricks), - Metabase, - Tableau, - Qlik, - Microsoft Power BI (on premise edition) | - Amazon QuickSight | - Google Data Studio, - Google Looker | - Microsoft Power BI | - Tableau Cloud |
Production Ready ML services | (commercial offering by big tech vendors) | - AWS ML Services | - Google Cloud AI Building Blocks | - Azure Cognitive Services | |