Understanding Data Sources:
Recognizing the nature of data sources is paramount. Sources can range from simple spreadsheets to complex transactional databases. Key considerations include:
- Data Formats: Structured (databases) or semi-structured (JSON, XML).
- ACID Properties (Transactional Databases): Atomicity, Consistency, Isolation, and Durability ensure data integrity. Relaxing these guarantees can improve performance for analytical workloads, which is one reason OLTP databases are generally a poor fit for large-scale analytics.
Connecting to Data:
Connecting to source systems can be achieved through:
- APIs: The most common approach, offering flexibility and control (e.g., via Python scripts or tools such as Alteryx); a minimal pull sketch follows this list.
- Third-Party Tools: Simplify integration with pre-built connectors (e.g., Airbyte).
- Direct SQL connections and file exports: avoid these whenever possible; they tend to be brittle and harder to secure and maintain.
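For illustration, a minimal Python sketch of the API approach, assuming a hypothetical paginated REST endpoint (https://api.example.com/v1/orders) that returns a JSON array and accepts page and page_size query parameters:

```python
# Minimal sketch of an API-based pull. The endpoint, its pagination
# parameters, and the JSON-array response shape are assumptions for
# this example.
import requests

def fetch_orders(base_url="https://api.example.com/v1/orders", page_size=500):
    """Pull all pages from the (hypothetical) orders endpoint."""
    records, page = [], 1
    while True:
        resp = requests.get(
            base_url,
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()   # surface HTTP errors instead of silently continuing
        batch = resp.json()
        if not batch:             # an empty page means there is no more data
            break
        records.extend(batch)
        page += 1
    return records
```

Third-party connectors package up this kind of logic (plus retries, schema handling, and rate-limit backoff) so it does not have to be rewritten per source.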
Data Extraction Strategies:
Choosing the right extraction method is essential:
- Batch Extraction:
- Full Refresh: Overwrites existing data with the entire dataset.
- Incremental Refresh: Extracts only records that have changed since the last run, based on timestamps or other change indicators (see the sketch after this list).
- Streaming Extraction:
- Change Data Capture (CDC): Captures real-time changes, often using logs. Ideal for database replication.
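To make incremental refresh concrete, a minimal Python sketch using a timestamp watermark; the orders table, its updated_at column, and the use of sqlite3 are assumptions for the example, not part of the notes above:

```python
# Minimal sketch of a timestamp-based incremental refresh. Table and
# column names are illustrative; sqlite3 stands in for any SQL source.
import sqlite3

def extract_incremental(conn: sqlite3.Connection, last_watermark: str):
    """Return rows changed since the previous run, plus the new watermark."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark only if new rows arrived; persist it for the next run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```

A full refresh is the degenerate case of the same query with no WHERE clause; CDC removes the need for a watermark column by reading changes from the database log instead.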
Key Extraction Considerations:
Before extracting data, ask:
- What’s the business value?
- Can the data be reused?
- Where will it be stored, and in what format?
- How often does it need updating?
- What’s the expected data volume and quality?
Technical Considerations:
- Bounded vs. Unbounded Data: Distinguishes between finite datasets (e.g., daily sales reports) and continuous data streams (e.g., sensor readings). This impacts processing strategies, with bounded data often suitable for batch processing and unbounded data requiring stream processing.
- Frequency (Processing Cadence): Defines how often data is processed:
- Batch Processing: Processes large datasets at scheduled intervals (e.g., daily, hourly).
- Micro-Batch Processing: Processes small batches at short intervals (e.g., every few seconds or minutes).
- Real-Time Processing: Processes data as it arrives, providing immediate results.
- Synchronous vs. Asynchronous Processing: Determines execution flow:
- Synchronous: Operations execute sequentially, waiting for each to complete before starting the next.
- Asynchronous: Operations execute concurrently or in parallel, improving responsiveness but requiring careful resource management (see the asyncio sketch after this list).
- Serialization/Deserialization: Manages data format conversion for storage and transmission, impacting efficiency. Common formats include JSON, XML, and Protocol Buffers (see the example after this list).
- Throughput/Scalability: Assesses system capacity and its ability to handle increasing data volumes. Scalability is often achieved through:
- Horizontal Scaling: Adding more machines.
- Vertical Scaling: Increasing resources (CPU, memory) of existing machines.
- Push/Pull/Poll Data Transfer: Describes how data is transferred:
- Push: Data source actively sends data to the processing system.
- Pull: Processing system requests data from the source.
- Poll: Processing system periodically checks the source for new data (see the polling sketch after this list).
- Reliability/Durability: Ensures data integrity and persistence against failures through techniques like data replication, fault tolerance, and backups.
- Payload (Data Size and Structure): Refers to the size and organization of the data being processed, impacting network bandwidth, storage requirements, and processing efficiency.
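To make the synchronous/asynchronous distinction concrete, a minimal Python sketch using asyncio, with asyncio.sleep standing in for I/O-bound work such as API calls:

```python
# Minimal sketch contrasting synchronous and asynchronous execution.
# asyncio.sleep(1) stands in for a one-second I/O-bound call.
import asyncio
import time

def fetch_sync(source: str) -> str:
    time.sleep(1)              # blocking: the next call cannot start until this returns
    return f"{source}: done"

async def fetch_async(source: str) -> str:
    await asyncio.sleep(1)     # non-blocking: other tasks run while this one waits
    return f"{source}: done"

async def main():
    # Three sequential fetch_sync calls would take ~3 seconds;
    # gathering the async versions completes in ~1 second.
    results = await asyncio.gather(*(fetch_async(s) for s in ["crm", "erp", "web"]))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```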
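Serialization/deserialization in its simplest form, sketched with the standard json module; binary formats such as Protocol Buffers follow the same pattern but require a schema and produce more compact payloads:

```python
# Minimal sketch of serialization (object -> transport format) and
# deserialization (transport format -> object) using JSON.
import json

record = {"order_id": 42, "amount": 19.99, "currency": "EUR"}

encoded = json.dumps(record)   # serialize: dict -> JSON string for storage or transmission
decoded = json.loads(encoded)  # deserialize: JSON string -> dict for processing

assert decoded == record
```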
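And a minimal sketch of poll-based transfer, assuming a hypothetical events endpoint whose records carry an id field:

```python
# Minimal sketch of polling: the processing system checks the source
# on a fixed interval and handles only records it has not seen before.
import time
import requests

def poll(endpoint="https://api.example.com/v1/events", interval_seconds=60):
    seen_ids = set()
    while True:
        resp = requests.get(endpoint, timeout=30)
        resp.raise_for_status()
        for event in resp.json():
            if event["id"] not in seen_ids:   # skip events handled in earlier polls
                seen_ids.add(event["id"])
                print("processing", event["id"])
        time.sleep(interval_seconds)          # wait before checking the source again
```

Push and pull differ only in which side initiates the transfer; polling is pull on a schedule, and the interval trades freshness against load on the source.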
Potential Challenges:
- Data quality issues, security breaches, data spikes, extraction interruptions, and API rate limits.