
Data Handling with Python

DISCLAIMER

  • This is investment-strategy theory, presented in this example for educational purposes only. It is not intended to be, and should not be considered, financial advice or an invitation to invest real money. Trading and investing in financial markets involve significant risks, and you should consult a qualified financial advisor before making any investment decisions.

  • The educational institution and I are not responsible for any actions taken based on the educational content of this example, including but not limited to financial losses or other types of damages. Any application of these techniques to real-world investing is at the individual’s own risk and discretion. I will not be held liable for any misuse of the techniques or information provided, including any financial losses incurred. Always perform your own due diligence and consult professionals when considering any financial investments. The only purpose is to demonstrate the workflow process.

Why am I using Python for data analysis?

  • Reason (1): Data handling is about organizing data; its methodological and technical aspects affect every stage of collecting, recording, and analyzing data.

  • Reason (2): Using Python to deal with data requires Python skills, library support, and know-how about the data itself.

  • Reason (3): Data handling is about learning to move data into a tabular form that can be analyzed and graphed. Companies collect a lot of data that by itself cannot be analyzed; it has to be transformed to extract insights.

Below is a step-by-step process for handling data in Python, from ingestion to deployment. I note common tools and include short examples you can adapt to your use case.

1) Set up your environment

  • Create a virtual environment and install core libraries:
  • Data wrangling: pandas, pyarrow
  • Numeric: numpy
  • Visualization: matplotlib, seaborn, plotly
  • Validation: pandera or pydantic
  • Storage/IO: sqlalchemy (databases), requests (HTTP), fsspec (cloud files)
  • Performance (optional): polars, duckdb

2) Ingest data

  • Local files:
  • CSV/TSV: pandas.read_csv("file.csv")
  • Parquet: pandas.read_parquet("file.parquet", engine="pyarrow")
  • Excel: pandas.read_excel("file.xlsx", sheet_name=0)
  • Databases:

    Use SQLAlchemy: pandas.read_sql("SELECT …", con=engine)

  • APIs/HTTP:

    Use requests to fetch, then pd.DataFrame.from_records(json)
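
A minimal ingestion sketch combining the three sources; the file path, connection string, and API URL are placeholders assumed for illustration:

import pandas as pd
import requests
from sqlalchemy import create_engine

# Local file (placeholder path)
orders = pd.read_csv("data/raw/orders.csv", parse_dates=["order_date"])

# Database (placeholder connection string)
engine = create_engine("sqlite:///analytics.db")
customers = pd.read_sql("SELECT id, name, region FROM customers", con=engine)

# HTTP API (placeholder URL); assumes the endpoint returns a JSON list of records
resp = requests.get("https://example.com/api/orders", timeout=30)
resp.raise_for_status()
api_orders = pd.DataFrame.from_records(resp.json())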

3) Validate inputs early

  • Quick checks:

    df.info(), df.head(), df.shape, df.describe(include="all")
    assert not df.duplicated(subset=["id"]).any()

  • Schema with pandera:

    Define column types, ranges, and nullability; run schema.validate(df)
    Handle encoding, types, and missing files with try/except and explicit dtype mapping
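
As an illustration, a pandera schema for a hypothetical sales table (the column names, ranges, and allowed statuses are assumptions):

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema(
    {
        "id": pa.Column(int, nullable=False, unique=True),
        "amount": pa.Column(float, checks=pa.Check.ge(0)),
        "status": pa.Column(str, checks=pa.Check.isin(["active", "completed", "cancelled"])),
    },
    coerce=True,  # cast columns to the declared types before checking
)

df = pd.read_csv("data/raw/sales.csv")  # placeholder path

# Quick checks before the schema
print(df.shape)
assert not df.duplicated(subset=["id"]).any()

# Raises a SchemaError with a readable report if any check fails
validated = schema.validate(df)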

4) Clean and standardize

  • Column names: lowercase, snake_case
  • Types: df.astype({…}) or pd.to_datetime, pd.to_numeric
  • Missing values:

    Drop rows/cols: df.dropna(…)
    Impute: df.fillna({…}) or use statistics
    Duplicates: df.drop_duplicates(subset=[…])
    Whitespace/casing: df["col"] = df["col"].str.strip().str.lower()
    Outliers: clip or winsorize depending on context
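
A short cleaning sketch on a hypothetical sales DataFrame (column names and thresholds are assumptions):

import pandas as pd

df = pd.read_csv("data/raw/sales.csv")  # placeholder path

# Standardize column names to lowercase snake_case
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Fix types explicitly
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Missing values, duplicates, whitespace/casing
df["region"] = df["region"].fillna("unknown")
df = df.drop_duplicates(subset=["id"])
df["customer"] = df["customer"].str.strip().str.lower()

# Clip extreme outliers (the 99th-percentile cap is illustrative)
df["amount"] = df["amount"].clip(lower=0, upper=df["amount"].quantile(0.99))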

5) Transform and enrich

  • Filtering: df.query("amount > 0 and status == 'active'")
  • Feature engineering: create new columns from existing ones
  • Grouping/aggregations: df.groupby([…]).agg({…}).reset_index()
  • Joins: df.merge(df2, on="key", how="left")
  • Reshaping: melt/pivot, wide ↔ long
  • Date handling: use dt accessors (df["date"].dt.month)
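
For example, enriching and reshaping two small hypothetical tables:

import pandas as pd

# Hypothetical inputs
orders = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "customer_id": [10, 10, 20],
        "order_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-02-15"]),
        "amount": [120.0, 80.0, 200.0],
    }
)
customers = pd.DataFrame({"customer_id": [10, 20], "region": ["north", "south"]})

# Enrich with a left join, derive a date feature, then aggregate
df = orders.merge(customers, on="customer_id", how="left")
df["month"] = df["order_date"].dt.month
monthly = df.groupby(["region", "month"]).agg(total=("amount", "sum")).reset_index()

# Reshape long -> wide: one row per region, one column per month
wide = monthly.pivot(index="region", columns="month", values="total")
print(wide)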

6) Handle large data (performance)

Even though Excel is a good tool for handling data, its performance decreases drastically when I have to process more than 200 MB, so Python is the best alternative.

  • Chunked reading: for chunk in pd.read_csv(…, chunksize=100_000): …
  • Use efficient formats: Parquet with compression="snappy" or "zstd"
  • Vectorize operations; avoid Python loops
  • Consider polars or duckdb for bigger-than-RAM workflows
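
A chunked-processing sketch that aggregates a large CSV without loading it all into memory (the file name and columns are assumptions):

import pandas as pd

partials = []
# Process the file in 100k-row chunks and aggregate each chunk separately
for chunk in pd.read_csv("data/raw/big_sales.csv", chunksize=100_000):
    partials.append(chunk.groupby("region")["amount"].sum())

# Combine the partial aggregates into the final result
totals = pd.concat(partials).groupby(level=0).sum()

# Store the result in an efficient columnar format
totals.to_frame("total_amount").to_parquet(
    "data/processed/region_totals.parquet", compression="zstd"
)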

7) Exploratory analysis and visualization

  • Summary stats: df.describe(), value_counts()
  • Plots:

    matplotlib/seaborn for static plots (hist, box, line, bar)
    plotly for interactive exploration

  • Check distributions, correlations (df.corr(numeric_only=True))
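
A small exploratory sketch with matplotlib/seaborn (column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_parquet("data/processed/sales.parquet")  # placeholder path

# Summary stats and category frequencies
print(df.describe())
print(df["region"].value_counts())

# Distribution and comparison plots
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["amount"], ax=axes[0])
sns.boxplot(data=df, x="region", y="amount", ax=axes[1])
plt.tight_layout()
plt.show()

# Numeric correlations
print(df.corr(numeric_only=True))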

8) Persist results

  • Files:

    df.to_parquet("data.parquet", index=False)
    df.to_csv("data.csv", index=False)

  • Databases:

    df.to_sql("table_name", con=engine, if_exists="replace/append", index=False)

  • Versioned data:

    Include timestamps or commit hashes in filenames/paths
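
A sketch of versioned persistence; the table name, timestamp convention, and connection string are assumptions:

from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_parquet("data/processed/sales.parquet")  # placeholder path

# Versioned file: embed a UTC timestamp in the file name
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
df.to_parquet(f"data/processed/sales_{stamp}.parquet", index=False)

# Database load ("append" keeps history, "replace" overwrites)
engine = create_engine("sqlite:///analytics.db")
df.to_sql("sales_clean", con=engine, if_exists="replace", index=False)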

9) Reproducibility and structure

  • Use a clear project layout:

    data/raw, data/processed, notebooks/, src/, tests/

  • Configuration:

    Store paths/params in a config file (YAML, TOML) or env vars

  • Notebooks for exploration; .py modules for production logic
  • Set random seeds when using stochastic steps
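
For example, reading paths and parameters from a TOML config and fixing the random seed (the file name and keys are assumptions):

import tomllib  # standard library in Python 3.11+; use the tomli package on older versions

import numpy as np

# config.toml might contain:
# [paths]
# raw = "data/raw/sales.csv"
# processed = "data/processed/sales.parquet"
# [params]
# seed = 42
with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

np.random.seed(cfg["params"]["seed"])
raw_path = cfg["paths"]["raw"]
processed_path = cfg["paths"]["processed"]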

10) Logging and error handling

  • Use the logging module with INFO/WARN/ERROR levels
  • Wrap IO and key transforms with try/except and informative messages
  • Validate assumptions with asserts or pandera
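
A minimal logging and error-handling sketch around the IO step (the path and column are placeholders):

import logging

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

def load_sales(path: str) -> pd.DataFrame:
    """Read the raw sales file, logging what happened."""
    try:
        df = pd.read_csv(path, parse_dates=["order_date"])
        logger.info("Loaded %d rows from %s", len(df), path)
        return df
    except FileNotFoundError:
        logger.error("Input file not found: %s", path)
        raise
    except pd.errors.ParserError:
        logger.error("Could not parse %s; check delimiter and encoding", path)
        raise

df = load_sales("data/raw/sales.csv")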

11) Testing and data quality

  • Unit tests for transforms (pytest)
  • Golden-data tests: compare outputs to known-small fixtures
  • Data quality checks: duplicates, ranges, referential integrity, null thresholds
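
A pytest sketch for a transform, including a golden-data comparison (the transform and fixtures are hypothetical):

# test_transforms.py -- run with: pytest
import pandas as pd
import pandas.testing as pdt

def add_year(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transform under test."""
    out = df.copy()
    out["year"] = out["order_date"].dt.year
    return out

def test_add_year_matches_golden():
    raw = pd.DataFrame({"order_date": pd.to_datetime(["2024-01-05", "2023-06-01"])})
    golden = raw.copy()
    golden["year"] = [2024, 2023]
    # In practice the golden frame would be loaded from a small fixture file
    pdt.assert_frame_equal(add_year(raw), golden, check_dtype=False)

def test_no_duplicate_ids():
    df = pd.DataFrame({"id": [1, 2, 3]})
    assert not df.duplicated(subset=["id"]).any()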

12) Automation and orchestration

  • Make tasks idempotent (safe to rerun)
  • Use a task runner: make, tox, or Python scripts with CLI (argparse/typer)
  • For pipelines: Airflow, Prefect, Dagster (optional for larger teams)
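
An idempotent CLI sketch with argparse (file paths and the single cleaning step are placeholders):

# pipeline.py -- run with: python pipeline.py --input data/raw/sales.csv --output data/processed/sales.parquet
import argparse
import pathlib

import pandas as pd

def main() -> None:
    parser = argparse.ArgumentParser(description="Clean the raw sales file.")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--force", action="store_true", help="Rerun even if the output exists")
    args = parser.parse_args()

    out = pathlib.Path(args.output)
    if out.exists() and not args.force:
        print(f"{out} already exists; skipping (idempotent rerun)")
        return

    df = pd.read_csv(args.input).drop_duplicates()
    out.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out, index=False)

if __name__ == "__main__":
    main()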

13) Documentation

  • README with setup and run instructions
  • Docstrings for functions; example inputs/outputs
  • Data dictionary describing fields, types, and semantics
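
For example, a docstring with a small, testable input/output example (the function is hypothetical):

import pandas as pd

def total_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Sum order amounts per region.

    Parameters
    ----------
    df : DataFrame with columns "region" (str) and "amount" (float).

    Returns
    -------
    DataFrame with columns "region" and "total_amount", one row per region.

    Example
    -------
    >>> df = pd.DataFrame({"region": ["north", "south"], "amount": [3.0, 5.0]})
    >>> total_by_region(df)["total_amount"].tolist()
    [3.0, 5.0]
    """
    return df.groupby("region", as_index=False).agg(total_amount=("amount", "sum"))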

14) Security and compliance

  • Never hardcode secrets; use env vars or a secret manager
  • PII handling: mask/anonymize, access controls
  • Validate external inputs; sanitize file names/paths
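
A sketch of reading secrets from the environment instead of hardcoding them (the variable name is an assumption):

import os

from sqlalchemy import create_engine

# Fail fast if the secret is missing rather than falling back to a hardcoded value
db_url = os.environ.get("ANALYTICS_DB_URL")
if db_url is None:
    raise RuntimeError("Set the ANALYTICS_DB_URL environment variable")

engine = create_engine(db_url)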

15) Deployment options

  • Package the pipeline as a CLI (typer/click)
  • Containerize (Docker) if needed
  • Schedule via cron or a workflow orchestrator
  • Monitor runs and alert on failures or data anomalies

Example (illustrative):

import pandas as pd
from sqlalchemy import create_engine

# 1) Ingest
df = pd.read_csv("data/raw/sales.csv", dtype={"id": "int64"}, parse_dates=["order_date"])

# 2) Validate (quick)
assert df["id"].notna().all()
assert (df["amount"] >= 0).all()

# 3) Clean
df.columns = df.columns.str.strip().str.lower()
df["customer"] = df["customer"].str.strip()
df = df.drop_duplicates(subset=["id"])

# 4) Transform
df["year"] = df["order_date"].dt.year
agg = (
    df.query("status == 'completed'")
      .groupby(["year", "region"])
      .agg(total_amount=("amount", "sum"), orders=("id", "nunique"))
      .reset_index()
)

# 5) Persist
agg.to_parquet("data/processed/sales_by_year_region.parquet", index=False)

# 6) Load to DB (optional)
engine = create_engine("postgresql+psycopg2://:@localhost:5432/analytics")
# agg.to_sql("sales_by_year_region", engine, if_exists="replace", index=False)

