SYSTEM_IDENTITY

ABOUT_DERPDATA

DerpData is a fake-data platform for developers who need output that survives a human glance test. It is a solo project, built and maintained by one indie developer, focused on API-first workflows, schema-driven generation, and practical privacy tooling.

WHY_MARKOV

MARKOV_CHAINS_OVER_STATIC_LISTS

Static lists and rigid templates are fast, but they leak patterns: repeated phrasing, flat text rhythm, and combinations that look synthetic. For text-heavy generators, DerpData uses Markov chains, so output follows token transitions learned from domain corpora.

That means generated names, addresses, company strings, product blurbs, and medical-style notes are not copied from a dictionary row. Each token is sampled from a probability distribution conditioned on the tokens before it. The result is varied text that stays statistically plausible and reads as less templated.
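
To make the contrast with static lists concrete, here is a minimal Python sketch of why list-driven templates repeat: every possible output can be enumerated in advance, so the output space is capped by the product of the list sizes. The word lists and the `template_name` helper are invented for this illustration and are not DerpData code.

```python
import itertools
import random

# Hypothetical word lists, invented for this illustration.
FIRST = ["Ana", "Ben", "Cy"]
LAST = ["Lee", "Ortiz", "Shah"]


def template_name(rng: random.Random) -> str:
    """Fill a fixed 'FIRST LAST' template from static lists."""
    return f"{rng.choice(FIRST)} {rng.choice(LAST)}"


rng = random.Random(0)
samples = [template_name(rng) for _ in range(1000)]

# Every possible output can be listed up front: 3 * 3 = 9 strings.
all_possible = {f"{first} {last}" for first, last in itertools.product(FIRST, LAST)}
distinct = set(samples)

print(len(all_possible))         # 9
print(distinct <= all_possible)  # True: 1000 draws never leave the 9 combinations
```

Larger lists raise the cap but do not remove it, which is the pattern leak described above.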

USE_CASES

WHAT_THIS_SOLVES

  • Dev and QA teams can run realistic staging tests with data that behaves like production records, without exposing production users.
  • Privacy programs can mask sensitive datasets for GDPR and HIPAA workflows while keeping schema shape and business logic intact.
  • ML teams can create PII-free training and evaluation datasets where text fields still look natural enough for model behavior testing.
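
As a rough illustration of masking that keeps schema shape and business logic intact, the sketch below swaps sensitive string fields for deterministic pseudonyms. It is an assumption-laden example, not DerpData's masking implementation: `SENSITIVE_FIELDS`, `SECRET_KEY`, `pseudonym`, and `mask_record` are hypothetical names, and a real workflow would likely substitute realistic generated values rather than hash-based tokens.

```python
import hashlib
import hmac

# Hypothetical field list and key, invented for this illustration.
SENSITIVE_FIELDS = {"name", "email"}
SECRET_KEY = b"demo-key-not-for-production"


def pseudonym(value: str, field: str) -> str:
    """Map a value to a stable token: the same input always yields the same token."""
    digest = hmac.new(SECRET_KEY, f"{field}:{value}".encode("utf-8"), digestmod=hashlib.sha256)
    return f"{field}_{digest.hexdigest()[:12]}"


def mask_record(record: dict) -> dict:
    """Replace sensitive string fields; keep keys, key order, and all other values."""
    return {
        key: pseudonym(val, key) if key in SENSITIVE_FIELDS and isinstance(val, str) else val
        for key, val in record.items()
    }


orders = [
    {"order_id": 1, "name": "Jane Roe", "email": "jane@example.com", "total": 42.5},
    {"order_id": 2, "name": "Jane Roe", "email": "jane@example.com", "total": 13.0},
]
masked = [mask_record(order) for order in orders]
```

Because the mapping is deterministic, both masked orders still point at the same masked customer, so joins and per-customer aggregates keep working while the original name and email no longer appear.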

FAQ

Is this free to use?

Yes. There is a free tier for manual generation and basic API usage. Higher limits, team features, and production workflows are available on paid plans.

What makes DerpData different from Mockaroo or Faker.js?

Most generators are list-driven or template-driven. DerpData also uses Markov models for text-heavy fields, so names, addresses, notes, and descriptions follow realistic token transitions instead of repeating obvious patterns.

Can I use this for ML training data?

Yes, if your pipeline needs realistic structure without real PII. Teams use DerpData to build synthetic corpora for training, evaluation, and red-team tests before touching regulated data.

Does the API have rate limits?

Yes. Limits depend on plan and endpoint. Anonymous traffic gets stricter limits, and high-volume workflows should use API keys and batch their requests.
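
One common way to combine batching with rate limits is to send fixed-size batches and retry with exponential backoff when a request is rejected. The sketch below is generic Python, not the DerpData client: `RateLimited`, `batches`, and `send_with_backoff` are hypothetical names, and `send` stands in for whatever function actually performs the API call.

```python
import time
from typing import Callable, Iterator, List, Sequence


class RateLimited(Exception):
    """Hypothetical error a client wrapper might raise on an HTTP 429 response."""


def batches(items: Sequence[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def send_with_backoff(
    send: Callable[[List[dict]], None],
    batch: List[dict],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call `send(batch)`, retrying with exponential backoff when rate limited."""
    for attempt in range(max_retries + 1):
        try:
            send(batch)
            return
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the retry logic can be tested without real waiting. If the service returns a retry hint in its response, honoring that value is usually preferable to a fixed schedule.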

Can I self-host this?

Today, the managed version is the primary product. Self-hosting is available on request for teams that require a private deployment in a controlled environment.

What formats does export support?

JSON, CSV, SQL, XML, YAML, and JSONL are supported, depending on the endpoint. Schema Builder and masking workflows expose the common formats directly in the UI.
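
For readers unsure how these formats differ, here is a small Python sketch that serializes the same records as JSON, JSONL, and CSV using only the standard library. It illustrates the formats themselves, not DerpData's export code; the sample records and the `to_json`, `to_jsonl`, and `to_csv` helpers are made up for the example.

```python
import csv
import io
import json

# Hypothetical sample records, invented for this illustration.
records = [
    {"id": 1, "name": "Ana Lee", "city": "Porto"},
    {"id": 2, "name": "Ben Shah", "city": "Leeds"},
]


def to_json(rows: list) -> str:
    """One JSON document holding an array of objects."""
    return json.dumps(rows, indent=2)


def to_jsonl(rows: list) -> str:
    """One JSON object per line, convenient for streaming and ML pipelines."""
    return "\n".join(json.dumps(row) for row in rows)


def to_csv(rows: list) -> str:
    """Header row from the first record's keys, then one line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(records).splitlines()[0])  # id,name,city
```

JSON and JSONL keep value types (the `id` stays a number), while CSV flattens everything to text, which matters when the output feeds a typed schema.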

Is the generated data truly random or does it follow patterns?

It follows statistical patterns on purpose. Purely random output looks fake to both people and systems. DerpData uses controlled randomness with corpus-informed probabilities so results stay varied and plausible.

How does the Markov engine work?

The engine tokenizes training corpora and builds transition probabilities between token sequences. Generation then walks that state graph with weighted sampling, producing text that mirrors the distribution and flow of the corpus without copying source records.
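
The steps above (tokenize, count transitions, walk with weighted sampling) can be sketched in a few lines of Python. This is a first-order, word-level toy under stated assumptions, not the DerpData engine: the corpus, the `START`/`END` markers, and the `build_transitions` and `generate` functions are all invented for illustration, and the real engine's tokenization and model order are not described in this document.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers

# Tiny made-up corpus standing in for a domain corpus.
CORPUS = [
    "patient reports mild headache",
    "patient reports mild nausea",
    "patient denies chest pain",
    "patient denies mild nausea",
]


def build_transitions(corpus):
    """Count how often each token follows each token (first-order model)."""
    transitions = defaultdict(Counter)
    for line in corpus:
        tokens = [START] + line.split() + [END]
        for current, nxt in zip(tokens, tokens[1:]):
            transitions[current][nxt] += 1
    return transitions


def generate(transitions, rng, max_tokens=20):
    """Walk the state graph, sampling each next token in proportion to its count."""
    state, output = START, []
    for _ in range(max_tokens):
        candidates = transitions[state]
        nxt = rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]
        if nxt == END:
            break
        output.append(nxt)
        state = nxt
    return " ".join(output)


model = build_transitions(CORPUS)
rng = random.Random(42)
for _ in range(3):
    print(generate(model, rng))
```

With this corpus the walk can produce "patient denies mild headache", a line that appears nowhere in the source, because "mild" can be followed by either "headache" or "nausea". The toy can also reproduce a corpus line verbatim; it includes no guard against that, so avoiding source copies in practice would depend on corpus size and additional checks not shown here.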