From 15c0c099978a1dd50f701994ac9e06670573fc83 Mon Sep 17 00:00:00 2001
From: Ian Gulliver <ian@flamingcow.tv>
Date: Mon, 15 Apr 2019 03:48:38 +0000
Subject: [PATCH] Safe(r) data changes

---
 2011-11-29-safer-data-changes.html        | 62 +++++++++++++++++++++++
 index.html                                |  1 +
 markdown/2011-11-29-safer-data-changes.md | 57 +++++++++++++++++++++
 markdown/index.md                         |  1 +
 4 files changed, 121 insertions(+)
 create mode 100644 2011-11-29-safer-data-changes.html
 create mode 100644 markdown/2011-11-29-safer-data-changes.md
diff --git a/2011-11-29-safer-data-changes.html b/2011-11-29-safer-data-changes.html
new file mode 100644
index 0000000..c289910
--- /dev/null
+++ b/2011-11-29-safer-data-changes.html
@@ -0,0 +1,62 @@
+<!--# set var="title" value="Safe(r) data changes" -->
+<!--# set var="date" value="November 29, 2011" -->
+
+<!--# include file="include/top.html" -->
+
+<p>It's a great goal to avoid making manual changes to your database. It never works 100%, though; there are always software bugs, unexpected interactions and other events that muck up your data, and you have to do one-off corrections. These are inherently hazardous: hard to test for unexpected data corruption, hard to apply consistently and hard to model the application behavior that results from them. Here are some strategies for the first issue, avoiding unexpected data corruption.</p>
+
+<ol>
+<li>Don't run one-off executables against your database. Instead, have the executable print out SQL that it would have run to update the database. If something goes wrong, you don't have to model the behavior of the program; you can just look at the SQL.</li>
+<li>Check the SQL files into source control somewhere. Manual changes tend to breed more manual changes to fix the fixes, so you never know when you'll want a record of what you twiddled in the past.</li>
+<li>Include all fields from the primary key in the WHERE clause. This ensures that each statement modifies at most one row. Even if this results in a huge list of changes, at least you know exactly what changed.</li>
+<li>Include as many additional gating clauses as possible, linked with AND. For example, if you have a table of products and you want to set the price to 0.99 for everything that is currently set to 1.00, do:
+<code>
+UPDATE Products SET Price=0.99 WHERE ProductId=2762 AND Price=1.00;
+</code>
+This ensures that if something else changes Price just before you run your change, you don't destroy that update.</li>
+<li>Record the number of rows affected by each statement, so you can tell afterward which statements changed nothing, for example because a gating clause no longer matched.</li>
+<li>Use transactions sensibly. Overly large groups of statements can block replication, but consider whether your changes will be toxic if partially applied.</li>
+<li>Stop running changes on errors or warnings and let a human examine the output. Warnings like string truncation can be a sign of a broken change.</li>
+</ol>
+
+<p>#4 can be a challenge if your verification statements are complex. Suppose, for example, that you want to update rows in table A only when a particular CustomerId has exactly one row. It's relatively easy to write a SELECT statement to verify this by hand:</p>
+
+<pre><code>SELECT
+    CustomerId,
+    COUNT(CustomerId)
+  FROM A
+  WHERE
+    CustomerId IN (15, 16)
+  GROUP BY CustomerId;
+</code></pre>
+
+<p>To verify this at UPDATE time, however, you need either a subselect or an intermediate table. We'll use the latter:</p>
+
+<pre><code>CREATE TABLE ScratchTable AS
+  SELECT
+      CustomerId,
+      COUNT(CustomerId) AS Customers
+    FROM A
+    WHERE
+      CustomerId IN (15, 16)
+    GROUP BY CustomerId;
+
+UPDATE A
+  JOIN ScratchTable USING (CustomerId)
+  SET Updated=1
+  WHERE
+    A.Id=3
+    AND Customers=1;
+</code></pre>
+
+<p>The same trick works if your data change inserts new rows:</p>
+
+<pre><code>INSERT INTO A (CustomerId)
+  SELECT CustomerId
+    FROM ScratchTable
+    WHERE
+      CustomerId=15
+      AND Customers=1;
+</code></pre>
+
+<!--# include file="include/bottom.html" -->
diff --git a/index.html b/index.html
index 0f6a30f..a5dc2b0 100644
--- a/index.html
+++ b/index.html
@@ -20,6 +20,7 @@
 <li>2016-Feb-15: <a href="2016-02-15-cable-modem-channel-party.html">Cable modem channel party</a></li>
 <li>2016-Feb-01: <a href="2016-02-01-how-to-enrage-your-cable-modem.html">How to enrage your cable modem</a></li>
 <li>2016-Feb-01: <a href="2016-02-01-hall-of-2-4-ghz-shame-2016-edition.html">Hall of 2.4 GHz Shame, 2016 Edition</a></li>
+<li>2011-Nov-29: <a href="2011-11-29-safer-data-changes.html">Safe(r) data changes</a></li>
 <li>2011-Aug-09: <a href="2011-08-09-innodb-as-the-default-table-type.html">InnoDB as the default table type</a></li>
 <li>2011-Aug-08: <a href="2011-08-08-database-best-practices-for-future-scalability.html">Database best practices for future scalability</a></li>
 <li>2011-Jul-12: <a href="2011-07-12-converting-subselects-to-joins.html">Converting subselects to joins</a></li>
diff --git a/markdown/2011-11-29-safer-data-changes.md b/markdown/2011-11-29-safer-data-changes.md
new file mode 100644
index 0000000..fde32bc
--- /dev/null
+++ b/markdown/2011-11-29-safer-data-changes.md
@@ -0,0 +1,57 @@
+<!--# set var="title" value="Safe(r) data changes" -->
+<!--# set var="date" value="November 29, 2011" -->
+
+<!--# include file="include/top.html" -->
+
+It's a great goal to avoid making manual changes to your database. It never works 100%, though; there are always software bugs, unexpected interactions and other events that muck up your data, and you have to do one-off corrections. These are inherently hazardous: hard to test for unexpected data corruption, hard to apply consistently and hard to model the application behavior that results from them. Here are some strategies for the first issue, avoiding unexpected data corruption.
+
+1. Don't run one-off executables against your database. Instead, have the executable print out SQL that it would have run to update the database. If something goes wrong, you don't have to model the behavior of the program; you can just look at the SQL.
+1. Check the SQL files into source control somewhere. Manual changes tend to breed more manual changes to fix the fixes, so you never know when you'll want a record of what you twiddled in the past.
+1. Include all fields from the primary key in the WHERE clause. This ensures that each statement modifies at most one row. Even if this results in a huge list of changes, at least you know exactly what changed.
+1. Include as many additional gating clauses as possible, linked with AND. For example, if you have a table of products and you want to set the price to 0.99 for everything that is currently set to 1.00, do:
+   ```
+   UPDATE Products SET Price=0.99 WHERE ProductId=2762 AND Price=1.00;
+   ```
+   This ensures that if something else changes Price just before you run your change, you don't destroy that update.
+1. Record the number of rows affected by each statement, so you can tell afterward which statements changed nothing, for example because a gating clause no longer matched.
+1. Use transactions sensibly. Overly large groups of statements can block replication, but consider whether your changes will be toxic if partially applied.
+1. Stop running changes on errors or warnings and let a human examine the output. Warnings like string truncation can be a sign of a broken change.
+
+\#4 can be a challenge if your verification statements are complex. Suppose, for example, that you want to update rows in table A only when a particular CustomerId has exactly one row. It's relatively easy to write a SELECT statement to verify this by hand:
+
+    SELECT
+        CustomerId,
+        COUNT(CustomerId)
+      FROM A
+      WHERE
+        CustomerId IN (15, 16)
+      GROUP BY CustomerId;
+
+To verify this at UPDATE time, however, you need either a subselect or an intermediate table. We'll use the latter:
+
+    CREATE TABLE ScratchTable AS
+      SELECT
+          CustomerId,
+          COUNT(CustomerId) AS Customers
+        FROM A
+        WHERE
+          CustomerId IN (15, 16)
+        GROUP BY CustomerId;
+
+    UPDATE A
+      JOIN ScratchTable USING (CustomerId)
+      SET Updated=1
+      WHERE
+        A.Id=3
+        AND Customers=1;
+
+The same trick works if your data change inserts new rows:
+
+    INSERT INTO A (CustomerId)
+      SELECT CustomerId
+        FROM ScratchTable
+        WHERE
+          CustomerId=15
+          AND Customers=1;
+
+<!--# include file="include/bottom.html" -->
diff --git a/markdown/index.md b/markdown/index.md
index fbb7a47..0225a9e 100644
--- a/markdown/index.md
+++ b/markdown/index.md
@@ -19,6 +19,7 @@
 1. 2016-Feb-15: [Cable modem channel party](2016-02-15-cable-modem-channel-party.html)
 1. 2016-Feb-01: [How to enrage your cable modem](2016-02-01-how-to-enrage-your-cable-modem.html)
 1. 2016-Feb-01: [Hall of 2.4 GHz Shame, 2016 Edition](2016-02-01-hall-of-2-4-ghz-shame-2016-edition.html)
+1. 2011-Nov-29: [Safe(r) data changes](2011-11-29-safer-data-changes.html)
 1. 2011-Aug-09: [InnoDB as the default table type](2011-08-09-innodb-as-the-default-table-type.html)
 1. 2011-Aug-08: [Database best practices for future scalability](2011-08-08-database-best-practices-for-future-scalability.html)
 1. 2011-Jul-12: [Converting subselects to joins](2011-07-12-converting-subselects-to-joins.html)