Don't write your test suite to create and destroy databases for each run. Instead, make each test method start a transaction and roll it back. We just made that move at work on a DAL project, and the test suite went from 500+ seconds to run the whole thing down to around 100. It also allowed us to remove a lot of "undo" code in the tests.
This means ensuring your test helpers always connect to their databases on the same connection (transactions are connection-specific). If you're using a connection pool where leased conns are bound to each thread, this means rewriting tests that start new threads (or leaving them "the old way"; that is, create/drop). It also means that, rather than running slightly different .sql files per test or module, you instead have a base of data and allow each test to add other data as needed. If your rollbacks work, these can't pollute other tests.
Obviously, this is much harder if you're doing integration testing of sharded systems and the like. But for application logic, it'll save you a lot of headache to do this from the start.
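As a rough illustration of the begin-then-roll-back pattern, here is a minimal sketch using the standard library's sqlite3 and unittest. The `cart` table, the `RollbackTestCase` base class, and the module-level `conn` are invented for the example; our real DAL suite looks nothing like this, but the shape is the same: one shared connection, base data loaded once, and every test wrapped in a transaction.

```python
import sqlite3
import unittest

# One shared connection: transactions are connection-specific, so every
# test helper has to go through this same object.
# isolation_level=None puts sqlite3 in autocommit mode, so the only
# transactions are the ones we open explicitly with BEGIN.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE cart (id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO cart (id, owner) VALUES (1, 'base')")  # base data


class RollbackTestCase(unittest.TestCase):
    """Hypothetical base class: each test method runs in its own transaction."""

    def setUp(self):
        conn.execute("BEGIN")

    def tearDown(self):
        # Undo whatever the test wrote; no hand-written "undo" code needed.
        conn.execute("ROLLBACK")


class CartTests(RollbackTestCase):

    def test_add_cart(self):
        # Extra data added by this test only.
        conn.execute("INSERT INTO cart (id, owner) VALUES (2, 'temp')")
        count = conn.execute("SELECT COUNT(*) FROM cart").fetchone()[0]
        self.assertEqual(count, 2)

    def test_base_data_only(self):
        # Sees only the base data, whichever order the tests run in.
        count = conn.execute("SELECT COUNT(*) FROM cart").fetchone()[0]
        self.assertEqual(count, 1)
```

Save it as a module and run it with `python -m unittest`. If the rollback in `tearDown` works, the second test passes no matter what the first one inserted.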
Duncan McGreggor writes:
The Twisted source code was specifically designed to be read
(well, the code from the last two years, anyway).
If that were true, then this would not be ('object' graciously donated by me to the Twisted Foundation):
>>> from twisted.web import http
>>> http.HTTPChannel.mro()
[<class 'twisted.web.http.HTTPChannel'>,
<class 'twisted.protocols.basic.LineReceiver'>,
<class 'twisted.internet.protocol.Protocol'>,
<class 'twisted.internet.protocol.BaseProtocol'>,
<type 'object'>,
<class twisted.protocols.basic._PauseableMixin at 0x02ABCB70>,
<class twisted.protocols.policies.TimeoutMixin at 0x02ABC420>,
]
This wouldn't be true either:
$ grep -R "class I.*" /usr/lib/python2.5/site-packages/twisted | wc -l
287
Interfaces are great for development of a framework, but suck for development with a framework. That must be an older rev on my nix box; that number's grown to 380 in trunk! Not all of those are Interfaces, but most are.
Here's my personal favorite:
for tran in 'Generic TCP UNIX SSL UDP UNIXDatagram Multicast'.split():
    for side in 'Server Client'.split():
        if tran == "Multicast" and side == "Client":
            continue
        base = globals()['_Abstract'+side]
        method = {'Generic': 'With'}.get(tran, tran)
        doc = _doc[side]%vars()
        klass = new.classobj(tran+side, (base,),
                             {'method': method, '__doc__': doc})
        globals()[tran+side] = klass
You've got a tough row to hoe, Twisted devs. Good luck.
Marius Gedminas just wrote a post on memory leaks. He could have used Dowser to find the leak more easily, I'll bet.
Dowser is a CherryPy application for monitoring and managing object references in your Python program. Because CherryPy runs everything (even the listening HTTP socket) in its own threads, it's a snap to include Dowser in any Python process. Dowser is also very lightweight (because CherryPy is). Here's how I added it to a Twisted project we're using at work:
...
from twisted.application import service
application = service.Application("My Server")
s.setServiceParent(application)
import cherrypy
from misc import dowser
cherrypy.config.update({'server.socket_port': 8088})
cherrypy.tree.mount(dowser.Root())
cherrypy.engine.autoreload.unsubscribe()
# Windows only
cherrypy._console_control_handler.unsubscribe()
cherrypy.engine.start()
from twisted.internet import reactor
reactor.addSystemEventTrigger('after', 'shutdown', cherrypy.engine.exit)
The lines before 'import cherrypy' already existed and are here just for context (this is a Twisted service.tac module). Let's quickly discuss the new code:
Then browse to http://localhost:8088/ and you'll see pretty sparklines of all the objects. Change the URL to http://localhost:8088/?floor=20 to see graphs for only those types which have 20 or more live objects.

Then, just click on the 'TRACE' links to get lots more information about each object. See the Dowser wiki page for more details and screenshots.
First, a great aphorism from Zed's [Vellum book](http://www.zedshaw.com/projects/vellum/manual-final.pdf) (pdf):
Makefiles are the C programmer’s REPL and interpreter.
He also asks himself:
What’s the minimum syntax needed to describe a build specification?
I predict good things based on the presence of that question alone.
You have my permission to name your next test framework, library, or script "epic" and bill it as "more full of [FAIL] than any other test thingy".
Oh, and http://www.google.com/search?q=epic.py
/me looks in Titus' direction...
Chui's counterpoint pines:
There are some interesting ideas raised in LINQ that even Python developers ought to explore and consider adopting in a future Python.
Python had all this before LINQ in Dejavu and now Geniusql, and more pythonically, to boot. Instead of:
var AnIQueryable = from Customer in db.Customers where
Customer.FirstName.StartsWith("m") select Customer;
you can write:
m_names = Customer.select(
lambda cust: cust.FirstName.startswith("m"))
and instead of:
var AverageRuns =(from Master in this.db.Masters
select Master.Runs).Average()
you can write:
avgruns = Masters.select(lambda m: avg(m.Runs))
Divmod has a development methodology which they call UQDS. It's billed as lightweight, but I've been using it for 6 months now and find it burdensome. The basic flow of UQDS is: make a ticket, do all work in a branch, get a full review, merge to trunk. In theory, this brings the benefits of peer review and fewer conflicts between developers. In practice, however, I've found the following problems:
So, here's my answer:
The goals of XQDS:
The strategy of XQDS:
The flow of XQDS is:
svn switch.
svn up on at least a daily basis, and definitely before committing. Conflicts are resolved as needed with each local copy.
Questions:
Don't you lose the benefit of branching? What if Jethro checks in broken code and goes to lunch?
Don't you lose the benefit of review? Review helps avoid conflict and also teaches the reviewee.
svn up often, and help resolve conflicts.
What if I need to share unfinished code with another developer? Or switch developers mid-feature? Or switch platforms mid-feature? Or switch features mid-developer?
svn switch and a simple folder rename when you want to put aside some work for a while.
Isn't trunk broken more often?
UQDS says it improves information flow to managers. Don't you lose that with XQDS?
Doesn't XQDS require more conflict resolution?
svn switch to a new branch and commit your (now broken) changes. You're going to have to resolve the conflict either way; but XQDS allows you to skip making a branch unless you need it.
One of the saving graces of Memcached is its use of a stable, static address space; that is, client X and client Y can each manipulate datum D as long as they both know its key. But because the space of addresses is static and flat (not hierarchical), it also tends to be huge, and sparse. This can make it difficult to perform set operations, such as invalidation of a class of entries.
For example, let's design a Data Access component which sits in front of a database, and accepts requests using an RPC-style interface. It fetches results from the cache where possible; otherwise, it reads from a database and writes the result to the cache before returning it to the caller. Assume we have multiple Data Access (DAL) servers, multiple Memcached servers, and one or more database servers.
A common set-invalidation scenario involves the caching of lists of items. Let's suppose a web server requests get_items_in_cart(cart_id=13, limit=10, offset=30). The DAL server might translate this into a cache key such as get_items_in_cart::13:10:30. So far so good. But that's just a read; when we add writes to the picture, cache coherency starts to become a problem.
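To make the read path concrete, here's a minimal sketch of a read-through lookup. Everything in it is a stand-in: `cache` is a plain dict playing the part of a memcached client, `query_db` is a made-up database read, and `make_key` is just one way to arrive at the key format shown above.

```python
cache = {}  # stand-in for a memcached client (assumption for the sketch)


def make_key(func_name, *args):
    """Build a flat cache key such as 'get_items_in_cart::13:10:30'."""
    return "%s::%s" % (func_name, ":".join(str(a) for a in args))


def query_db(cart_id, limit, offset):
    """Hypothetical database read: pretend every cart holds items 0..99."""
    all_items = list(range(100))
    return all_items[offset:offset + limit]


def get_items_in_cart(cart_id, limit, offset):
    key = make_key("get_items_in_cart", cart_id, limit, offset)
    if key in cache:
        # Cache hit: no database work at all.
        return cache[key]
    # Cache miss: read from the database, then populate the cache.
    result = query_db(cart_id, limit, offset)
    cache[key] = result
    return result
```

Note that every distinct (cart_id, limit, offset) triple gets its own entry, which is exactly what makes the write side awkward.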
When a webserver asks a DAL component to add an item to cart 13, we need to invalidate (or recalculate) get_items_in_cart::13:10:30. However, it should be readily apparent that we need to invalidate not just a single key, but potentially a whole class of keys, get_items_in_cart::id:limit:offset, and that that set of keys could be very large. Let's conservatively guess:
>>> max_items = 1000
>>> sum([(max_items // n) for n in range(5, 21)])
1507
That is, if we restrict the 'limit' argument to the interval [5, 20], and assume a maximum number of items per cart of 1000, we end up with over 1500 potential cache keys per cart for a single function! And if we have a million carts...? There are several ways we could attack this problem:
1. get_items_in_cart(cart_id, limit, offset), which has 3 arguments, we might instead always assume a page size of 10 and expose get_paged_items_in_cart(cart_id, page), which would result in only 100 potential cache keys per cart, not 1500. We might even go further and expose get_items_in_cart(cart_id) (1 cache key per cart) and let the caller do its own paging. This makes cache invalidation more performant (fewer keys to manage) at the cost of API simplicity and extensibility (because the API is less flexible). So your programmers will scream, possibly silently, when the product team decides they want to change the number of items on a page.
2. get_items_in_cart(cart_id, limit, offset) is cached, the address is registered in a central list. When a class of keys needs to be invalidated, that central registry can then iterate over all seen keys for each class quickly and stably (or return the list so the caller can do it). This can reduce performance with the extra network traffic, is not very scalable, and tacks on a new reliability issue (the directory is a single point of failure, which costs even more performance if made redundant). Also, your system architects will probably leave you threatening notes.
3. get_items_in_cart::13:10:30, we might hash it to get_items_in_cart::13/x, where 'x' is a number from 1 to 4. If we could come up with a reliable hash, we could then invalidate 4 keys, not 1500. This approach can become very complex, and results in cache keys which aren't very transparent (so getting other components to interact with the system is harder). But it doesn't require any extra network calls or new components; each DAL server does its own hash calculations. Your VP of Engineering may stick you with the bill for all those new engineers with Math Ph.D.'s, though.
4. get_items_in_cart::13:10:30 for, say, 2 minutes, then tell memcached to expire that entry after 2 minutes, and don't bother trying to invalidate it on write. This is by far the simplest solution when you can swing it. But your system operators or QA team might call you first when they have to wait for one too many timeouts during a 3 A.M. debugging session.
5. carts table, and use it when forming cache keys. When you update the row, update the version. This can increase the number of database hits if not done properly, but in this example, your callers most likely have already fetched the entire relevant carts row, and might be able to pass the version in the DAL call. However, don't be surprised when your DBA's tell you the HR DB crashed and they lost your timesheets for the past year.
That's all I can think of at the moment. I think I've got some political goodwill among our other programmers at the moment (in that they don't wish me any specific harm, yet), so I may go with solution #1 for now on my current project. But I've also recently been playing architect, so maybe I'll pick solution #2 and just throw away any threatening notes I leave for myself.
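For what it's worth, the version-column idea is the easiest of these to sketch. The following is only an illustration under invented names: `cart_versions` stands in for a version column on the carts table, `cache` is a dict standing in for memcached, and the `v%s` segment in the key is my own formatting choice.

```python
cache = {}                # stand-in for memcached (assumption)
cart_versions = {13: 1}   # stand-in for a version column on the carts table


def versioned_key(cart_id, version, limit, offset):
    """Fold the cart's current version into the cache key."""
    return "get_items_in_cart::%s:v%s:%s:%s" % (cart_id, version, limit, offset)


def add_item_to_cart(cart_id):
    """Hypothetical write: updating the row also bumps its version."""
    cart_versions[cart_id] += 1


# A read under version 1 populates the cache.
old_key = versioned_key(13, cart_versions[13], 10, 30)
cache[old_key] = ["some", "cached", "items"]

# After a write, readers form a different key, so the whole class of
# old entries is simply never asked for again and can age out.
add_item_to_cart(13)
new_key = versioned_key(13, cart_versions[13], 10, 30)
```

Nothing gets deleted on write; one version bump orphans every limit/offset combination for that cart at once. The cost is that each caller has to know the current version before it can form a key.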
ocean has rapidly become my favorite blogger. I even find myself reading nearly every one of his quick links. Here's a gem from a few days ago:
This mismatch between what a person wants, what a tool does, and what a person needs turns out to be very important. It's so important that it has a special name: complexity.
It seems lots of people are using memcached to cache both a set of objects (each with their own key), and also various lists of those objects using different keys. For example, a retailer might cache Item objects, but also want to cache the list of Items in a given Category. The SQL before memcached might look like this:
SELECT * FROM Item WHERE CategoryID = 5;
...whereas with memcached mixed in, the I/O tends to look like this (against an empty cache):
get Item:CategoryID=5
END
SELECT ID FROM Item WHERE CategoryID = 5;
set Item:CategoryID=5 1 300 19
1111,2222,3333,4444
STORED
get Item:1111
END
get Item:2222
END
get Item:3333
END
get Item:4444
END
SELECT * FROM Item WHERE ID IN (1111, 2222, 3333, 4444)
set Item:1111 1 300 58
STORED
set Item:2222 1 300 58
STORED
set Item:3333 1 300 54
STORED
set Item:4444 1 300 80
STORED
That is, fetch the list of ID's from the cache; if not present, fetch it from the DB and store it in the cache (the "300" in the above examples means, "expire in 300 seconds"). Then iterate over the list of ID's and try to fetch each whole object from cache; if any miss, ask the DB for them (in as few operations as possible) and store them in the cache.
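In Python, that flow might look roughly like the sketch below. It is not taken from any real DAL: `cache` is a dict standing in for memcached (so the 300-second expiry is left out), `ITEM_TABLE` fakes the Item table, and `db_queries` only exists to count how often we hit the "database".

```python
cache = {}  # stand-in for memcached; real entries would carry an expiry

# Hypothetical contents of the Item table, keyed by ID.
ITEM_TABLE = {
    1111: {"ID": 1111, "CategoryID": 5},
    2222: {"ID": 2222, "CategoryID": 5},
    3333: {"ID": 3333, "CategoryID": 5},
    4444: {"ID": 4444, "CategoryID": 5},
}
db_queries = []  # log of simulated SQL, so we can see the database traffic


def db_select_ids(category_id):
    db_queries.append("SELECT ID FROM Item WHERE CategoryID = %s" % category_id)
    return sorted(i for i, row in ITEM_TABLE.items()
                  if row["CategoryID"] == category_id)


def db_select_items(ids):
    db_queries.append("SELECT * FROM Item WHERE ID IN (%s)"
                      % ", ".join(str(i) for i in ids))
    return [dict(ITEM_TABLE[i]) for i in ids]


def get_items_by_category(category_id):
    # 1. Fetch the list of IDs from the cache, falling back to the DB.
    list_key = "Item:CategoryID=%s" % category_id
    ids = cache.get(list_key)
    if ids is None:
        ids = db_select_ids(category_id)
        cache[list_key] = ids

    # 2. Try each whole object in the cache, collecting the misses.
    items, missing = {}, []
    for item_id in ids:
        item = cache.get("Item:%s" % item_id)
        if item is None:
            missing.append(item_id)
        else:
            items[item_id] = item

    # 3. Fetch all misses in a single query and cache each one.
    if missing:
        for row in db_select_items(missing):
            cache["Item:%s" % row["ID"]] = row
            items[row["ID"]] = row

    return [items[i] for i in ids]
```

Against an empty cache the first call issues two queries; a second call for the same category issues none.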
Once both the objects and the list-of-id's are cached, subsequent calls to a hypothetical get_items_by_category function should look like this:
get Item:CategoryID=5
sending key Item:CategoryID=5
END
get Item:1111
sending key Item:1111
END
get Item:2222
sending key Item:2222
END
get Item:3333
sending key Item:3333
END
get Item:4444
sending key Item:4444
END
But what happens when you move Item 3333 from Category 5 to Category 6? There are three possibilities:
Item:Category=6 from the DB before the cached Item:Category=5 list expires.
If you're happy with option 1, great! The rest of this discussion probably isn't for you. I'm going to explore three solutions (only one of which I'm happy with) for cases 2 and 3.
So you're not happy with the expiration time of your cached lists, so you've built all that invalidation code. What you may not realize is that you've just reinvented (badly) something databases have had for decades: indices.
An index is usually implemented with a B+-tree, most of the details of which are unimportant for us. What is important is that 1) an index covers a subset of the columns in the table (often a single column), which from now on I'm going to call the index criteria, and 2) each distinct combination of values for the index criteria has its own leaf node in the tree, which contains/points to a list of rows in the table that match that combination. What a mouthful. Another way to say it is that the leaf nodes in the tree look like this for an index over Item.CategoryID:
(2,): [9650, 2304, 22, 50888]
(3,): [323, 3000, 243246, 87346, 6563, 8679]
(5,): [1111, 2222, 3333, 4444]
(6,): [18]
When you ask the database for all Items with a CategoryID of 5, the database traverses the "CategoryID" index tree, finds the leaf node for "5", grabs the list stored at that node, then iterates over the list and yields each full object mentioned therein. This is called an "index scan":
# EXPLAIN SELECT * FROM Items WHERE CategoryID = 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Index Scan using items_categoryid on Items (cost=0.00..29.49 rows=609 width=24)
Index Cond: (CategoryID = 5)
Sound familiar? It's exactly what we're doing by hand with memcached.
Okay, it's not exactly like our memcached example. There are some striking differences.
First, a database index is sparse, in the sense that it doesn't contain leaf nodes for every potential index criteria value, just the concrete values in the table. Our memcached indexing is even sparser: so far it only contains leaf nodes (lists of ID's) for each index scan we've run. If we've only asked for Items in Categories 2 and 5, memcached will only contain nodes for Item:CategoryID=2 and Item:CategoryID=5.
Second, a database index is a full tree, with a root node. What we've done so far in memcached is only store leaf nodes. This will bite us in a moment.
Third, a database index is naturally transactional. When you move Item 3333 from Category 5 to 6, you might execute the SQL statement "UPDATE Item SET CategoryID = 6 WHERE ID = 3333;". The database, being the sole arbiter of Truth, can lock the row, read the old value, lock the index leaf nodes, remove the row from the old leaf node and add it to the new one, write the new value to the table, and unlock, all within a fully-logged transaction (although it can be a lot more complicated than that with ranged index locks, log schemes, and page reallocation schemes). Our memcached implementation so far can't do any of that.
Combining the above differences, we get...a real mess. Specifically, we have to find a way to do all of those index updates.
One way would be to invalidate everything. Call flush_all and be done with it. This can work (poorly) if you've partitioned your data well over multiple memcached servers, but Woe Unto Thee if you're storing, say, cached HTML on the same node.
Another, narrower, solution would be to try to delete all known cached lists for the given criteria. One way to do that would be to attempt to maintain the whole index tree in memcached, not just the leaf nodes. This turns out to be a thorny problem because of the transitory nature of memcached data--what happens when you lose an intermediate node in the index tree? or the root node? You'll fail to update the now-unreachable leaf nodes, but clients will still fetch them and get the stale results.
An even narrower solution would be to try to update just two index leaf nodes for each updated object. For example, if we move Item 3333 from Category 5 to 6, we could try to remove the Item from leaf node "5" and add it to leaf node "6". This can work, but requires keeping the old values around for all indexed columns in your application layer, which a lot of modern DALs and ORMs don't do by default.
I was stuck at this point for a couple of days, until I had an epiphany:
Recall if you will what I said above about the transactional nature of database indices. The DB can remove the row from the old index node(s) and add it to the new index node(s) in one step. We already know we can't do that with memcached, since the "get" and "set" operations aren't atomic anyway (although check-and-set can work around this; see 'cas' in protocol.txt.)
Transactions exist to maintain data integrity; to move The Data from one valid state to another valid state. But don't confuse the technique for the phenomena that technique is trying to prevent. In the case of indices, we use transactions on indices to avoid both 1) reading an object that does not meet the index criteria for the given node, or 2) failing to read an object that does meet the criteria. Databases avoid both scenarios by adding and removing rows from the index atomically.
When dealing with index nodes in memcached, however, the best approach is to separate the two phenomena by adding eagerly and removing lazily:
UPDATE Item SET CategoryID = 6 WHERE ID = 3333", you also cache the object (you were doing that already anyway), but you also append the ID to the list stored at Item:CategoryID=6.
Item:CategoryID=5 do that when it iterates over the objects in the list. If any objects no longer meet the index node criteria (CategoryID=5), they are removed from the list, and the client sets the revised index node back into the cache.
There are several:
table.column = value expression, it maps well, but using joins, boolean operators (like and, or, not) or other arithmetic operators makes it difficult. But then, the major databases have the same issues.
I think this is worth a shot. I'm adding it to Dejavu in a branch, but it's complicated enough that it may not be done for a bit. Comments and suggestions on the approach welcome.
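Here's a bare-bones sketch of the add-eagerly, remove-lazily idea, just to show the moving parts. It is not the Dejavu code: `cache` and `db` are plain dicts standing in for memcached and the Item table, the function names are made up, and it ignores concurrency entirely (a real version would want 'cas' or similar around the list updates). One assumption of mine: if the new index node isn't cached yet, the writer leaves it alone and lets the next reader build it from the DB.

```python
cache = {}                                    # stand-in for memcached
db = {1111: 5, 2222: 5, 3333: 5, 4444: 5}     # Item ID -> CategoryID


def index_key(category_id):
    return "Item:CategoryID=%s" % category_id


def move_item(item_id, new_category):
    """Writer: update the row, cache the object, add eagerly to the new node."""
    db[item_id] = new_category  # the UPDATE statement
    cache["Item:%s" % item_id] = {"ID": item_id, "CategoryID": new_category}
    node = cache.get(index_key(new_category))
    if node is not None and item_id not in node:
        cache[index_key(new_category)] = node + [item_id]
    # The old index node is deliberately left untouched.


def read_category(category_id):
    """Reader: iterate the node, dropping objects that no longer match."""
    key = index_key(category_id)
    ids = cache.get(key)
    if ids is None:
        ids = sorted(i for i, cat in db.items() if cat == category_id)
        cache[key] = ids
    items = []
    for item_id in ids:
        item = cache.get("Item:%s" % item_id)
        if item is None:
            item = {"ID": item_id, "CategoryID": db[item_id]}
            cache["Item:%s" % item_id] = item
        items.append(item)
    # Lazy removal: filter out stale members and write the node back.
    kept = [it for it in items if it["CategoryID"] == category_id]
    if len(kept) != len(items):
        cache[key] = [it["ID"] for it in kept]
    return kept
```

After `move_item(3333, 6)`, the node for Category 6 already lists 3333, while the node for Category 5 still does until somebody next reads it and writes back the trimmed list.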