Wednesday, June 20, 2012

MapReduce illustrated

This is a great illustration of the MapReduce concept for anyone trying to understand the algorithm intuitively.  I saw it at a Hadoop talk by Salesforce.com.



Basically it's a laundry operation that sorts socks first, then washes them with "like colors", of course.  :)  The sorting tables are essentially the Map step processors, and the washers carry out the Reduce.  One important concept is that the Map usually uses a generic processor that doesn't mind working on any subset of the data; the Reduce step, on the other hand, is usually data-specific, which in this example means red washers only wash red socks.  The whole operation scales horizontally in a near-linear fashion, i.e., just add processing power (people and equipment - tables or washers - in this case) to handle a larger volume.
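
To make the analogy concrete, here's a toy sketch in Apex (the sock labels are made up, nothing Salesforce-specific): the first loop is the generic Map step that buckets any sock by color, and the second loop is the data-specific Reduce step - one "washer" per color.

// Map step: any table (worker) can bucket any sock by its color key.
Map<String, List<String>> socksByColor = new Map<String, List<String>>();
for (String sock : new List<String>{'red-1', 'blue-1', 'red-2', 'blue-2'})
{
    String color = sock.split('-')[0];
    if (!socksByColor.containsKey(color))
    {
        socksByColor.put(color, new List<String>());
    }
    socksByColor.get(color).add(sock);
}
// Reduce step: each "washer" only ever handles its own color bucket.
for (String color : socksByColor.keySet())
{
    System.debug('Washing ' + socksByColor.get(color).size() + ' ' + color + ' socks');
}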

Here's the original talk, with the MapReduce part run by Jed Crosby.

Force.com: The ugly state of job management from the command line

I just couldn't do it in any reasonably good way - at least I haven't found one.  Unfortunately, I often feel the need to do it.  For instance, after creating a new sandbox, one of the clean-up tasks is to de-schedule all the useless jobs so they won't generate unnecessary messages.  In any given production org there might be tens or even hundreds of scheduled jobs.  Do you want to click through them all in the Web UI?  Naturally any admin would think about a script of some sort.  Here's the kicker - I haven't been able to come up with one that meets the following criteria:

  • Reasonably flexible
  • Reasonably powerful
  • Reasonably robust

This is the script I currently use to remove old jobs (mostly copied from RupaliJ's post here).

// Find scheduled jobs that have finished or will never fire again (NextFireTime = null).
List<CronTrigger> crons = [SELECT Id, CronExpression, TimesTriggered, NextFireTime
                           FROM CronTrigger
                           WHERE NextFireTime = null];
for (CronTrigger ct : crons)
{
    try
    {
        // abortJob() counts as a DML statement, so this loop stops
        // working after 150 jobs (the per-transaction DML statement limit).
        System.abortJob(ct.Id);
        System.debug(ct);
    } catch (Exception e)
    {
        System.debug('--- Unable to abort scheduled job ' + ct.Id + ' because ' + e.getMessage());
    }
}

Judged against those criteria, this falls short for the following reasons:
  • It cannot select jobs by the class used, the job name, or the job type.  Since the CronTrigger object doesn't expose those attributes, the script's flexibility to meet different needs is next to none.
  • Since abortJob() is considered a DML statement, the script dies after it has terminated 150 jobs.  You would normally bulkify Apex DML operations, but abortJob() doesn't accept a collection as an argument, so there's no way to do a true bulk abort (a crude workaround is sketched after this list).
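
The best partial workaround I can offer is to cap each run at 150 aborts and simply re-run the anonymous script until the query comes back empty - a crude sketch, not a real fix:

// Abort at most 150 jobs per run to stay under the DML statement limit;
// re-run until this reports fewer than 150 jobs aborted.
List<CronTrigger> crons = [SELECT Id FROM CronTrigger
                           WHERE NextFireTime = null
                           LIMIT 150];
for (CronTrigger ct : crons)
{
    System.abortJob(ct.Id);
}
System.debug(crons.size() + ' jobs aborted this run');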

The need for an efficient way has been made more urgent by this mysterious issue (also here), as you may have to delete and recreate some or all of your scheduled jobs just to deploy a new class (which itself isn't even a Schedulable).

If you have a better approach, I'm definitely all ears.

Wednesday, June 6, 2012

Chunking an Aggregate Query for SFDC batch jobs

One of the serious limits of the Force.com Batchable platform is the strict requirement that the start() call finish within 2 minutes.  If you're dealing with a huge data set on a huge object, you might be in trouble.  The trick is to loosen the scope criteria and let execute() do the further filtering.  The Batchable platform can handle 50 million records, but it's impatient if your scope query can't identify all of them within 2 minutes.
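
Here's a minimal sketch of the trick, assuming a hypothetical Opportunity job where isEligible() stands in for whatever expensive criteria you'd otherwise push into the scope query: start() only applies a cheap, loose filter, and execute() does the real filtering per chunk.

global class LooseScopeBatch implements Database.Batchable<sObject>
{
    global Database.QueryLocator start(Database.BatchableContext bc)
    {
        // Loose criteria only, so the query finishes well within 2 minutes.
        return Database.getQueryLocator(
            [SELECT Id, Amount FROM Opportunity WHERE CloseDate = LAST_N_DAYS:365]);
    }

    global void execute(Database.BatchableContext bc, List<Opportunity> scope)
    {
        for (Opportunity o : scope)
        {
            if (isEligible(o))
            {
                // ... process the record ...
            }
        }
    }

    global void finish(Database.BatchableContext bc) {}

    // Stand-in for the expensive criteria that no longer live in start().
    private Boolean isEligible(Opportunity o)
    {
        return o.Amount != null && o.Amount > 10000;
    }
}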

It's especially hard if you need to work on an aggregated basis, i.e., when the job scope is essentially a bunch of AggregateResults, because cutting the query time gets more complex.  Essentially the same trick still applies, but it needs the additional support of a Database.Stateful data structure.  I had an interesting discussion of this exact scenario on DeveloperForce and provided some sample code; check it out if you're interested.  Details are in my posts at Here-n-now.
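
For illustration, here's a minimal sketch of what I mean, assuming a made-up job that sums Opportunity Amount per Account: instead of asking start() for AggregateResults, the job scans raw rows and accumulates per-key totals across chunks in a member that Database.Stateful preserves.

global class StatefulAggregateBatch implements Database.Batchable<sObject>, Database.Stateful
{
    // Database.Stateful keeps this map alive across execute() chunks.
    private Map<Id, Decimal> totalsByAccount = new Map<Id, Decimal>();

    global Database.QueryLocator start(Database.BatchableContext bc)
    {
        // Raw rows with loose criteria; no GROUP BY, so it stays fast.
        return Database.getQueryLocator(
            [SELECT AccountId, Amount FROM Opportunity WHERE Amount != null]);
    }

    global void execute(Database.BatchableContext bc, List<Opportunity> scope)
    {
        for (Opportunity o : scope)
        {
            Decimal total = totalsByAccount.containsKey(o.AccountId)
                    ? totalsByAccount.get(o.AccountId) : 0;
            totalsByAccount.put(o.AccountId, total + o.Amount);
        }
    }

    global void finish(Database.BatchableContext bc)
    {
        // The full aggregation is only available here, after the last chunk.
        System.debug(totalsByAccount);
    }
}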

Monday, June 4, 2012

Security is not a priority

This is about web site security, specifically the web site I use to register my son for his baseball activity.  There is not much security to speak of, in fact.  You don't need an account login to get in, even though I do have an account/profile since I've registered before.  All I need to do is type in my last name, and all the families registered with the league under that last name are "conveniently" listed, complete with the streets they live on so there's less chance of a mix-up.  Then I can click my entry, see my son's first name, and proceed to register, providing my payment info and his medical insurance info.  Indeed it's a very low-friction process, as long as you're not concerned about your private information being exposed.  In the end I don't know if I should feel sorry for the company that provides the service, or for myself.  Apparently some people still live in a fantasy land where everyone is noble and friendly.

We're not talking about a local, mom-and-pop volunteer organization here.  The service provider supplies its online database service to dozens of local youth sports organizations across the country, according to the proud customer list on its site.  To their credit, PayPal is offered as a payment option, which at least gives you some way to lower the risk.

Sometimes you wonder how illicit online activities based on the theft of private information have grown into a business worth more than 50 billion dollars annually.  Maybe it's really not a wonder - the cost of entry has been made very low in too many circumstances.