Thursday, October 5, 2017

Release 1.2.0 of nifi-script-tester

I've just released version 1.2.0 of the nifi-script-tester, a utility that lets you test your Groovy, Jython, and JavaScript scripts for use in the NiFi ExecuteScript processor.  Here are the new features:

- Upgraded code to NiFi 1.4.0
- Added support for incoming flow file attributes

For the first point, a lot of refactoring was done in the NiFi scripting NAR in order to reuse code across the various scripting components in NiFi, such as processors, controller services, record readers/writers, and reporting tasks.  Getting the codebase up to date will allow me to add new features, such as the ability to test RecordReader/Writer scripts, ScriptedReportingTask scripts, etc.

For the second point, I'd been asked to add that support for a while, so I finally got around to it :) There is now a new switch, "attrfile", that lets you point to a Java properties file; those properties will be added as attributes to each flow file (whether it comes from STDIN or the -input switch). In the future I hope to add support for Expression Language, and perhaps figure out a way to specify a set of attributes per incoming flow file (rather than reusing one set for all, whether that set supports EL or not). The code is Apache-licensed and on GitHub, and I welcome any pull requests :)

Here is the new usage info:


Usage: java -jar nifi-script-tester-<version>-all.jar [options] <script file>
 Where options may include:
   -success            Output information about flow files that were transferred to the success relationship. Defaults to true
   -failure            Output information about flow files that were transferred to the failure relationship. Defaults to false
   -no-success         Do not output information about flow files that were transferred to the success relationship. Defaults to false
   -content            Output flow file contents. Defaults to false
   -attrs              Output flow file attributes. Defaults to false
   -all-rels           Output information about flow files that were transferred to any relationship. Defaults to false
   -all                Output content, attributes, etc. about flow files that were transferred to any relationship. Defaults to false
   -input=<directory>  Send each file in the specified directory as a flow file to the script
   -modules=<paths>    Comma-separated list of paths (files or directories) containing script modules/JARs
   -attrfile=<path>    Path to a properties file specifying attributes to add to incoming flow files
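
For example, suppose you have a properties file like the following (the file name and attribute values here are hypothetical, just for illustration):

my.schema=person
source.system=test

Then an invocation such as this one would add those two attributes to each incoming flow file before your script sees it:

cat input.json | java -jar nifi-script-tester-1.2.0-all.jar -attrs -attrfile=attrs.properties my_script.groovy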

If anyone gives this a try, please let me know how/if it works for you. As always, I welcome all questions, comments, and suggestions.  Cheers!

Monday, June 5, 2017

InvokeScriptedProcessor template (a faster ExecuteScript)

For quick, easy, and small scripting tasks in Apache NiFi, ExecuteScript is often a better choice than InvokeScriptedProcessor, as there is little-to-no boilerplate code, relationships and properties are already defined and supported, and some objects relevant to the NiFi API (such as the ProcessSession, ProcessContext, and ComponentLog) are already bound to the script engine as variables that can readily be used by the script.

However, one tradeoff is performance; in ExecuteScript, the script is evaluated each time onTrigger is executed. With InvokeScriptedProcessor, as long as the script (or any of the InvokeScriptedProcessor properties) is not changed, the scripted Processor instance is maintained by the processor, and its methods are simply invoked when parent methods such as onTrigger() are called by the NiFi framework.

To get the best of both worlds, I have put together an InvokeScriptedProcessor instance that is configured the same way ExecuteScript is. The "success" and "failure" relationships are provided, the API objects are available, and if you simply paste your ExecuteScript code into the marked spot in the script below, it will behave like a more performant ExecuteScript instance.  The code is as follows:

// NiFi API imports needed by the boilerplate Processor below
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.ValidationContext
import org.apache.nifi.components.ValidationResult
import org.apache.nifi.logging.ComponentLog
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSessionFactory
import org.apache.nifi.processor.Processor
import org.apache.nifi.processor.ProcessorInitializationContext
import org.apache.nifi.processor.Relationship
import org.apache.nifi.processor.exception.ProcessException

////////////////////////////////////////////////////////////
// your imports go here
////////////////////////////////////////////////////////////

class E {
    void executeScript(session, context, log, REL_SUCCESS, REL_FAILURE) {
        ////////////////////////////////////////////////////////////
        // your code goes here
        ////////////////////////////////////////////////////////////
    }
}

class GroovyProcessor implements Processor {
    def REL_SUCCESS = new Relationship.Builder().name("success").description('FlowFiles that were successfully processed are routed here').build()
    def REL_FAILURE = new Relationship.Builder().name("failure").description('FlowFiles that were not successfully processed are routed here').build()
    ComponentLog log
    def e = new E()
    void initialize(ProcessorInitializationContext context) { log = context.logger }
    Set<Relationship> getRelationships() { return [REL_FAILURE, REL_SUCCESS] as Set }
    Collection<ValidationResult> validate(ValidationContext context) { null }
    PropertyDescriptor getPropertyDescriptor(String name) { null }
    void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) { }
    List<PropertyDescriptor> getPropertyDescriptors() { null }
    String getIdentifier() { null }
    void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException {
        def session = sessionFactory.createSession()
        try {
            e.executeScript(session, context, log, REL_SUCCESS, REL_FAILURE)
            session.commit()
        } catch (final Throwable t) {
            log.error('{} failed to process due to {}; rolling back session', [this, t] as Object[])
            session.rollback(true)
            throw t
        }
    }
}

processor = new GroovyProcessor()


The boilerplate Processor implementation is at the bottom, and I've left comment blocks where your imports and code go. With some simple cut-and-paste, you should be able to have a pre-evaluated Processor instance that will run your ExecuteScript code faster than before!
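
As an example, here is a trivial ExecuteScript-style body (my own illustration, not part of the template itself) that you could paste into the executeScript() method; it just adds an attribute and routes the flow file to success:

def flowFile = session.get()
if (!flowFile) return
// session and REL_SUCCESS are the method parameters, just like the ExecuteScript bindings
flowFile = session.putAttribute(flowFile, 'greeting', 'hello')
session.transfer(flowFile, REL_SUCCESS)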

If you give this a try, please let me know how/if it works for you. I am always open to suggestions, improvements, comments, and questions.  Cheers!

Tuesday, March 14, 2017

NiFi ExecuteScript Cookbook

Hello All!  Just wanted to write a quick post here to let you know about a series of articles I have written about ExecuteScript support for Apache NiFi, with discussions of how to do various tasks, and examples in various supported scripting languages. I posted them on Hortonworks Community Connection (HCC).  Full disclosure: I am a Hortonworks employee :)


Lua and ExecuteScript in NiFi (revisited)

I recently fielded a question about using Lua (actually, LuaJ) in NiFi's ExecuteScript processor to manipulate flow files. I had written a basic article on using LuaJ with ExecuteScript, but that example only shows how to create new flow files; it does not address accepting incoming flow files or manipulating their data.

To rectify that I answered the question with the following example script:

flowFile = session:get()
if flowFile == nil then
  return
end

local writecb =
luajava.createProxy("org.apache.nifi.processor.io.StreamCallback", {
    process = function(inputStream, outputStream)
      local isr = luajava.newInstance('java.io.InputStreamReader', inputStream)
      local br = luajava.newInstance('java.io.BufferedReader', isr)
      local line = br:readLine()
      while line ~= nil do
         -- Do stuff to each line here
         outputStream:write(line:reverse())
         line = br:readLine()
         if line ~= nil then
           outputStream:write('\n')
         end
      end
    end
})
flowFile = session:putAttribute(flowFile, "lua.attrib", "my attribute value")
flowFile = session:write(flowFile, writecb)
session:transfer(flowFile, REL_SUCCESS)

Readers of my last LuaJ post will recognize the approach: using luajava.createProxy() to create what is basically an anonymous instance of a NiFi callback interface, then providing (aka "overriding") the requisite interface method (in this case, the "process" method).

The first difference here is that I'm using the StreamCallback interface instead of the OutputStreamCallback from my previous example. You may recall that OutputStreamCallback only allows you to write to a flow file, whereas StreamCallback is for overwriting existing flow file content; it makes both the input stream of the current version of the flow file and the output stream for the next version available in the process() method.

You may also recall that my scripting examples often use Apache Commons' IOUtils class to read the entire flow file content in as a string, then manipulate it after the fact. However, LuaJ has a bug where it only uses the system classloader, and thus won't have access to the additional classes provided by the scripting NAR.  So for this example I am wrapping the incoming InputStream in an InputStreamReader, then a BufferedReader, so I can proceed line by line.  I reverse each line, and if there are lines remaining, I add the newline back to the output stream.

If you are simply reading in the content of a flow file and won't be overwriting it, you can use InputStreamCallback instead of StreamCallback. InputStreamCallback's process() method gives you only the input stream of the incoming flow file. Some people use a session.read() with an InputStreamCallback to handle the incoming flow file(s), then later use a session.write() with an OutputStreamCallback, rather than a single session.write() with a StreamCallback as shown above. The common use case for that alternate approach is when you route the original flow file to an "original" relationship, but also write out new flow files based on the content of the incoming one(s). Many "Split" processors do this.
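
To illustrate that alternate pattern, here is a minimal Groovy sketch of my own (not part of the Lua example above); it reads the incoming flow file line by line, writes each line out as a new flow file with the original as its parent, and then removes the original, since ExecuteScript has no "original" relationship:

import org.apache.nifi.processor.io.InputStreamCallback
import org.apache.nifi.processor.io.OutputStreamCallback

def flowFile = session.get()
if (!flowFile) return

// First pass: read the incoming content into a list of lines
def lines = []
session.read(flowFile, { inputStream ->
    inputStream.eachLine { line -> lines << line }
} as InputStreamCallback)

// Second pass: write each line to a new flow file, with the original as parent
lines.each { line ->
    def child = session.create(flowFile)
    child = session.write(child, { outputStream ->
        outputStream.write(line.bytes)
    } as OutputStreamCallback)
    session.transfer(child, REL_SUCCESS)
}

// No "original" relationship in ExecuteScript, so remove the original flow file;
// with InvokeScriptedProcessor you could define one and transfer it there instead
session.remove(flowFile)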

Anyway, I hope this example is informative and shows how to use Lua scripting in NiFi to perform custom logic.  As always, I welcome all comments, questions, and suggestions.  Cheers!

Friday, January 6, 2017

Inspecting the NAR classloading hierarchy

I've noticed on the NiFi mailing lists and in various places that users sometimes attempt to modify their NiFi installations by adding JARs to the lib/ folder, adding various custom and/or external NARs that don't come with the NiFi distribution, etc.  This can sometimes lead to issues with classloading, which is often difficult for a user to debug. If the same changes are not made across a NiFi cluster, more trouble can ensue.

For this reason, it might be helpful to understand the way NARs are loaded in NiFi. When a NAR is loaded by NiFi, a NarClassLoader is created for it. A NarClassLoader is a URLClassLoader that contains all the JAR dependencies needed by that NAR, such as third-party libraries, NiFi utilities, etc.  If the NAR definition includes a parent NAR, then the NarClassLoader's parent is the NarClassLoader for the parent NAR.  This allows all NARs with the same parent to have access to the same classes, which alleviates certain classloader issues when NARs and utilities talk to each other. One pervasive example is the specification of an "API NAR", such as "nifi-standard-services-api-nar", which enables child NARs to use the same API classes/interfaces.

All NARs (and all child ClassLoaders in Java) have the following class loaders in their parent chain (listed from top to bottom):
  1. Bootstrap class loader
  2. Extensions class loader
  3. System class loader

You can consult the Wikipedia page on the Java ClassLoader for more information on these class loaders, but in the NiFi context just know that the System class loader (aka Application ClassLoader) includes all the JARs from the lib/ folder (but not the lib/bootstrap folder) under the NiFi distribution directory.

To help in debugging classloader issues, either on a standalone node or a cluster, I wrote a simple flow using ExecuteScript with Groovy to send out a flow file per NAR, whose contents include the classloader chain (including which JARs belong to which URLClassLoader) in the form:
<classloader_object>
     <path_to_jar_file>
     <path_to_jar_file>
     <path_to_jar_file>
     ...
<classloader_object>
     <path_to_jar_file>
     <path_to_jar_file>
     <path_to_jar_file>
     ...

The classloaders are listed from top to bottom, so the first will always be the extensions classloader, followed by the system classloader, etc.  The NarClassLoader for the given NAR will be at the bottom.

The script is as follows:

import java.net.URLClassLoader
import org.apache.nifi.nar.NarClassLoaders
import org.apache.nifi.processor.io.OutputStreamCallback

NarClassLoaders.instance.extensionClassLoaders.each { c ->

    // Build the chain of classloaders, from this NAR's classloader up to the root
    def chain = []
    while (c) {
        chain << c
        c = c.parent
    }

    def flowFile = session.create()
    flowFile = session.write(flowFile, { outputStream ->
        // Walk the chain from root down to the NAR, listing each classloader's JARs
        chain.reverseEach { cl ->
            outputStream.write("${cl.toString()}\n".bytes)
            if (cl instanceof URLClassLoader) {
                cl.getURLs().each {
                    outputStream.write("\t${it.toString()}\n".bytes)
                }
            }
        }
    } as OutputStreamCallback)
    session.transfer(flowFile, REL_SUCCESS)
}

The script iterates over all the "Extension Class Loaders" (aka the classloader for each NAR), builds a chain of classloaders starting with the child and adding all the parents, then iterates the list in reverse, printing the classloader object name followed by a tab-indented list of any URLs (JARs, e.g.) included in the classloader.

This can be used in a NiFi flow, perhaps using LogAttribute or PutFile to display the results of each NAR's classloader hierarchy.

Note that these are the classloaders that correspond to a NAR, not the classloaders that belong to instances of processors packaged in the NAR.  Runtime information about the classloader chain associated with a processor instance is something I will tackle in another blog post :)

Please let me know if you find this useful. As always, suggestions, questions, and improvements are welcome.  Cheers!

Wednesday, August 10, 2016

Executing Remote Commands in NiFi with ExecuteScript, Groovy, and Sshoogr

There are already some processors in Apache NiFi for executing commands, such as ExecuteProcess and ExecuteStreamCommand. These allow execution of remote scripts by calling the operating system's "ssh" command with various parameters (such as what remote command(s) to execute when the SSH session is established). Often this is accomplished by writing a bash script with a number of ssh commands (or a single ssh command with a long list of remote commands at the end).
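
For example, such a wrapper might look something like this (an illustrative sketch only; the host and commands are hypothetical):

ssh myuser@remotehost 'hadoop fs -ls /user; echo "Hello World!" > here_i_am.txt'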

As a different approach, you can use ExecuteScript or InvokeScriptedProcessor, along with a third-party library for executing remote commands with SSH. For this post, I will be using Groovy as the scripting language and Sshoogr, which is a very handy SSH DSL for Groovy.

Sshoogr can be installed via SDKMAN! or downloaded (from Maven Central, for example). The key is to have all the libraries (Sshoogr itself and its dependencies) in a directory somewhere. Then you point ExecuteScript's Module Directory property at that directory, and those JARs will be available to the Groovy script.

In order to make this example fairly flexible, I chose a few conventions. First, the commands to be executed are expected to be on individual lines of an incoming flow file. Next, the configuration parameters (such as hostname, username, password, and port) are added to ExecuteScript as dynamic properties. Note that the password property is not sensitive in this case; if you want that, use InvokeScriptedProcessor and explicitly add the properties (specifying password as sensitive).

For Sshoogr to work (at least for this example), it is expected that the RSA key for the remote node is in the NiFi user's ~/.ssh/known_hosts file. Because of OS and other differences, sometimes the SSH connection will fail due to strict host key checking, so in the script we will disable that in Sshoogr.

Now to the script.  We start with the standard code for requiring an input flow file, and then use session.read() to get each line into a list called commands:
def flowFile = session.get()
if(!flowFile) return

def commands = []
// Read in one command per line
session.read(flowFile, {inputStream ->
    inputStream.eachLine { line -> commands << line  }
} as InputStreamCallback)
Then we disable strict host key checking:
options.trustUnknownHosts = true
Then we use the Sshoogr DSL to execute the commands from the flow file:
def result = null
remoteSession {

   host = sshHostname.value
   username = sshUsername.value
   password = sshPassword.value
   port = Integer.parseInt(sshPort.value)
        
   result = exec(showOutput: true, failOnError: false, command: commands.join('\n'))
}
Here we are using the aforementioned ExecuteScript dynamic properties (sshHostname, sshUsername, sshPassword, sshPort); the ExecuteScript configuration later in this post shows them after they have been added.

The exec() method is the powerhouse of Sshoogr. In this case we want the output of the commands (to put in an outgoing flow file). Also, I've set "failOnError" to false just so all commands will be attempted, but it is likely that you will want it set to true, so that if a command fails, none of the following commands will be attempted. The final parameter in this example is "command". If you don't use the "named-parameter" (aka Map) version of exec() and just want to execute the list of commands (without setting "showOutput", for example), then exec() will take a Collection:
exec(commands)
In our case, though, the exec() method will turn the "command" parameter into a string, so instead of passing in the List "commands" directly, we turn it back into a line-by-line string using join('\n').

The next part of the script overwrites the incoming flow file, though we could have created a new one with the original as a parent. If using ExecuteScript and creating a new flow file, be sure to remove the original. If using InvokeScriptedProcessor, you could define an "original" relationship and route the incoming flow file to that.
flowFile = session.write(flowFile, { outputStream ->
    result?.output?.eachLine { line ->
       outputStream.write(line.bytes)
       outputStream.write('\n'.bytes)
    }
} as OutputStreamCallback)
Finally, using result.exitStatus, we determine whether to route the flow file to success or failure:
session.transfer(flowFile, result?.exitStatus ? REL_FAILURE : REL_SUCCESS)
The full script for this example is as follows:
import static com.aestasit.infrastructure.ssh.DefaultSsh.*

def flowFile = session.get()
if(!flowFile) return

def commands = []
// Read in one command per line
session.read(flowFile, {inputStream ->
    inputStream.eachLine { line -> commands << line  }
} as InputStreamCallback)

options.trustUnknownHosts = true
def result = null
remoteSession {

   host = sshHostname.value
   username = sshUsername.value
   password = sshPassword.value
   port = Integer.parseInt(sshPort.value)
        
   result = exec(showOutput: true, failOnError: false, command: commands.join('\n'))
}

flowFile = session.write(flowFile, { outputStream ->
    result?.output?.eachLine { line ->
       outputStream.write(line.bytes)
       outputStream.write('\n'.bytes)
    }
} as OutputStreamCallback)

session.transfer(flowFile, result?.exitStatus ? REL_FAILURE : REL_SUCCESS)
Moving now to the ExecuteScript configuration:

Here I am connecting to my Hortonworks Data Platform (HDP) Sandbox to run Hadoop commands and such. The default settings are shown, and you can see I've pasted the script into Script Body, and my Module Directory is pointing at my Sshoogr download folder. I added this to a simple flow that generates a flow file, fills it with commands, runs them remotely, then logs the output returned from the session.

For this example I ran the following two commands:
hadoop fs -ls /user
echo "Hello World!" > here_i_am.txt
In the log (after LogAttribute runs with payload included), I can see the output of the commands.

And on the sandbox I can see the file has been created with the expected content.

As you can see, this post described how to use ExecuteScript with Groovy and Sshoogr to execute remote commands coming from flow files. I created a Gist of the template, but it was created with a beta version of Apache NiFi 1.0, so it may not load into earlier versions (0.x). However, all the ingredients are in this post, so it should be easy to create your own flow. Please let me know how/if it works for you :)

Cheers!

Thursday, June 9, 2016

Testing ExecuteScript processor scripts

I've been getting lots of questions about how to develop/debug scripts that go into the ExecuteScript processor in NiFi. One way to do this is to add a unit test to the nifi-scripting-processors submodule, and set the Script File property to your test script. However for this you basically need the full NiFi source.

To make things easier, I took a pared-down copy of the ExecuteScript processor (and its helper classes), added nifi-mock as a dependency, and slapped a command-line interface on it. This way, with a single JAR, you can run your script inside a dummy flow containing ExecuteScript.

The result is version 1.1.1 of the NiFi Script Tester utility, on GitHub and Bintray.

Basically it runs your script as a little unit test: you can pipe stdin to become a flow file, or point it at a directory and it will send every file as a flow file, stuff like that. If your script doesn't need a flow file, you can run it without specifying an input dir or piping in stdin; it will run once even without input.  The usage is as follows:
Usage: java -jar nifi-script-tester-<version>-all.jar [options] <script file>
 Where options may include:
   -success            Output information about flow files that were transferred to the success relationship. Defaults to true
   -failure            Output information about flow files that were transferred to the failure relationship. Defaults to false
   -no-success         Do not output information about flow files that were transferred to the success relationship. Defaults to false
   -content            Output flow file contents. Defaults to false
   -attrs              Output flow file attributes. Defaults to false
   -all-rels           Output information about flow files that were transferred to any relationship. Defaults to false
   -all                Output content, attributes, etc. about flow files that were transferred to any relationship. Defaults to false
   -input=<directory>  Send each file in the specified directory as a flow file to the script
   -modules=<paths>    Comma-separated list of paths (files or directories) containing script modules/JARs

As a basic example, let's say I am in the nifi-script-tester Git repo, and I've built using "gradle shadowJar" so my JAR is in build/libs. I can run one of the basic unit tests like so:
java -jar build/libs/nifi-script-tester-1.1.1-all.jar src/test/resources/test_basic.js
Which gives the following output:
2016-06-09 14:20:49,787 INFO  [pool-1-thread-1]                nifi.script.ExecuteScript - ExecuteScript[id=9a507773-86ae-4c33-957f-1c0270302a0e] hello
Flow Files transferred to success: 0
This shows minimal output, just the logging from the framework and the default summary statistic of "Flow Files transferred to success".  Instead let's try a more comprehensive example, where I pipe in a JSON file, run my Groovy script from the JSON-to-JSON conversion post, and display all the attributes, contents, and statistics from the run:
cat src/test/resources/input_files/jolt.json | java -jar build/libs/nifi-script-tester-1.1.1-all.jar -all src/test/resources/test_json2json.groovy
This gives the following (much more verbose) output:
Flow file FlowFile[0,14636536169108_translated.json,283B]
---------------------------------------------------------
FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Jun 09 14:24:50 EDT 2016'
Key: 'lineageStartDate'
	Value: 'Thu Jun 09 14:24:50 EDT 2016'
Key: 'fileSize'
	Value: '283'
FlowFile Attribute Map Content
Key: 'path'
	Value: 'target'
Key: 'filename'
	Value: '14636536169108_translated.json'
Key: 'uuid'
	Value: '84d8d290-fbf9-4d57-aaf7-fd050da40d9f'
---------------------------------------------------------
{
    "Range": 5,
    "Rating": "3",
    "SecondaryRatings": {
        "metric": {
            "Id": "metric",
            "Range": 5,
            "Value": 6
        },
        "quality": {
            "Id": "quality",
            "Range": 5,
            "Value": 3
        }
    }
}

Flow Files transferred to success: 1

Flow Files transferred to failure: 0
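If you'd rather not pipe content in via STDIN, the -input switch works on a whole directory; for example, the following invocation (using paths from the repo's test resources) sends each file in the input_files directory through the script as a separate flow file:

java -jar build/libs/nifi-script-tester-1.1.1-all.jar -all -input=src/test/resources/input_files src/test/resources/test_json2json.groovy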
There are options to suppress or select various things such as relationships, attributes, flow file contents, etc.  As a final example, let's look at the Hazelcast example from my previous post, to see how to add paths (files and directories) to the script tester, in the same way you'd set the Module Directory property in the ExecuteScript processor:

java -jar ~/git/nifi-script-tester/build/libs/nifi-script-tester-1.1.1-all.jar -attrs -modules=/Users/mburgess/Downloads/hazelcast-3.6/lib hazelcast.groovy
And the output:

Jun 09, 2016 2:31:01 PM com.hazelcast.core.LifecycleService
INFO: HazelcastClient[hz.client_0_dev][3.6] is STARTING
Jun 09, 2016 2:31:01 PM com.hazelcast.core.LifecycleService
INFO: HazelcastClient[hz.client_0_dev][3.6] is STARTED
Jun 09, 2016 2:31:01 PM com.hazelcast.core.LifecycleService
INFO: HazelcastClient[hz.client_0_dev][3.6] is CLIENT_CONNECTED
Jun 09, 2016 2:31:01 PM com.hazelcast.client.spi.impl.ClientMembershipListener
INFO:

Members [1] {
	Member [172.17.0.2]:5701
}

Jun 09, 2016 2:31:01 PM com.hazelcast.core.LifecycleService
INFO: HazelcastClient[hz.client_0_dev][3.6] is SHUTTING_DOWN
Jun 09, 2016 2:31:01 PM com.hazelcast.core.LifecycleService
INFO: HazelcastClient[hz.client_0_dev][3.6] is SHUTDOWN
Flow file FlowFile[0,15008187914245.mockFlowFile,0B]
---------------------------------------------------------
FlowFile Attributes
Key: 'entryDate'
	Value: 'Thu Jun 09 14:31:01 EDT 2016'
Key: 'lineageStartDate'
	Value: 'Thu Jun 09 14:31:01 EDT 2016'
Key: 'fileSize'
	Value: '0'
FlowFile Attribute Map Content
Key: 'path'
	Value: 'target'
Key: 'hazelcast.customers.nifi'
	Value: '[name:Apache NiFi, email:nifi@apache.org, blog:nifi.apache.org]'
Key: 'filename'
	Value: '15008187914245.mockFlowFile'
Key: 'hazelcast.customers.mattyb149'
	Value: '[name:Matt Burgess, email:mattyb149@gmail.com, blog:funnifi.blogspot.com]'
Key: 'uuid'
	Value: '0a7c788e-0aef-40e4-b9a6-426f877dbfbe'
---------------------------------------------------------

Flow Files transferred to success: 1
This script would not work without the Hazelcast JARs, so this shows how the "-modules" option is used to add them to the classpath for testing.

The nifi-script-tester only supports JavaScript and Groovy at the moment; including Jython (for example) would increase the JAR's size by 500% :/ Right now it's only 8.6 MB, so a little big but not too bad.

Anyway, I hope this helps; please let me know how/if it works for you!

Cheers,
Matt