Tag: notes

Scala – installation behind a workplace web proxy

I’ve been learning Scala as part of my continuing professional development. Scala is a functional language which runs primarily on the Java Runtime Environment. It is a first class citizen for working with Apache Spark – an important platform for data science. My intention in learning Scala is to get myself thinking in a more functional programming style and to gain easy access to Java-based libraries and ecosystems, typically I program in Python.

In this post I describe how to get Scala installed and functioning on a workplace laptop, along with its dependency manager, sbt. The core issue here is that my laptop at work puts me behind a web proxy so that sbt does not Just Work™. I figure this is a common problem so I thought I’d write my experience down for the benefit of others, including my future self.

The test system in this case was a relatively recent (circa 2015) Windows 7 laptop, I like using bash as my shell on Windows rather than the Windows Command Prompt – I install this using the Git for Windows SDK.

Scala can be installed from the Scala website https://www.scala-lang.org/download/. For our purposes we will use the  Windows binaries since the sbt build tool requires additional configuration to work. Scala needs the Java JDK version 1.8 to install and the JAVA_HOME needs to point to the appropriate place. On my laptop this is:

JAVA_HOME=C:\Program Files (x86)\Java\jdk1.8.0_131

The Java version can be established using the command:

javac –version

My Scala version is 2.12.2, obtained using:

scala -version

Sbt is the dependency manager and build tool for Scala, it is a separate install from:

http://www.scala-sbt.org/0.13/docs/Setup.html

It is possible the PATH environment variable will need to be updated manually to include the sbt executables (:/c/Program Files (x86)/sbt/bin).

I am a big fan of Visual Studio Code, so I installed the Scala helper for Visual Studio Code:

https://marketplace.visualstudio.com/items?itemName=dragos.scala-lsp

This requires a modification to the sbt config file which is described here:

http://ensime.org/build_tools/sbt/

Then we can write a trivial Scala program like:

object HelloWorld {

 

def main(args: Array[String]): Unit = {

 

    println(“Hello, world!”)

 

  }

 

}

And run it at the commandline with:

scala first.scala

To use sbt in my workplace requires proxies to be configured. The symptom of a failure to do this is that the sbt compile command fails to download the appropriate dependencies on first run, as defined in a build.sbt file, producing a line in the log like this:

[error] Server access Error: Connection reset url=https://repo1.maven.org/maven2/net/
sourceforge/htmlcleaner/htmlcleaner/2.4/htmlcleaner-2.4.pom

In my case I established the appropriate proxy configuration from the Google Chrome browser:

chrome://net-internals/#proxy

This shows a link to the pacfile, something like:

http://pac.madeupbit.com/proxy.pac?p=somecode

The PAC file can be inspected to identify the required proxy, in my this case there is a statement towards the end of the pacfile which contains the URL and port required for the proxy:

if (url.substring(0, 5) == ‘http:’ || url.substring(0, 6) == ‘https:’ || url.substring(0, 3) == ‘ws:’ || url.substring(0, 4) == ‘wss:’)

{

return ‘PROXY longproxyhosturl.com :80’;

}

 

These are added to a SBT_OPTS environment variable which can either be set in a bash-like .profile file or using the Windows environment variable setup.

export SBT_OPTS=”-Dhttps.proxyHost=longproxyhosturl.com -Dhttps.proxyPort=80 -Dhttps.proxySet=true”

As a bonus, if you want to use Java’s Maven dependency management tool you can use the same proxy settings but put them in a MAVEN_OPTS environment variable.

Typically to start a new project in Scala one uses the sbt new command with a pointer to a g8 template, in my workplace this does not work as normally stated because it uses the github protocol which is blocked by default (it runs on port 9418). The normal new command in sbt looks like:

sbt new scala/scala-seed.g8

The workaround for this is to specify the g8 repo in full including the https prefix:

sbt new https://github.com/scala/scala-seed.g8

This should initialise a new project, creating a whole bunch of standard directories.

So far I’ve completed one small project in Scala. Having worked mainly in dynamically typed languages it was nice that, once I had properly defined my types and got my program to compile, it ran without obvious error. I was a bit surprised to find no standard CSV reading / writing library as there is for Python. My Python has become a little more functional as a result of my Scala programming, I’m now a bit more likely to map a function over a list rather than loop over the list explicitly.

I’ve been developing intensively in Python over the last couple of years, and this seems to have helped me in configuring my Scala environment in terms of getting to grips with module/packaging, dependency managers, automated doocumentation building and also in finding my test library (http://www.scalatest.org/) at an early stage.

The Logging module in Python

In the spirit of improving my software engineering practices I have been trying to make more use of the Python logging module. In common with many programmers my first instinct when debugging a programming problem is to use print statements (or their local equivalent) to provide an insight into what my program is up to. Obviously, I should be making use of any debugger provided but there is something reassuring about the immediacy and simplicity of print.

A useful evolution of the print statement in Python is the logging module which can be used as a simple print function but it can do so much more: you can configure loggers for different packages and modules whose behaviour can be controlled centrally; you can vary the verbosity of your logging messages. If you decide to switch to logging to a file rather than the terminal this can be achieved too, and you can even post your log messages to a website using HTTPhandler. Obviously logging is about much more than debugging.

I am writing this blog post because, as most of us have discovered, using logging is not quite as straightforward as we were led to believe. In particular you might find yourself in the situation where you feel you have set up your logging yet when you run your code nothing appears in your terminal window. Print doesn’t do this to you!

Loggers are arranged in a hierarchy. Loggers have handlers which are the things that cause a log to generate output to a device. If no log is specified then a default log called the root log is used. A logger has a name and the hierarchy is defined by the dots in the name, all the way “up” to the root logger. Any logger can have a handler attached to it, if no handler is attached then any log message is passed to the parent logger.

A log record has a message (the thing you would have printed) and a “level” which indicates the severity of the message these are specified by integers for which the logging module provides convenient labels. The levels in order of severity are logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL. A log handler will output a message if the level of the message is equal to or more than the level it has been set to. So a handler set to WARNING will show messages at the WARNING, ERROR and CRITICAL levels but not the INFO and DEBUG levels.

The simplest way to use the logging module is to import the library:

import logging

Then carry out some minimal configuration,

logging.basicConfig(level=logging.INFO)

and then put logging.info statements in our code, just as we would have done with print statements:

logging.info("This is a log message that takes a parameter = {}".format(a_parameter_value))

logging.debug, logging.warning, logging.error and logging.critical are used to publish log messages with different levels of severity. These are all convenience methods which remove the need to explicitly give the level as found in the logging.log function:

logging.log(logging.INFO, "This is a log message")

If we are writing a module, or other code that we anticipate others importing and running then we should create a logger using logging.getLogger(__name__) but leave configuring it to the caller. In this instance we use the name of the logger we have created instead of the module level “logging”. So to publish a message we would do:

logger = logging.getLogger(__name__)
logger.info("Hello")

In the module importing this library you would do something like:

import some_library
logging.basicConfig(level=logging.INFO)
# if you wanted to tweak the levels of another logger 
logger = logging.getLogger("some other logger")
logger.setLevel(logging.DEBUG)

basicConfig() configures the root logger which is where all messages end up in the absence of any other handler. The behaviour of logging.basicConfig() is downright obstructive at times. The core of the problem is that it can only be invoked once in a session, any future invocations are ignored. Worse than this it can be invoked implicitly. So if for example you do:

import logging
logging.warning("Hello")

You’ll see a message because secretly logging has effectively run logging.basicConfig(level=logging.WARNING) for you (or something similar). This means that if you were to then naively go ahead and run basicConfig yourself:

logging.basicConfig(level=logging.INFO)

You would see no message when you subsequently ran logging.info(“Hello”) because the “second” invocation of logging.basicConfig is ignored.

We can explicitly set the properties of the root logger by doing:

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

You can debug issues like this by checking the handlers to a logger. If you do:

import logging
lgr = logging.getLogger()
lgr.handlers

You get the empty list []. Issue a logging.warning() message and you see that a handler has been added to the root logger, lgr.handlers() returns something like [<logging.StreamHandler at 0x44327f0>].

If you want to see a list of all the loggers in the hierarchy then do:

logging.Logger.manager.loggerDict

So there you go, the logging module is great – you should use it instead of print. But beware of the odd behaviour of logging.basicConfig() which I’ve spent most of this post griping about. This is mainly so that I have all my knowledge of logging in one place rather than trying to remember which piece of code I pulled off a particular trick.

I used the logging documentation here, blog posts by Fang (here) and Praveen Gollakota (here) and tab completion in the ipython REPL in the preparation of this post.

Git–notes

logo@2xI’ve discovered that my blog is actually a good place to put things I need to remember see, for example, my blog post on running Ubuntu in a VM on Windows 8.

In this spirit here are my notes on using git, the distributed version control system (DVCS). These are things I picked up around the office at ScraperWiki, I wrote something there about the scheme we use for Git. This is more a compendium of useful git commands.

I use Git on both Windows and Ubuntu and I have accounts with both GitHub and Bitbucket. I’ve configured ssh on my Windows and Ubuntu machines and use that for authentication. I Windows I interact with Git using Git Bash.

Installation

On installing Git I do the following setup, obviously using my own name and email:

git config --global user.name "John Doe"
git config --global user.email [email protected]
git config --global core.editor vim

I can list my config settings using:

git config -l

Starting a repo

To start a new repo we do:

git init

These days I feel bereft if I’m not “pushing” my local repository to an online repository like GitHub or BitBucket. To add a remote repository create one using the service of your choice which will probably ask you to do:

git remote add origin [url]

Alternatively you can clone an existing repository into a subdirectory of your current directory with the name of the repo:

git clone [url]

This one clones into current directory, making a mess if that’s not what you intended!

git clone [url] .

A variant, if you are using a repo with submodules in it, :

git clone –recursive [url]

If you forgot to do the above on first cloning then you can do:

git submodule update –init

Adding and committing files

If you’ve started a new repository then need to add some files to track:

git add [filename]

You don’t have to commit all the changes you made since the last commit, you can select them using the -p option

git add –p

And commit them to the repository with a commit command like:

git commit –m [message]

Alternatively you can add the commit message in your favoured editor with the difference from previous commit shown below:

git commit –a –v

I tend to use an remote repository as a backup so I regularly do:

git push origin HEAD

If someone else is working on the same repository as you then things get more complicated but that’s out of the scope of this post.

Undoing things

If you get your commit message wrong you can edit it with:

git commit --amend

If you decide you change your mind about staging a file for commit:

git reset HEAD [filename]

If you change your mind about the modifications you have made to a file since the last commit then you can revert to the last commit using this **destructive** command:

git checkout -- [filename]

You should be careful doing that since it will obliterate any changes you’ve made to a file, even if you saved them from the editor.

Working out where you are

You can list files in the repo with:

git ls-tree --full-tree -r HEAD

The general command for seeing what is going on is:

git status

This tells you if you have made edits which have not been staged, which branch you are on and files which are not being tracked. Whilst you are working you can see the difference from the previous commit using:

git diff

If you’ve already added files to commit then you need to do:

git diff –cached

You can see a list of all your changes using:

git log

This command gives you more information, in a more compact form:

git log --oneline --graph --decorate

is a good way of seeing the status of your branch and the other branches in the repository. I have aliased this log set of options as:

git lg

To do this I added the following to my ~/.gitconfig file:

[alias]
  
        lg = log --oneline --graph --decorate

Once you’ve commited a bunch of changes you might want to push them to a remote server. This pushes to the remote called origin, and HEAD ensures you push to your current branch. HEAD is Git’s shorthand for the latest commit on the current branch:

git push origin HEAD

Branches

The proceeding commands are how you’d work using a single master branch, if you were working alone on something simple, for example. If you are working with other people or on something more complicated then you probably want to work on a branch, you can make a new branch by doing:

git checkout –b [branch name]

You can find out what other branches are available by doing:

git branch –v -a

Once you are on a branch you can commit changes, and push them onto your remote server, just as if you were on the master branch.

Merging and rebasing

The excitement comes when you want to merge your changes onto the master branch or you want to get changes on your own branch made by someone else and pushed to the remote reposition. The quick and dirty way to do this is using

git pull

This does a fetch and merge all at the same time. The better way is to fetch the changes and then merge them:

git fetch –prune –all
git merge origin/master

If you are working with someone else then you may prefer to merge changes onto the master branch by making a pull request on GitHub or BitBucket.

Accepting Pull Requests from Forks

If someone makes a Pull Request based on their forked copy of a repo then you can download for testing by doing:

git fetch origin pull/ID/head:BRANCHNAME