Add new minimal documentation #8
Signed-off-by: Philippe Ombredanne <[email protected]>
pombredanne committed Sep 18, 2020
1 parent fc215d3 commit b67b96f
Showing 7 changed files with 401 additions and 55 deletions.
8 changes: 8 additions & 0 deletions docs/installation.rst
@@ -13,6 +13,7 @@ On the offline install server:
1. extract the ScanCode.io code
2. install dependencies
3. prepare the database

::

tar -xf scancodeio-1.0.1.tar.gz && cd scancode.io
@@ -29,6 +30,7 @@ Use as a development environment with::

SCANCODEIO_WORKSPACE_LOCATION=/path/to/scancodeio/workspace/ make run


Offline upgrade
---------------

@@ -49,9 +51,15 @@ On the offline install server:
2. extract the new ScanCode.io code
3. install dependencies
4. migrate the database

::

mv scancode.io scancode.io-$(date +"%Y-%m-%d_%H%M")
tar -xf scancodeio-1.0.1.tar.gz && cd scancode.io
make install
make migrate

Next step
---------

- Getting started with Docker image analysis from the command line: `scanpipe-tutorial-1.rst`.
77 changes: 77 additions & 0 deletions docs/introduction.rst
@@ -0,0 +1,77 @@
Why ScanCode.io
===============

Modern software is built from many open source packages assembled with new code.
Knowing which free and open source software (FOSS) packages are in use matters because:

- knowing the license of third-party code is required before using it, and
- you want to avoid using buggy, outdated or vulnerable components.

Because it is so easy to include and reuse new code downloaded from the internet,
it is often surprisingly hard to get a proper inventory of all the third-party
code origins and licenses used in a software project.
There are some great tools available to scan your code and help uncover these origins and licenses.

And when you reuse only a few FOSS components in a single project, running one
of these tools (such as the scancode-toolkit) by hand together
with a spreadsheet may be enough to manage your software composition analysis.

But when you scale up, running automated and reproducible analysis pipelines
that are adapted to a software project's unique context and technology platform
is difficult. This requires deploying and running multiple specialized tools
and merging their results with a consistent workflow.

And now that reusing thousands of open source packages is commonplace, code
scan pipelines need to be scripted as code and run on servers backed by a
database, not on a laptop.

For instance, when you analyze Docker container images, there could be hundreds
to thousands of system packages (such as Debian, RPM, Alpine) and application
packages (such as npm, PyPI, Rubygems, Maven) installed in an image side-by-side
with your own code.

Taking care of all these can be hard. ScanCode.io can help organize these
complex code analyses as scripted pipelines and store their results in a
uniform database for automated code analysis.


What is ScanPipe
----------------

ScanPipe is a developer-friendly framework and application that helps software
analysts and engineers build and manage real-life software composition analysis
projects as scripted pipelines.

ScanPipe was originally developed to help boost productivity of code analysts
who work on a wide variety of software composition analysis projects.

ScanPipe provides a unified framework for the infrastructure that is
required to execute and organize these software composition analysis projects.


Should I Use ScanPipe?
----------------------

If you are working on a software composition analysis project, or you
are planning to start a new one, consider the following questions:

1. **Automation**: Is this project part of a larger compliance program and process (as opposed to a one-off) and do you need automation?
2. **Complexity**: Does the project use many third-party components or technologies?
3. **Reproducibility**: Is it important that results are reproducible, traceable and auditable?

If you answered "yes" to any of the above, keep reading: ScanPipe can help you.
If the answer is "no" to all of the above, which is a valid scenario, e.g. when
you are doing a small-scale analysis, ScanPipe may provide only limited benefits for you.

The first set of available pipelines helps automate the analysis of Docker
"container" images and virtual machine (VM) disk images that often harbor
comprehensive software stacks from an operating system with its kernel through
system and application packages to original and custom applications.


Next step
---------

- Install ScanCode.io: `installation.rst`.

.. Some of this documentation is borrowed from the metaflow documentation and is also under Apache-2.0
.. Copyright (c) Netflix
19 changes: 19 additions & 0 deletions docs/scanpipe-api.rst
@@ -0,0 +1,19 @@
ScanPipe JSON REST API
======================


To get started locally with the API:

1. run the server with::

make run

2. open your web browser at http://127.0.0.1:8001/

3. visit the projects API endpoint at http://127.0.0.1:8001/api/projects/

From the bottom of this page, you can create a new project, upload an input
file, and add a pipeline to this project all at once.

If you add a pipeline, the pipeline starts immediately on project creation.
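
Creating a project can also be scripted against this endpoint. Below is a
minimal sketch using the Python `requests` library; the `upload_file` and
`pipeline` field names are assumptions based on the form described above, not
a confirmed API contract::

    import requests

    # Create a project named "foo", upload an input file, and add a pipeline
    # in a single request (field names are assumptions).
    with open("alpine-base.tar", "rb") as upload:
        response = requests.post(
            "http://127.0.0.1:8001/api/projects/",
            data={"name": "foo", "pipeline": "scanpipe/pipelines/docker.py"},
            files={"upload_file": upload},
        )
    print(response.status_code, response.json())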

126 changes: 126 additions & 0 deletions docs/scanpipe-command-line.rst
@@ -0,0 +1,126 @@
ScanPipe Commands Help
======================

The main entry point is the `scanpipe` command, which is available directly
when the virtualenv is activated, or at this path: `<scancode.io root dir>/bin/scanpipe`.


`$ scanpipe --help`
-------------------

List all the sub-commands available (including Django built-in commands).
ScanPipe's own commands are listed under the `[scanpipe]` section.

For example::

$ scanpipe --help
...
[scanpipe]
add-input
add-pipeline
create-project
graph
output
run
...


`$ scanpipe <subcommand> --help`
--------------------------------

Display help for the provided subcommand.

For example::

$ scanpipe create-project --help
usage: scanpipe create-project [-h] [--pipeline PIPELINES] [--input INPUTS]
                               [--version] [-v {0,1,2,3}]
                               [--settings SETTINGS] [--pythonpath PYTHONPATH]
                               [--traceback] [--no-color] [--force-color]
                               [--skip-checks]
                               name

Create a ScanPipe project.

positional arguments:
  name                  Project name.


`$ scanpipe create-project <name>`
----------------------------------

Create a ScanPipe project using <name> as the project name. The name must
be unique.

optional arguments:

- `--pipeline PIPELINES` Pipeline locations to add to the project. The
  pipelines are added and will run in the order of the provided options.

- `--input INPUTS` Input file locations to copy into the `input/` workspace directory.


`$ scanpipe add-input --project PROJECT <input ...>`
----------------------------------------------------

Copy the file found at the <input> path to the `input/` workspace directory of
the project named <PROJECT>. You can use more than one <input> to copy multiple files at once.

For example, assuming you have created a project named foo beforehand, this will
copy `~/docker/alpine-base.tar` to the foo project input directory::

$ scanpipe add-input --project foo ~/docker/alpine-base.tar


`$ scanpipe add-pipeline --project PROJECT <pipeline ...>`
----------------------------------------------------------

Add the <pipeline> found at this location to the project named <PROJECT>.
You can use more than one <pipeline> to add multiple pipelines at once.
The pipelines are added and will run in the order of the provided options.

For example, assuming you have created a project named foo beforehand, this will
add the docker pipeline to your project::

$ scanpipe add-pipeline --project foo scanpipe/pipelines/docker.py


`$ scanpipe run --project PROJECT`
----------------------------------

Run all the pipelines of the project named <PROJECT>.


`$ scanpipe run --project PROJECT --show`
-----------------------------------------

List all the pipelines added to the project named <PROJECT>.



`$ scanpipe output --project PROJECT <output_file>`
---------------------------------------------------

Output the results of the project named <PROJECT> to the <output_file> as JSON.
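
The resulting JSON file can then be post-processed with any scripting language.
A hedged Python sketch; the top-level `packages` key and the per-package
`purl` and `license_expression` keys are assumptions, not a documented schema::

    import json

    # Load the JSON results written by the output command.
    with open("results.json") as results_file:
        results = json.load(results_file)

    # Print a short inventory of the discovered packages.
    for package in results.get("packages", []):
        print(package.get("purl"), package.get("license_expression"))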



`$ scanpipe graph <pipeline ...>`
---------------------------------

Generate a pipeline graph image as PNG (using Graphviz). The image file is
named after the pipeline, with a .png extension.

optional arguments:

- `--output OUTPUT` Alternative output directory location to use. The
default is to create the image in the scancode.io root directory.


Next step
---------

- Explore ScanPipe Concepts: `scanpipe-concepts.rst`.



88 changes: 88 additions & 0 deletions docs/scanpipe-concepts.rst
@@ -0,0 +1,88 @@
ScanPipe Concepts
=================

Project
-------

A project encapsulates the analysis of software code:

- it has a workspace, which is a directory that contains the software code files under analysis
- it is related to one or more code analysis pipeline scripts that automate its analysis
- it tracks the project's Codebase Resources, e.g. its code files and directories
- it tracks the project's Discovered Packages, e.g. the origin and license of the system and application packages discovered in the codebase

Multiple analysis pipelines can be run on a single project.

In the database, a project is identified by its unique name.


Project workspace
-----------------

A project workspace is the root directory where all the project files are stored.

The following directories exist under this directory:

- `input/` contains all the original uploaded and input files used for the project. For instance, it could be a codebase archive.
- `codebase/` contains the files and directories (aka resources) tracked as CodebaseResource records in the database.
- `output/` contains all output files created by the pipelines: reports, scan results, etc.
- `tmp/` is a scratch pad for temporary files generated during the pipelines runs.


Pipelines
---------

A pipeline is a Python script that contains a series of steps, run in order
from start to end, to perform a code analysis.

It usually starts from the uploaded input files, and may extract these and
then generate CodebaseResource records in the database accordingly.

Those resources can then be analyzed, scanned, and matched as needed.
Analysis results and reports are eventually posted at the end of the pipeline run.

For now, all pipelines are located in the `scanpipe.pipelines` module.
Each pipeline consists of a Python script including one subclass of the `Pipeline` class.
Each step is a method of the `Pipeline` class decorated with the `@step` decorator.
At its end, a step states which step to execute next, as sketched below.

One or more pipelines can be assigned to a project as a sequence.
When one pipeline of a sequence completes successfully, the next pipeline in
the queue for this project is run automatically, until all pipelines are executed.
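
Here is a minimal, hypothetical pipeline sketch. The import path, the
`Pipeline` base class, the `@step` decorator and the `self.next()` call follow
the description above but are illustrative, not the exact API::

    # Illustrative only: names follow the description above, not a confirmed API.
    from scanpipe.pipelines import Pipeline, step


    class ExamplePipeline(Pipeline):
        """Extract the inputs, then scan the resulting codebase."""

        @step
        def start(self):
            # Extract the uploaded input files and create the corresponding
            # CodebaseResource records in the database.
            self.next(self.scan_codebase)  # state which step to execute next

        @step
        def scan_codebase(self):
            # Analyze, scan and match the resources as needed.
            self.next(self.end)

        @step
        def end(self):
            # Analysis results and reports are posted at the end of the run.
            pass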


Codebase Resources
------------------

A project's Codebase Resources are records of its code files and directories.
CodebaseResource is a database model and each record is identified by its path
under the project workspace.

Some of the interesting CodebaseResource attributes are:

- a status, used to track the analysis status for this resource
- a type (such as file, directory or symlink)
- various attributes to track detected copyrights, license expressions, copyright holders and related packages

In general, the attributes and their names are the same as those used in ScanCode-Toolkit for files.
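
Since CodebaseResource is a Django model, records can be queried with the
standard ORM. A hedged sketch, assuming the model is importable from a
`scanpipe.models` module (an assumption) and using the attributes listed above::

    from scanpipe.models import CodebaseResource  # assumed module path

    # All file resources of the "foo" project without an analysis status yet.
    resources = CodebaseResource.objects.filter(
        project__name="foo",
        type="file",
        status="",
    )
    for resource in resources:
        print(resource.path)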


Discovered Packages
-------------------

A project's Discovered Packages are records of the system and application packages
discovered in its code.
DiscoveredPackage is a database model and each record is identified by its Package URL.
Package URL is a grassroots effort to create informative identifiers for software
packages such as Debian, RPM, npm, Maven and PyPI packages, for example
`pkg:npm/[email protected]`. See https://github.com/package-url for details.


Some of the interesting DiscoveredPackage attributes are:

- type, name, version (all Package URL attributes)
- homepage_url, download_url and other URLs
- checksums (such as SHA1, MD5)
- copyright, license_expression, declared_license


In general, the attributes and their names are the same as those used in ScanCode-Toolkit for packages.
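
For instance, a Package URL can be built and parsed with the separate
`packageurl-python` library; the example value below is illustrative::

    from packageurl import PackageURL

    # Parse a Package URL string into its components.
    purl = PackageURL.from_string("pkg:deb/debian/[email protected]")
    print(purl.type, purl.namespace, purl.name, purl.version)
    # deb debian curl 7.50.3-1

    # Serialize back to the canonical string form.
    print(purl.to_string())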
