
ETL tools (particularly DataStage) supporting agile practices

  • martin_j_andrews
    Message 1 of 6 , Mar 20 4:36 AM
      I posted this message to the extremeprogramming group, and someone
      suggested I try here as well.

      I've recently joined the data warehouse group in a banking
      organisation as an agile coach. My background is mostly web
      application development in Java and Ruby, so the technical domain is
      quite new to me. The group is currently evaluating some new ETL tools
      for use in its work.

      Yesterday, I sat in on a demonstration from IBM of its ETL tool called
      DataStage. It's supposed to be a market leader in this space. I
      asked some simple (or so I thought) questions about unit/integration
      testing, source control & continuous integration. The answers turned
      out to be remarkably bad.

      Source control was the worst. The tool uses a custom repository of
      code on a server. No local copies. Pessimistic locking on all
      'jobs'. No version history at all. You can see the date and user
      that created the job, and the date and user of last change, but that's
      it. Any time a job is saved, all previous changes are lost.

      Automated unit testing was almost as bad. Everything seems to be GUI
      driven in the fat client app, which makes automation rather difficult.
      It seems quite difficult to break up the pieces of a job for unit
      tests, and even a little bit tricky to stub in different data sources
      for integration tests.
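
For readers coming from application development, the kind of stubbing meant here can be sketched in plain Python. This is only an illustration and has nothing to do with DataStage itself: `clean_accounts`, `run_job`, `stub_source` and the row fields are hypothetical names. The point is that the job receives its source and target as parameters, so a test can swap in canned rows.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, str]


def clean_accounts(rows: Iterable[Row]) -> Iterator[Row]:
    """Hypothetical transform step: drop rows without an id, tidy the rest."""
    for row in rows:
        if not row.get("account_id"):
            continue  # skip rows that have no usable key
        yield {
            "account_id": row["account_id"].strip(),
            "name": row.get("name", "").strip().upper(),
        }


def run_job(read_source: Callable[[], Iterable[Row]],
            write_target: Callable[[Row], None]) -> int:
    """Wire source -> transform -> target; both ends are injected."""
    count = 0
    for row in clean_accounts(read_source()):
        write_target(row)
        count += 1
    return count


def stub_source() -> Iterable[Row]:
    """Canned rows standing in for a real database or file source."""
    return [
        {"account_id": " 42 ", "name": "alice"},
        {"account_id": "", "name": "no id"},
    ]


# Unit-test style check: the target is just a list collecting rows.
loaded = []
assert run_job(stub_source, loaded.append) == 1
assert loaded == [{"account_id": "42", "name": "ALICE"}]
```

With a tool whose jobs only exist inside a GUI repository, there is no obvious seam like `read_source` to replace, which is the difficulty described above.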

      Those two issues combined make continuous integration practically
      impossible.

      I was flabbergasted at how badly this turned out, particularly in
      relation to source control.

      Does anyone have experience with other ETL tools that differ? Are
      there any viable alternatives in this space that would support
      agile development better? Maybe I'm even giving DataStage a poor
      review because I don't understand the 'normal' development style?

      -- Marty Andrews
    • Adrian Walker
      Message 2 of 6 , Mar 26 12:22 PM
        Hi Marty --

        You wrote...

        *Is there some viable alternative in this space that would support the
        agility of development better?*

        There is some emerging technology online at the site below.

        It's a kind of Wiki for specifying database applications, including ETL, as
        business rules in open vocabulary, executable English. Shared use is free,
        by pointing browser to the site below.

        From the rules, the system automatically generates and runs SQL that would
        be too complex to write reliably by hand. It can explain the results, in
        English, at the business level.

        A simple example is

        www.reengineeringllc.com/demo_agents/ETL0.agent

        Some background is in:


        www.reengineeringllc.com/A_Wiki_for_Business_Rules_in_Open_Vocabulary_Executable_English.pdf


        www.reengineeringllc.com/Oil_Industry_Supply_Chain_by_Kowalski_and_Walker.pdf

        www.reengineeringllc.com/ibldrugdbdemo1.htm (Flash video with audio)

        Apologies if you have seen this before, and thanks for comments.

        -- Adrian

        Internet Business Logic
        A Wiki and SOA Endpoint for Executable Open Vocabulary English over SQL
        Online at www.reengineeringllc.com Shared use is free

        Adrian Walker
        Reengineering





      • John Griffin
        Message 3 of 6 , Mar 27 6:06 PM
          Hi Marty,
          I've had similar observations about BI tools generally with regard to source
          control integration and agile methodologies. You'll find that the repository
          model, which handles code migration from dev to QA to prod inside the
          tool (so to speak), is common to many of them, sometimes w/ internal
          versioning options.

          That aside, I like Informatica PowerCenter for ETL. It's got a
          respectable and growing install base w/ top-tier companies, including
          financial services; it works, and it integrates nicely w/ source control.
          All objects can be exported as XML, and any operation you'd want to perform
          on the metadata repository (e.g. export) can be scripted. I've gone much of
          the way in automating PowerCenter integration w/ source control and it was
          painless. There are some gotchas around redundancies in the way
          Informatica exports certain properties, but those are identifiable
          and manageable.
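
As a rough illustration of the export-then-version workflow, here is a minimal Python sketch of one clean-up step you might run on an exported XML file before committing it. It assumes the noise in the export takes the form of attributes that change on every run; that is an assumption for the example, not a description of Informatica's actual format. `EXPORT_DATE`, `LAST_SAVED` and `normalise_export` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute names that change on every export. Replace them
# with whatever actually varies in your tool's XML.
VOLATILE_ATTRS = frozenset({"EXPORT_DATE", "LAST_SAVED"})


def normalise_export(xml_text: str, volatile=VOLATILE_ATTRS) -> str:
    """Strip volatile attributes so two exports of an unchanged object
    produce identical text and therefore no spurious diff."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root and all descendants
        for name in volatile:
            elem.attrib.pop(name, None)
    return ET.tostring(root, encoding="unicode")


monday = '<JOB NAME="load" EXPORT_DATE="2008-03-27"><STEP NAME="s1"/></JOB>'
tuesday = '<JOB NAME="load" EXPORT_DATE="2008-03-28"><STEP NAME="s1"/></JOB>'

# Same object exported on two days: identical once normalised.
assert normalise_export(monday) == normalise_export(tuesday)
```

The normalised text is what would be written to the working copy and committed, so the history only shows real changes to the mappings.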

          Let me know if you have any questions and I'll do my best to answer. Sorry
          to respond so late after your request.

          John Griffin
          john.g.griffin@...

        • Pritpal Sahota
          Message 4 of 6 , Apr 7, 2008
            Hi Marty,

            There are a couple of open source options:

            1. Pentaho Data Integration suite - You can export the ETL mappings as XML and use any external code repository. There are some issues, but we were able to automate it with scripting.
            2. Talend - You can export ETL mappings and store them in an external code repository.

            Commercial option:
            3. Microsoft SSIS - You can export ETL mappings and store them in an external code repository.

            I've worked extensively with these tools. Any one of these tools will work fine with an external code repository.

            I've also worked with Informatica, and I agree with the info provided by John.

            Thank You,

            Pritpal



          • Adrian Mowat
            Message 5 of 6 , Apr 8, 2008
              Hi Martin,

              Have you looked at Ab Initio?

              I don't want to start a flame war over which ETL tool is best, but Ab Initio
              does a much better job of source control than the description above.

              Furthermore, the FIT4Data project http://code.google.com/p/fit4data/ aims to
              improve test support for data management/ETL projects. We developed it for use
              on Ab Initio projects, although there is nothing to prevent it being used
              with other tools. I am working on better docs and some new features at the
              moment, and I am planning to put out a more formal release announcement when
              I am done. Since you have asked the question, though, I could run you through
              the basics in an email or, even better, a Skype call.

              Hope this helps

              Adrian



            • bill.mccrosky
              Message 6 of 6 , Apr 16, 2008
                Hi Pritpal:

                I have been looking at Pentaho. It looks like it would be very
                adaptable to agile data techniques. It is written in Java and
                developed in Eclipse. The fact that the entire product line, from ETL to
                ODS and on to the DMs, is written in Java should make END-TO-END TDD easier
                to accomplish. End-to-end is very important. Most discussions I have
                seen to date focus on the ETL portion of the chain. While this is
                in some ways the most important part of the chain (certainly the most
                complex), the business users have no visibility into ETL. They
                couldn't care less until they see the results from a DM. The TDD testing has
                to ensure that the user is getting what he needs. Pentaho seems to
                support this testing the best. What do you think?
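
To show what an end-to-end test in this sense might look like, here is a minimal sketch in Python using an in-memory SQLite database. It is not Pentaho-specific, and every name in it (`stg_transactions`, `mart_balance`, `load_mart`, `balance_report`) is made up for the example. The test seeds the staging data, runs the load, and asserts on what the business user would see in the data mart rather than on ETL internals.

```python
import sqlite3


def load_mart(conn: sqlite3.Connection) -> None:
    """Hypothetical ETL step: aggregate staged transactions into a mart table."""
    conn.execute("DROP TABLE IF EXISTS mart_balance")
    conn.execute(
        "CREATE TABLE mart_balance AS "
        "SELECT account_id, SUM(amount) AS balance "
        "FROM stg_transactions GROUP BY account_id"
    )


def balance_report(conn: sqlite3.Connection):
    """What the business user sees: one balance per account."""
    return conn.execute(
        "SELECT account_id, balance FROM mart_balance ORDER BY account_id"
    ).fetchall()


def test_end_to_end() -> None:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stg_transactions (account_id TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO stg_transactions VALUES (?, ?)",
        [("A", 100), ("A", -30), ("B", 5)],
    )
    load_mart(conn)
    # Assert on the user-facing result, not on the steps in between.
    assert balance_report(conn) == [("A", 70), ("B", 5)]


test_end_to_end()
```

Written this way, the test stays valid even if the ETL step behind `load_mart` is reimplemented, as long as the report the user sees is unchanged.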

                Bill
