
Code-switching corpora

  • Gustav Eje Henter
    Message 1 of 4, Apr 3, 2012
      Dear code-switching experts,

      My name is Gustav Henter, and I am a Ph.D. student in signal processing and
      machine learning at KTH - The Royal Institute of Technology in Stockholm,
      Sweden. Currently, I am looking for a large body of roman-alphabet text (>2
      million characters, say) with significant code-switching. I intend to use the
      data to build a character-level Markov chain model for some experiments with a
      computer algorithm I am researching.

      Do you know where such a corpus or corpora can be obtained in digital form?

      I have already experimented with Vivian de Klerk's Xhosa English corpus (thanks,
      Vivian!), but it had too low a rate of code-switching---about 1 switch for every
      500 words---to show any significant difference in my experiments.
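
The switch rate quoted above (about 1 switch per 500 words) can be estimated along the following lines. This is only a minimal sketch: it assumes that every word already carries a language tag, which is a hypothetical input format and not something the thread describes for the Xhosa English corpus. The function names `count_switches` and `words_per_switch` are likewise made up for illustration.

```python
from typing import Sequence


def count_switches(lang_tags: Sequence[str]) -> int:
    """Count positions where the language tag differs from the previous word's tag."""
    return sum(1 for prev, cur in zip(lang_tags, lang_tags[1:]) if prev != cur)


def words_per_switch(lang_tags: Sequence[str]) -> float:
    """Average number of words per language switch (infinite if there are no switches)."""
    switches = count_switches(lang_tags)
    if switches == 0:
        return float("inf")
    return len(lang_tags) / switches


if __name__ == "__main__":
    # Hypothetical tags: 499 English words followed by one Xhosa word.
    tags = ["en"] * 499 + ["xh"]
    print(count_switches(tags), words_per_switch(tags))
```

A corpus with a higher switching density would give a much smaller words-per-switch value, which is the property being asked for here.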

      Best regards,
      Gustav Henter

      ======================================================================
      Gustav Eje Henter, Ph.D. student E-mail: gustav.henter@...
      Sound and Image Processing Lab, EES, Web: http://www.ee.kth.se/sip/
      KTH - Royal Institute of Technology Phone: (+46) 8 790 7420
      Osquldas väg 10, SE-100 44 Stockholm, SWEDEN Office: A:327, floor 3
      ======================================================================
    • leonikotze
      Message 2 of 4, Apr 4, 2012
        Dear Gustav,

        I suppose literature would be your best bet. I have a play by Athol Fugard entitled 'Boesman and Lena'. It is FULL of code-switching between English and Afrikaans. I think the play might be ideal for you, as the book in which it appears has a glossary that explains the meaning of the Afrikaans terms. I will scan the play for you if you wish and send it to your private email. Kindly let me know whether you would like me to do this.

        Kindest Regards
        Leoni Kotze, South Africa

      • Anna S
        Message 3 of 4, Apr 4, 2012
          Dear Gustav,

          I recommend that you have a look at www.talkbank.org, especially http://talkbank.org/data/BilingBank/

          On that website you will find monolingual and bilingual spoken data, sometimes even with the original audio files available.

          I used the Eppler corpus for a study on the grammatical side of code-switching; it contains German-English code-switching. On the website you will also find other language pairs, some even with switching between more than two languages.

          Good luck!

          Best regards from Bremen, Germany,

          Anna (M.A. linguistics)



        • Gustav Eje Henter
          Message 4 of 4, Apr 5, 2012
            Hello Anna,

            This was a very interesting link indeed. I am currently experimenting with the
            BilingBank Blum Snow corpus, and have obtained an interesting Markov chain model
            which regularly mixes English and Hebrew.

            Thanks to everyone who has shared their advice and/or data so far!

            Best regards,
            Gustav


            PS. Just for fun, here is a (lightly edited) random sample from a
            character-level 5-gram Markov chain based on the Blum Snow data:

            eha'avir et ze anashim your fork. al ha# matay maspik. bananot ve# pit'om here's
            a twenty-five about he's probably where write lots of them tembelit. lo ta'avir
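
For readers who want to try something similar, a character-level 5-gram Markov chain of the kind mentioned above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code used for the sample above: the function names `train_char_ngram` and `sample_text` are hypothetical, the model is a plain count-based one without smoothing, and the training string in the usage example is a toy stand-in rather than the Blum Snow data.

```python
import random
from collections import Counter, defaultdict


def train_char_ngram(text, n=5):
    """Count, for every (n-1)-character context, which character follows it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        counts[context][next_char] += 1
    return counts


def sample_text(counts, length, seed=None):
    """Generate `length` characters by repeatedly sampling the next character
    in proportion to how often it followed the current context in training."""
    if not counts:
        raise ValueError("model is empty; training text was shorter than n")
    rng = random.Random(seed)
    contexts = sorted(counts)
    context = rng.choice(contexts)
    out = list(context)
    while len(out) < length:
        followers = counts.get(context)
        if not followers:
            # Dead end (context seen only at the very end of the training
            # text): restart from a random known context.
            context = rng.choice(contexts)
            out.extend(context)
            continue
        chars = list(followers)
        weights = [followers[c] for c in chars]
        next_char = rng.choices(chars, weights=weights, k=1)[0]
        out.append(next_char)
        # Slide the context window forward by one character.
        context = (context + next_char)[1:]
    return "".join(out[:length])


if __name__ == "__main__":
    # Toy stand-in text; a real experiment would read a corpus file instead.
    toy = "pass me your fork. matay maspik. here's your fork. lo maspik. " * 20
    model = train_char_ngram(toy, n=5)
    print(sample_text(model, 120, seed=1))
```

With n=5 each character depends only on the four characters before it, which is why generated text can drift from one language into the other within a few words when the training data mixes them.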
