Hi Mariam,

I believe this has been the intended behavior since commit e18f494 (https://github.com/hoffmangroup/segway/commit/e18f494d7ba061b4cdfbdc1d3b02cc726c0b947e), made 5 years ago. I think the documentation for that section is simply out of date and was never updated to match.
The justification for the change was the misleading help text at the time (https://hoffmangroup.github.io/segway-bitbucket-backup/#!/hoffmanlab/segway/issues/56/page/1?slug=split-sequences-specifies-number-of-gmtk).

I'm open to feedback on what you believe the appropriate behavior should be, or on whether I should just update the erroneous documentation.

Eric
________________________________
From: Maxwell Libbrecht
Sent: Monday, January 4, 2021 5:25 PM
To: Mariam Arab
Cc: Roberts, Eric; Maxwell Libbrecht
Subject: [External] Re: Segway split-sequences issue

Thanks, Mariam. Just to be clear, it looks like Segway is splitting windows according to bp, not frames.
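
The arithmetic behind the observed window size can be sketched as follows. This is only an illustration, assuming the split size is interpreted in base pairs (as observed in this thread) and that the window length divides evenly by the resolution. The function name frames_per_window is hypothetical and not part of Segway.

```python
def frames_per_window(split_size_bp: int, resolution_bp: int) -> int:
    """Frames in one window when the maximum window length is given in bp.

    Assumption: the split size is measured in base pairs, not frames.
    """
    return split_size_bp // resolution_bp


# Default split size of 2,000,000 at the 10 kb resolution used in this thread.
print(frames_per_window(2_000_000, 10_000))  # prints 200
```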

Cheers,
Max


On Mon, Jan 4, 2021 at 2:22 PM Mariam Arab wrote:
Hello Eric and Happy New Year! I hope you had a good holiday.

I encountered a potential bug related to split-sequences.
As per the segway docs,
> The --split-sequences=size option will split up sequences into windows with size frames each. The default size is 2,000,000.
However, this has not been my experience with Segway.

This is the command I used. Note that I am working at a 10 kb resolution.
segway train-init --resolution=10000 --num-labels=8 --num-instances=10 --segtransition-weight-scale=12.0 --include-coords=eval/include_coords.bed eval/GM12878/GM12878.genomedata GM12878_test/train

The max number of frames in my include_coords.txt file is 11559. This should be okay, since it is lower than the default size of 2,000,000.
awk '{print ($3 - $2)/10000}' include_coords.txt | sort -gr | head
11559
10446
9091
8828
8421
7942
7741
7522
7152
6846


However, during segway train-init, the windows are split to a maximum of 200 frames each. It looks like split-sequences isn't taking the resolution into account: 200 frames at 10,000 bp per frame is exactly 2,000,000 bp.
awk '{print ($3 - $2)/10000}' windows.txt | sort -gr | head
200
200
200
200
200
200
200
200
200
200

I am currently using a large split-sequences value to work around this.
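
As a rough sketch of how such a value could be chosen, assuming the option is interpreted in base pairs (as the window sizes above suggest), multiply the largest region, in frames, by the resolution. The helper name min_split_size_bp is hypothetical and not part of Segway.

```python
def min_split_size_bp(max_frames: int, resolution_bp: int) -> int:
    """Length in bp of the longest region, given its length in frames.

    Assumption: split-sequences is measured in base pairs, so a value at
    least this large would be needed to cover the longest region.
    """
    return max_frames * resolution_bp


# Longest region listed above: 11559 frames at 10 kb resolution.
print(min_split_size_bp(11559, 10_000))  # prints 115590000
```

Under the same assumption, a split-sequences value at or above this number should leave the longest region in a single window.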

Kind regards,
Mariam




--
Maxwell Libbrecht
Assistant Professor
School of Computing Science, Simon Fraser University
TASC 1, office 9219
Skype: maxlibbrecht
http://www.cs.sfu.ca/~maxwl/
Calendar: https://www2.cs.sfu.ca/~maxwl/calendar.html

